This is the beginning of a series of blog posts detailing some of the technology I’ve been playing with over the last year or so. These tools were used in the development of, primarily, two projects: Lensmob and Retrosift.
Your prototypical web application does essentially two things: take content from the user and put it in the database; take data from the database and show it to the user.
When you start dealing with more complex data flows like processing emails, resizing images or running other expensive operations, you need a data pipeline that lives outside the request-response cycle of a web app. For Lensmob, I use Gearman. The basic outline of using Gearman is:
- Send a task to the Gearman Server
- Worker process grabs the task from the Gearman Server.
- Worker process does the work, then deletes the task from the Gearman Server.
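The submit side is only a couple of lines of code. Here’s a minimal sketch using the python-gearman client; the server address, task name and payload are placeholders rather than anything from the real application.

```python
import json

import gearman

# Submit a background task so the web request doesn't block on the result.
client = gearman.GearmanClient(['localhost:4730'])
client.submit_job(
    'resize_image',                                    # illustrative task name
    json.dumps({'album_id': 42, 'photo_key': 'abc'}),  # illustrative payload
    background=True,
    wait_until_complete=False,
)
```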
As simple as this sounds, the operational realities of all the moving parts take a little more planning. This is especially true in a cloud environment such as AWS, where the failure of a single machine is commonplace. Also, when dealing with user-supplied content such as emails, expect the unexpected. Failures of your software will happen, and having the opportunity to fix your bugs and try again is very important.
Here are some of the things I’ve done to manage this complexity, keeping us sane and providing a high-quality experience for our users.
The Server
Gearman has a tendency toward quality issues in certain versions. Specifically, I’ve seen problems with persistence, which we consider very important but the greater community apparently does not. We’re currently running 0.41, which has been working great for us. We use the SQLite backing store, which gives us some reassurance if gearmand or the host machine fails. On EC2, we keep the gearmand database file on a separate EBS volume, both to isolate it and to simplify recovery if that machine dies.
The Worker
My workers are all written in Python using a standard Python library. A common gearman worker abstraction makes creating a new worker a matter of just a few lines of code; this shared code handles failures, logging and cleanup. I’m also careful to ensure each worker handles only one type of task, rather than falling into the trap of building one worker to handle them all. Balancing your resources is just too difficult if you combine workers. To avoid memory leaks or running stale code after a version change, we have our workers exit every couple of minutes. Most importantly, the worker can make decisions about what to do in the face of a failure, which leads us to…
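Here’s roughly what that abstraction looks like, as a minimal sketch assuming the python-gearman library and its after_poll hook. The lifetime constant and the names ShortLivedWorker and run_worker are my illustrations, not the actual Lensmob code.

```python
import logging
import time

import gearman

WORKER_LIFETIME = 120  # seconds; exit periodically to dodge leaks and stale code


class ShortLivedWorker(gearman.GearmanWorker):
    """GearmanWorker that stops polling once its lifetime is up."""

    def __init__(self, host_list, lifetime=WORKER_LIFETIME):
        super(ShortLivedWorker, self).__init__(host_list)
        self.deadline = time.time() + lifetime

    def after_poll(self, any_activity):
        # Returning False exits work(); process supervision restarts us fresh.
        return time.time() < self.deadline


def run_worker(task_name, handler, hosts=('localhost:4730',)):
    """Register exactly one task type and work it until the lifetime expires."""
    worker = ShortLivedWorker(list(hosts))

    def wrapped(gearman_worker, gearman_job):
        try:
            return handler(gearman_job.data)
        except Exception:
            logging.exception("Task %r failed", task_name)
            raise  # the failure handling below decides what happens to the job

    worker.register_task(task_name, wrapped)
    worker.work()
```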
The Job of Death
One of the worst things that can happen to your gearman workers is encountering a job that kills the worker. If that job is requeued as-is, it can constantly kill off all of your workers, which makes it difficult to process any normal, working jobs. Meanwhile, your error logs and notification mechanisms will be going nuts.
We handle this condition by allowing the worker some amount of smarts in the face of failure. When a failure is detected in the worker, it can:
- Requeue the job directly, incrementing a counter in the task definition. This puts the job at the end of the queue, so other, working jobs get their chance first.
- If the counter reaches some configurable value, put that task into a secondary queue we call “Gearmang” (see the sketch after this list).
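In code, that decision looks something like this. It’s a sketch rather than our actual worker code: the attempt limit, the “gearmang_failed” queue name and the JSON task format are assumptions of mine.

```python
import json

import gearman

MAX_ATTEMPTS = 3                     # configurable retry limit (illustrative)
GEARMANG_QUEUE = 'gearmang_failed'   # secondary queue name (illustrative)

client = gearman.GearmanClient(['localhost:4730'])


def handle_failure(task_name, raw_data):
    """Requeue a failed job, or hand it to Gearmang once it has failed too often."""
    task = json.loads(raw_data)
    task['attempts'] = task.get('attempts', 0) + 1

    if task['attempts'] < MAX_ATTEMPTS:
        # Back of the queue: healthy jobs get their chance first.
        client.submit_job(task_name, json.dumps(task),
                          background=True, wait_until_complete=False)
    else:
        # Too many failures: park it in the secondary queue for a human to look at.
        client.submit_job(GEARMANG_QUEUE,
                          json.dumps({'task': task_name, 'data': task}),
                          background=True, wait_until_complete=False)
```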
Gearmang is itself a worker that collects failed jobs and adds them to a second SQLite database. We then have command line tools for inspecting, requeuing or removing those tasks.
If we have a failure from a bug in the worker code, we can just fix the bug, then requeue the task. This is much preferable to your entire gearman infrastructure grinding to a halt while you try to deploy a fix.
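The collector side doesn’t need to be much more than the following sketch. The table layout and database path here are made up for illustration; the real thing lives in the repository linked at the end of this post.

```python
import json
import sqlite3
import time

import gearman

DB_PATH = '/var/lib/gearmang/failed_jobs.db'   # illustrative path


def store_failed_job(gearman_worker, gearman_job):
    """Persist a failed job so it can be inspected, requeued or removed later."""
    conn = sqlite3.connect(DB_PATH)
    with conn:
        conn.execute(
            'CREATE TABLE IF NOT EXISTS failed_jobs '
            '(id INTEGER PRIMARY KEY, task TEXT, data TEXT, failed_at REAL)')
        payload = json.loads(gearman_job.data)
        conn.execute(
            'INSERT INTO failed_jobs (task, data, failed_at) VALUES (?, ?, ?)',
            (payload['task'], json.dumps(payload['data']), time.time()))
    conn.close()
    return 'stored'


worker = gearman.GearmanWorker(['localhost:4730'])
worker.register_task('gearmang_failed', store_failed_job)
worker.work()
```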
Monitoring
The final piece of the puzzle is monitoring. As mentioned above, logging failures to a central logging system is critical to keeping track of your gearman infrastructure. In addition to failures, we use BlueOx to track our successful task processing. This allows us to add all kinds of fun analytics to our pipelines, such as task duration, the image sizes we processed, etc.
In addition to log data, we also use Collectd to monitor our task queue lengths and worker counts. Using Nagios, we can then get alerted if workers don’t appear to be running or queues go above configured thresholds. Since our worker processes are supposed to exit every few minutes, we also have a check that ensures workers haven’t been running too long. This lets us catch hangs from blocked IO. The added confidence this monitoring provides can’t be overstated.
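For the curious, gearmand’s text admin protocol makes the queue checks straightforward. Here’s a rough standalone version of that kind of check; a real Collectd or Nagios plugin would report metrics instead of printing, and the threshold is made up.

```python
import socket


def gearman_status(host='localhost', port=4730):
    """Fetch per-function queue stats via gearmand's text admin protocol.

    Returns {function_name: (jobs_in_queue, running, available_workers)}.
    """
    sock = socket.create_connection((host, port), timeout=5)
    sock.sendall(b'status\n')
    data = b''
    while not data.endswith(b'.\n'):
        chunk = sock.recv(4096)
        if not chunk:
            break
        data += chunk
    sock.close()

    stats = {}
    for line in data.decode().splitlines():
        if line == '.':
            break
        name, queued, running, workers = line.split('\t')
        stats[name] = (int(queued), int(running), int(workers))
    return stats


QUEUE_THRESHOLD = 100  # illustrative alert threshold
for task, (queued, running, workers) in gearman_status().items():
    if queued > QUEUE_THRESHOLD or workers == 0:
        print("WARNING: %s queued=%d workers=%d" % (task, queued, workers))
```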
The End?
This system isn’t perfect yet, and there is still room for more redundancy and safety. However, I feel this is a good enough effort without too much added expense, either in computing resources or labor. I hope to continue to evolve this system, whether from experience or from feedback from you, dear reader.
I’ve put a Gearmang repository up on GitHub if you want to see some of the code I’m talking about. The code there was an attempt to pull it out of my main code base into something more reusable. This project is far from done and I’m sure the code doesn’t actually run, but you can get an idea of what I’m talking about anyway.