When you run a big, mission-critical web app, you face two big challenges on the operational side:

1. Handling _failure_ -- failure is all around: servers fail, processes crash, datacenters blow up.
2. _Scaling_ as traffic increases and decreases.
A tool used by practically everybody to solve both of these challenges is the load balancer. Here's how most of them work:
In this architecture, the _load balancer_ is the single entry point for the client (often a web browser). The load balancer keeps a _registry_ of _servers_ that can handle a request. Usually this registry is kept up to date more or less manually (or via scripts). If a new server should be added to the load balancing pool, an API call is sent to the load balancer. Similarly, when a server has to be brought down (e.g. for upgrading or scaling down) it has to be removed from the pool manually.
Whenever a client request comes in, the load balancer picks a server from the pool (usually at random or [round-robin](http://en.wikipedia.org/wiki/Round-robin_scheduling)) and proxies the request to it. A response from the server is proxied back to the client through the load balancer.
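The round-robin selection described above can be sketched in a few lines. This is a minimal illustration, not a real proxy; the pool addresses are made up:

```python
from itertools import cycle

# Hypothetical server pool; the addresses are illustrative only.
POOL = ["10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"]

# Round-robin: cycle through the pool, one server per incoming request.
_next_server = cycle(POOL)

def pick_server():
    """Return the server that should handle the next request."""
    return next(_next_server)
```

A real load balancer would then open a connection to the chosen server, forward the request, and stream the response back to the client.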
This system scales nicely because the load balancer itself has to do very little: all it does is proxy requests. Load balancers like [HAProxy](http://haproxy.1wt.eu/) can handle _tons_ of requests (on the order of tens or hundreds of thousands per second) and nicely distribute them over a large number of servers.
The question remains how many servers you need to handle a certain amount of traffic. How do you know when to scale up or down? There are a few things you can look at, for instance response times, CPU load, or memory usage of the servers. It does remain a tricky problem, though.
Effectively every big site uses these types of load balancers (be it in software or hardware). They are so prevalent that most cloud providers offer pre-built appliances for them, such as Amazon's [ELBs](http://aws.amazon.com/elasticloadbalancing/) and Rackspace [Cloud Load Balancers](http://www.rackspace.com/cloud/public/loadbalancers/).
While load balancers are great at helping a site scale, handling failure is a bit more tricky. What if one of the servers registered with a load balancer fails? Requests sent there will simply never return a result. To deal with this problem, most load balancers periodically poll all the servers in the pool and eject servers that repeatedly don't respond to requests.
Ejecting a failing server is not instantaneous, however. A load balancer usually retries a server a few times to make sure it's really down before ejecting it. Requests that were proxied there while the load balancer had not detected failure will likely just fail.
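The ejection logic just described can be sketched as follows. The retry threshold is an assumption here; real load balancers make it configurable:

```python
MAX_FAILURES = 3  # assumed threshold: eject after this many consecutive failed checks

class Pool:
    """Minimal sketch of a health-checked server pool."""

    def __init__(self, servers):
        self.servers = set(servers)
        self.failures = {s: 0 for s in servers}

    def record_check(self, server, healthy):
        """Record one health-check result for a server."""
        if healthy:
            self.failures[server] = 0  # a good response resets the counter
            return
        self.failures[server] += 1
        if self.failures[server] >= MAX_FAILURES:
            # Only now is the server ejected; requests proxied to it
            # during the preceding checks may already have failed.
            self.servers.discard(server)
```

The window between the first failed check and the ejection is exactly where the failed client requests come from.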
So, we can identify three problems/inconveniences with the current state of practice in load balancing:
1. Management of the server registry is manual: we have to add and remove servers from the pool ourselves (or via scripts).
2. It can be tricky to know whether you need to scale your system up or down.
3. Failure detection is based on polling and takes time.
**An alternative approach** So, here's another approach -- an approach based on good old [message queues](http://en.wikipedia.org/wiki/Message_queue). Schematically this architecture looks as follows:
In this architecture the model is slightly different. The flow is as follows:
1. The client sends a request to the load balancer.
2. Instead of proxying it to a server directly, the load balancer puts the request on a request message queue.
3. One or more servers subscribe to this request queue, and one of them pulls in the message containing the request.
4. After the request is handled and a response produced, the server puts the response on a response message queue.
5. The load balancer, subscribed to the response queue, forwards the response back to the client.
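The flow above can be sketched with in-process queues standing in for the real message broker (a production setup would use RabbitMQ, ZeroMQ, or similar; the request format and "handling" here are made up for illustration):

```python
import queue

# In-process stand-ins for the two message queues in the diagram.
request_q = queue.Queue()
response_q = queue.Queue()

def balancer_accept(client_request):
    # Step 2: the balancer enqueues the request instead of proxying it.
    request_q.put(client_request)

def worker_step():
    # Steps 3-4: a server pulls one request, handles it, and enqueues
    # the response. "Handling" is just uppercasing the body here.
    req = request_q.get()
    response_q.put({"for": req["id"], "body": req["body"].upper()})

def balancer_respond():
    # Step 5: the balancer forwards the response back to the client.
    return response_q.get()
```

In a real deployment the balancer and the workers are separate processes on separate machines, connected only through the broker.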
So, why is this a more attractive approach than traditional load balancing?
From the server perspective it's a pull model. When a server boots up, it subscribes to the request queue, so there's no manual registry that has to be kept up to date. The server _itself_ decides whether it has the capacity to handle a request or not. It will only pull in a request message when:
1. It's up (duh).
2. It has the capacity to do so (it is not overloaded).
3. It's not hiccuping.
Compare this to the traditional load balancing approach where requests are pushed to servers whether or not they're capable of handling them.
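The pull-model backpressure can be sketched like this. The capacity limit is an assumed, made-up number; the point is that an overloaded server simply stops pulling and the work stays on the queue for other servers:

```python
import queue

def maybe_pull(request_q, in_flight, capacity=4):
    """Pull one request only if this server has spare capacity.

    in_flight: how many requests this server is currently handling.
    capacity:  assumed per-server limit (illustrative value).
    """
    if in_flight >= capacity:
        return None  # overloaded: leave the work on the queue
    try:
        return request_q.get_nowait()
    except queue.Empty:
        return None  # nothing to do right now
```

Contrast this with a push-based balancer, which has no direct way of knowing whether the server it picked is already saturated.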
The message queue-based system handles failure beautifully as well. If a server crashes while handling a request, the message queue detects the lost connection (the message was never acknowledged) and redelivers the request message to another server: no lost requests.
It's also easier to detect whether you need to scale your system up or not. When the request message queue keeps growing instead of staying steady, your servers are clearly having a tough time handling all requests. It's trivial to add a message queue watcher that scales your system accordingly: if the queue goes over a certain threshold, you spin up some extra servers to handle the load. Similarly, if the request message queue is constantly empty, it may be possible to kill one or more servers.
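The watcher's decision rule might look like this. The thresholds, and the idea of requiring several consecutive empty checks before scaling down, are assumptions for illustration:

```python
HIGH_WATER = 100  # assumed queue depth that triggers scaling up
EMPTY_STREAK = 10  # assumed number of consecutive empty checks before scaling down

def scaling_decision(queue_depth, empty_checks_in_a_row):
    """Decide what the autoscaler should do, given the queue's state."""
    if queue_depth > HIGH_WATER:
        return "scale_up"    # the servers can't keep up
    if queue_depth == 0 and empty_checks_in_a_row >= EMPTY_STREAK:
        return "scale_down"  # persistently idle: kill a server
    return "hold"
```

The watcher would run this periodically against the broker's queue-depth metric and call the cloud provider's API to add or remove servers.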
Message-queue based load balancers bring some other advantages to the table:
* Message queues usually support topic filtering, enabling flexible ways of routing requests: the server itself can decide to only handle requests to certain paths or subdomains, for instance.
* Decoupling of request logging and request analytics (e.g. response times) from regular request handling becomes much easier by multicasting request and response messages to multiple queues (one for regular request handling, one for logging, one for analytics, etc.)
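Topic filtering boils down to prefix or pattern matching on a routing key, much like RabbitMQ's routing keys or ZeroMQ's subscription prefixes. A minimal sketch of a server deciding which requests it will handle (the path prefixes are made up):

```python
def matches(subscriptions, topic):
    """Return True if this server subscribes to the given topic.

    subscriptions: path prefixes the server has declared it handles.
    topic:         routing key of an incoming request, e.g. its path.
    """
    return any(topic.startswith(prefix) for prefix in subscriptions)
```

A broker applies exactly this kind of check to deliver each message only to the queues whose subscriptions match.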
At this point you'll be asking: "So, what's the catch? There must be something wrong with message queue based systems, right?"
Everything comes at a price.
As you can tell from the two diagrams I sketched, the message queue-based model is more complex. Instead of having to worry about just a load balancer and servers, you now have to scale and fail over a message queueing system as well. Clearly, this is more complicated, but not impossible. Let's not forget message queues have been around for decades. Scaling and failover strategies for message queues are well known, and you don't have to build them yourself: there are some very solid implementations freely available, for instance [RabbitMQ](http://www.rabbitmq.com) or the more low-level [ZeroMQ](http://www.zeromq.org).
Another drawback is increased request overhead and latency due to the extra level of indirection. It's hard to say how much latency will result -- it depends on the implementation and message queue used.
**Mongrel2** All excited and want to start playing with a message queue-based web server today? Have a look at [Mongrel2](http://mongrel2.org) -- it's the only (open) implementation of this idea that I know of right now; if you know of others, [please let me know](mailto:firstname.lastname@example.org).
Mongrel2 uses ZeroMQ as its message queue and seems pretty damn fast. Mongrel2 is completely language agnostic due to the message queue-based decoupling. From the looks of it, it's a pretty awesome system -- definitely worth a look.