The Share-Nothing Architecture

Although I used to think otherwise, PHP scales very well and probably easier than other platforms like Java, ASP.NET and Perl (using mod_perl) and I’ll tell you why. But first I’ll tell what scalability is and how it’s different from speed, as they’re often mixed up.

People tend to say that because their product is really fast, it scales well. That’s wrong. Scalability is “the capability of a system to increase performance under an increased load when resources (typically hardware) are added.” (source: Wikipedia) In other words, will you be able to handle the slashdot load when you add another processor and double the memory? Will you be able to handle roughly twice as many concurrent requests when you set up a second server? Will your application even work when you duplicate your servers? So even if your application’s pages take 2 seconds to load, under no load, and your competitor’s load in half a second, that does not imply that your competitor’s scales better.

When I chose Java to develop KeyTopic in, I did so because I could finally do some interesting things with caching data. The cool thing about servlet containers, where Java web applications live in, is that it’s very possible to keep objects persistent in memory through requests. So it’s possible to have, say, a in-memory counter that is increased by 1 on every request. How fast and efficient is that? ASP.NET and mod_perl (and probably others) have similar features. Using this techniques I could just cache the data of all categories and boards in memory, so I could lower the amount of queries needed to the database. There’s one problem with this though and you may guess what it is… Yes, indeed, it’s scalability.

A big scalability problem with caching data is called the cache-coherence problem. What if one server can’t handle the traffic anymore and a second one is added. This can simply be done by putting a so-called load balancer as main entry. The load balancer will spread the requests over all servers that are available at that time. This way the load is balanced.

Let’s look at the ordinary request flow in a one-server situation:

Now add another two servers and put a load balancer in front:

Each server has its own cache. That’s fine if you only got one server, but if you got two or more, things can get nasty. When somebody posts at the first load-balanced server and the number of posts on a board is increased only the first server’s cache knows about this. All people whose requests are handled by the second and third server still get sent the old post count: The caches are not equal, they’re not coherent. In this scenario this is not disasterous and could be solved by a simple cache flush every, say, minute (or more, depending on the traffic), yet it can cause serious problems in many other occassions. For example, imagine a networking file system. It would be very efficient to cache some files locally that are accessed a lot, but what if somebody else on the network changes those files? You would still be using or editting the old ones. This can only work well with systems where files are only read, not written, or at least not often. What seemed to be a performance gain, appears to become a major scalability issue when used inappropriately.

The solution is the shared-nothing architecture, which I first heard of in an interview with Rasmus Lerdorf (the initial creator of PHP). What does this architecture involve? Simply not sharing anything. No shared data, at least within the webserver. Sessions are shared, but are shared through the filesystem or database, which can easily be scaled by using a networked filesystem (NFS, SMB) or by using a database. As no cache or whatsoever exists in a server you can copy it virtually as much as you like. At a certain moment the database server might become a bottleneck of course, but databases usually scale very well. MySQL might not, but Oracle scales to unimaginable proportions, so as long as you have enough money, you can scale as far as you want (virtually).

So, PHP scales better than Java, ASP.NET and mod_perl? No, not necessarily, but PHP scales by default. PHP allows no sharing of data, no caching within the process (unless you explicitly enable those features). Java, ASP.NET and mod_perl do allow sharing of data and so you have to make sure that you know when you use it and how to handle it in your situation. In other words: It’s easy to write non-scalable software on those platforms.