Distributed Systems Week part I: What Are They?

Distributed systems, you hear a lot about them, but what are they? Why would you need them? Why do they matter? What kinds are there? How do they work? That’s what I’ll try to give you an overview of this week. In this first part I’ll talk about what distributed systems are and why you need them. Subjects I’ll cover in the coming days are the client-server model, peer to peer model, service oriented architecture and distributed computing.

*What they are*
Distributed systems are, as the name suggests, systems that are distributed over more than one machine. The amount of machines differs highly, in your LAN it might be just two, but it might as well just be thousands or many millions (because the Internet is a huge distributed system too).

Do you share folders with other computers on your LAN? Then you got a distributed system already. File servers are the most common kind of distributed systems. They’re particulary useful if you have a company with many PCs and want your employees to login into any of them and access their files. You can’t do that without a distributed system. How this is a distributed system? Because there’s one central place where files are stored and the applications that use those files run on another machine, the client machine. Each part of a distributed system has its own job. In this case the file server has the job of responding to clients requesting files and clients have the job of doing running applications that might need files from the file server.

That’s not so shocking, is it? There are much more distributed systems today, for many different reasons. One reason to distribute your system is separation of concerns. It’s like with object oriented programming. You create different objects that are more or less autonomous. Those objects all represent a particular part of your system and communicate with with other objects that have their own tasks. This way your system becomes managable. If something is wrong, you know where to look, as you know which part of your system is assigned a particular task.

*Concerns*
For distributed systems there are other reasons to be distributed aswell. The amount of resources available for example. If you have to store an enormous amount of data, think many terrabytes (1 terrabyte is 1024 gigabytes), you have to distribute them over multiple machines. You can’t fit that many disks in one machine. If you have to store petabytes (1 petabyte is 1024 terrabytes), like Google might have to, you really have no choice. As a matter of fact, it might be more cost efficient to use multiple machines with smaller disks than a factor two less with disks of double size (which are much more expensive). Another resource matter is CPU power. If a lot of calculations have to be done you can save time by letting multiple processors do the work (although this is not as easy as it may sound). Or bandwidth; if you run a download site the bandwidth costs can rise to extreme amounts. If you use peer to peer software like BitTorrent, you can distribute the bandwidth load.

Fault tolerance is another reason to distribute your system. What if your harddisk breaks down? And even if you use techniques such as RAID, what if your computer explodes, or your computer cabin burns up?

There’s a very interesting paper available online “about the Google File system”:http://www.cs.rochester.edu/sosp2003/papers/p125-ghemawat.pdf. It shows what challenges you face when developing a distributed system of Google’s magnitude:

[…] component failures are the norm rather than the exception. The file system consists of hundreds or even thousands of storage machines built from inexpensive commodity parts and is accessed by a comparable number of client machines. The quantity and quality of the components virtually guarantee that some are not functional at any given time and some will not recover from their current failures. We have seen problems caused by application bugs, operating system bugs, human errors, and the failures of disks, memory, connectors, networking, and power supplies. Therefore, constant monitoring, error detection, fault tolerance, and automatic recovery must be integral to the system.

Google uses thousands of servers to store all their data and handle the searches quickly. If I remember correctly there are three copies of each data file distributed over the servers. Each on a different rack, physically moved away as far from each other as possible. If some of the servers stops working or blow up, there are still copies of the data files available. Google’s system is even intelligent enough to automatically put another copy of all of the lost server’s data files on the other machines, so that, again, there are three copies.

In the next part I’ll talk about one particular model for distributed systems: the client-server model.