Distributed Systems Week part III: The Peer To Peer Model

Peer To Peer (P2P) networks are hot these days. The most early famous example of a P2P network is Napster, but Napster still had a client-server vibe to it. Gnutella was software that did the same thing, but fully decentralized, so without any central server; its problem was performance. Kazaa made another attempt and developed a way to make decentralized P2P networks perform better. BitTorrent is another way of distributing files using P2P technology that’s quickly gaining momentum at the moment. In this part of the Distributed Systems series I’ll try to explain how these P2P networks work.

First there was Napster. When you logged in using your Napster client you’d send a list of all MP3 files you owned to the central Napster server. When you wanted to search, you sent a search query to the Napster server and it would return hits from its MP3-file database. This was all done using the traditional client-server model. The peer to peer part started when you downloaded a file. The idea of peer to peer networks is that everyone connected to the network is both a server and a client (in Gnutella terms this is called a servent). There’s no hierarchy, everybody’s equal; we’re all peers. The peers in a network are also called nodes. After sending a search request to the Napster server, the server would return a list of nodes (in the form of IPs) that had this file. If you wanted to download such a file you’d connect to the node in the network with that particular IP and would request the file. It’s as simple as that.

The problem with Napster was the need for a central server; a central server that can be controlled and shut down. And that’s what happened with Napster, they shut it down. Somebody from Nullsoft (most well-known from Winamp) developed Gnutella, which was fully decentralized P2P file sharing software. However, as Nullsoft was taken over by AOL, they had to take the Gnutella software offline. The protocol that it used was reverse engineered by other people and is now used by many applications like “BearShare”:http://www.bearshare.com and “Shareaza”:http://www.shareaza.com.

The “Gnutella protocol”:http://www9.limewire.com/developer/gnutella_protocol_0.4.pdf is not very complicated. In order to connect to a gnutella network you need the IP of one or more nodes that are currently connected. Many of the gnutella clients come with a list of IPs of nodes that are usually online. Once you’re connected to a node you start receiving messages that you’re supposed to respond to and pass on to your neighbour nodes. You can also send messages yourself. In Gnutella there are five kinds of messages:
# *Ping*, you send this message to discover nodes in the network. Every node in the network that receives such a ping message is supposed to respond to it using a pong message. These pong messages are routed back to the node sending the original ping message but can be read by nodes that it is passed by. By intercepting pong messages you can add more node IPs to your list of IPs to connect to. To make the network stronger you are encouraged to connect to more than one node.
# *Pong*, this is the message that you reply to a ping message with, to make yourself known.
# *Query*, if you want to query for something, for example search for a file, you send a query message. Each node is supposed to respond to this message with a QueryHit message (if it has matching files) and pass on the query message to its neighbour nodes.
# *QueryHit*, the message to send back to the sender of a query when you got matching files.
# *Push*, this message initiates a connection with a firewalled node to make file downloads from firewalled nodes possible.

To stop messages from being passed through the network forever they contain a TTL (Time To Live) tag. On every node the messages passes (called a “hop”) this number is decreased by one. When the number is zero the message is discarted. Let’s assume you choose 3 as the TTL for your message. The message flow would look something like this (the initial message is sent by the node at the top left, red arrows show the message flow, lines mean connections between nodes):


With a TTL of 3 some nodes aren’t reached. How many nodes are reached greatly depends on how the network is connected. Much research is done on how to optimize networks to pass on these messages as efficiently as possible.

The trouble with the Gnutella network is that it takes a lot of time for your message to pass through the network. When you search for some file it can take ages before you get useful response. Another problem is that every message has to (ideally) be sent to every single node. Therefore the network is quite bandwidth-intensive. That’s where Kazaa jumped in.

Kazaa works similary to Gnutella. There’s one major difference: the concept of supernodes. Some peers are more equal than others, so to speak. Supernodes usually are nodes in the network that have more resources at their disposal (bandwidth, memory, harddisk space). They function as a central node in a segment of the network. They cache the list of files of the nodes connected to them. If a node wants to search for a file, they query the supernode which can give responses that represent their segment of the network. They will then pass on the request to to their neighbour supernodes which will respond representing their segment, and so on. This is a much more scalable and efficient method than the Gnutella one (the red nodes in the image below are supernodes):

Supernodes in P2P

Which become supernodes is decided dynamically. Many factors can influence who becomes a supernode. The bandwidth of the node, the amount of free memory or diskspace or the structure of the network.

At this moment BitTorrent is becoming more and more popular. Not only because it’s yet another way to distribute your illegal movies efficiently, but it has proven to be very useful for legitimate purposes aswell. Many Linux distributions are distributed using BitTorrent today. Using BitTorrent to distribute big files saves a huge load of bandwidth on the FTP server that would otherwise be used to serve these files.

BitTorrent works similary to Napster. For each “torrent” there’s a server that tracks who currently has this file and who’s downloading it. When you want to download a file through BitTorrent you go to a website that offers a .torrent file for it. This .torrent file contains the URL of the tracker and checksums for the different parts and pieces of the file you’re about to download. These checksums can be used to see if the part you downloaded came through correctly. Your BitTorrent client will then connect to the tracker asking who’s currently offering (parts of) the file. Your client will then contact a couple of those nodes in the network and start downloading parts. At the same time other nodes will contact you and download parts from you that you have already downloaded from others. This works OK as long as there are nodes in the network that have a full copy of the file. These nodes are called seeds. Lack of seeds is the most common problem with BitTorrent, if there’s no seed, and there’s no distributed full copy of the file, you’ll end up with only a part of your file. Therefore BitTorrent encourages to keep your client running after you fully downloaded the file.

*The firewall problem*
There’s one problem with P2P networks: the firewall problem. I “wrote about this before”:http://www.zefhemel.com/archives/2004/09/21/skype-and-firewalls, and if you have already read that article you can safely skip this section. Just to keep this series self-containing I’ll quote the relevant piece here again.

What is the firewall problem? Many firewalls, by default, only allow outgoing connections. This means that connections to servers can only established by you and not the other way around. If another user or server tries to contact you, your firewall will block it. That’s unless you open up certain ports. If you’re using a router and connect to the internet using NAT, it’s even harder to make incoming connections possible. You’d have to tell the router to map a certain port to a certain PC. This requires router configuration. Also, that port on the router will always point to a certain PC, so other if you want to use the same applications on more PCs in the network, you will need to use different ports. It is obvious that you can’t ask the general computer user to configure their firewalls or routers in order to allow incoming connections.

Why do you need incoming connections so bad? In peer to peer networks another node wants establish a connection to you (another node in the network), or the other way around. That won’t be a problem if at least one of the two is not behind a firewall. The other person could just connect to the firewall-less user. But what if everybody had a firewall? Nobody could connect to anybody, without firewall or router configuration. That’s the firewall problem. So far I haven’t heard of a good solution for this problem. As long as there’s a couple of nodes in the network that allow incoming connections, you could route traffic through those, but depending on the amount of traffic, that’s not a very friendly solution.

In the next part of this series I’ll talk about the service-oriented model.