Peeking Into Google

Over the past few years we have found out, bit by bit, how Google works internally. We know about the “Google File System”:http://www.zefhemel.com/archives/2003/09/28/google-file-system, for example. We know Google uses thousands of el-cheapo servers to serve us. We also know that Google’s software was built to deal with failure; if you run thousands of servers, stuff breaks all the time.

We also found out that Google is looking for additional uses for its super “we store the internet in memory” system: applications such as “Gmail”:http://www.zefhemel.com/archives/2004/06/26/the-gmail-experience, which offers 1GB of storage to its users. That’s a lot, but apparently, because Google buys its cheap hardware in huge amounts, a gig of storage costs it around $1.

“Internetnews.com”:http://www.internetnews.com/xSP/article.php/3487041 now features an article which tells us a bit more about how Google works. When people start using Linux, for example, many of them would like to know what the big guys use. If it’s good enough for them, it must be good enough for you, right? Well, Google uses a stripped-down version of Red Hat:

All machines run on a stripped-down Linux kernel. The distribution is Red Hat, but Hoelzle said Google doesn’t use much of the distro. Moreover, Google has created its own patches for things that haven’t been fixed in the original kernel.

So, if you ever intend to run a thousand-server farm, you now know what this big guy uses. Not that it would help much, because the average-joe server farmer will know enough about operating systems to make his or her own educated choice. Nonetheless, Google uses Red Hat. Hurrah.

Internetnews.com also tells us a little about how the programming for such a huge distributed system works:

To enable Google programmers to write applications to run in parallel on 1,000 machines, engineers created the Map/Reduce Framework in 2004.

“The Map/Reduce Framework provides automatic and efficient parallelization and distribution,” Hoelzle said. “It’s fault tolerant and it does the I/O scheduling, being a little bit smart about where the data lives.”

Programmers write two simple functions, map and reduce, to create a long list of key/value pairs. Then, the mapping function produces other key/value pairs. “You just map one pair to another pair,” he said.

For example, if an application is needed to count URLs on one host, the programmer would take the URL and the contents and map them into the pair consisting of hostname and 1. “This produces an intermediate set of key/value pairs with different values.”

Next, a reduction operation takes all the outputs that have the same key and combines them to produce a single output.

“Map/Reduce is simplified large-scale data processing,” Hoelzle said, “a very simple abstraction that makes it possible to write programs that run over these terabytes of data with little effort.”
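To make the URL-counting example a bit more concrete, here is a minimal single-machine sketch in Python. It is not Google's framework or API: the names @map_url@, @reduce_counts@ and @run_map_reduce@ are made up for illustration, and the grouping step that a real system would distribute over many machines is just a dictionary here.

```python
from collections import defaultdict
from urllib.parse import urlparse


def map_url(url, contents):
    """Map step (hypothetical name): emit (hostname, 1) for one page."""
    yield urlparse(url).hostname, 1


def reduce_counts(hostname, counts):
    """Reduce step (hypothetical name): combine all values sharing a key."""
    return hostname, sum(counts)


def run_map_reduce(pages):
    """Toy driver: map every page, group by key, then reduce each group."""
    grouped = defaultdict(list)
    for url, contents in pages:
        for key, value in map_url(url, contents):
            grouped[key].append(value)  # intermediate key/value pairs
    return dict(reduce_counts(key, values) for key, values in grouped.items())


# Example input: (URL, page contents) pairs; the contents are unused here.
pages = [
    ("http://example.com/a", "..."),
    ("http://example.com/b", "..."),
    ("http://example.org/", "..."),
]
print(run_map_reduce(pages))  # {'example.com': 2, 'example.org': 1}
```

The programmer only writes the two small functions; everything in @run_map_reduce@ is the part a framework would take over, including spreading the work over machines and coping with failures.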

The third homegrown application is Google’s Global Work Queue, which is for scheduling.

Global Work Queue works like old-time batch processing. It schedules queries into batch jobs and places them on pools of machines. The setup is optimized for running random computations over tons of data.

“Mostly, you want to split the huge task into lots of small chunks, which provides even load balancing across machines,” Hoelzle said. The idea is to have more tasks than machines so machines are never idle.
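The "more tasks than machines" idea can be sketched in a few lines of Python as well. This is only an illustration under loud assumptions: worker threads stand in for machines, @process_chunk@ is a placeholder computation, and the names and the figure of eight chunks per machine are invented here, not taken from the article.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, chunk_size):
    """Cut one big job into small, independent chunks."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


def process_chunk(chunk):
    """Placeholder for the real per-chunk computation."""
    return sum(chunk)


def run_batch(items, machines=4, chunks_per_machine=8):
    """Run the job with many more chunks than workers (assumed ratio)."""
    target_chunks = machines * chunks_per_machine
    chunk_size = max(1, -(-len(items) // target_chunks))  # ceiling division
    chunks = split_into_chunks(items, chunk_size)
    # Each worker picks up the next chunk as soon as it finishes one,
    # so no worker sits idle while chunks remain.
    with ThreadPoolExecutor(max_workers=machines) as pool:
        partial_results = list(pool.map(process_chunk, chunks))
    return len(chunks), sum(partial_results)


n_chunks, total = run_batch(list(range(1000)))
print(n_chunks, total)  # 32 499500
```

With 32 chunks for 4 workers, a slow chunk only delays one worker for a short while; with exactly 4 big chunks, the slowest one would hold up the whole job.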

Read more about the Google internals in this “Internetnews.com article: Peeking Into Google”:http://www.internetnews.com/xSP/article.php/3487041.