Sittin’ on the CouchDB

For the past few days I’ve been dabbling with CouchDB. Trying to figure out what it can do and how it’s different than traditional relational databases. According to the site:

CouchDB is designed for document-oriented applications. A typical real-world document oriented activity, if it weren’t computerized, would consist mostly of physical paper documents. These documents would need to get sent around, edited, photocopied, approved, denied, pinned to the wall, filed away, buried in soft peat for six months, etc. They could be simple yellow sticky notes or 10,000 page legal documents. Not all document-oriented applications have real world counterparts.

Some examples of document-oriented applications:

  • CRM
  • Contact Address/Phone Book
  • Forum/Discussion
  • Bug Tracking
  • Document Collaboration/Wiki
  • Customer Call Tracking
  • Expense Reporting
  • To-Dos
  • Time Sheets
  • E-mail
  • Help/Reference Desk

Looking at this list I’m like, what application is not document-oriented? It seems that the applications I use: address books, email, blogs, twitter, calendar and so on are all document-oriented applications. So, I decided to look a bit deeper. What makes CouchDB different than, say, MySQL? This presentation gave me the best answer to that question:


SQLCouchDB
Predefined, explicit schemaDynamic, implicit schema
Uniform tables of dataCollection of named documents with varying structure
Normalized. Objects spread across tables. Duplication reduced.Denormalized. Docs usually self contained. Data often duplicated.
Must know schema to read/write a complete objectMust know only document name
Dynamic queries of static schemasStatic queries of dynamic schemas

CouchDB seems to tell you: forget everything you learned about database design and be pragmatic. Don’t normalize — aggregate, don’t plan ahead — evolve. To play around with these ideas I decided to port my blog to CouchDB. It’s not really done yet, but I moved over most of the data and have a basic index page view now. Let me tell you how the approach I took with CouchDB differs from the one taken for SQL databases in WordPress.

Unlike MySQL and other SQL databases, you access CouchDB through a web service API, a RESTful service API in fact. The protocol is extremely simple and that’s also why there are client “libraries” in about every language imaginable. It also comes with a convenient browser admin interface that allows you to create new databases, create documents, edit them, remove them and so forth. This is what my blog database looks like (click for a larger version):



As you can see every document has a Document ID, this is similar to a primary key in SQL. When you click on a document you will see the content of that document (click for a larger version):



As you can see a documents consists of a number fields with associated values. What field names you use is entirely up to you, but there are a couple that have a special meaning. Every document has at least two fields: “_id” (which contains the document’s ID) and “_rev” (which contains the document’s revision number). Indeed, CouchDB keeps old revisions of all your documents, which is really cool and useful. It means that it’s almost trivial to produce a revision history of your blog posts, or if you implement a wiki system with CouchDB — a revision history of pages is very simple to obtain.

A field’s value can be of any JSON type, so: a number, a string, a list or a hash table.

In my blog application I defined a number of other fields, they are:
People who ever designed a database schema for a relational database will look at the tags and comments fields and think: what the? In SQL databases this is absolutely not-done, partly because you don’t usually have list types, let alone lists of hashtable types in SQL databases, but also because it’s not normalized and very hard to query.

Now I will admit, this denormalization has its problems. For instance, if I change the name of a tag I would have to run through every single post and change its name — very inefficient, whereas in a SQL database there would be a “tag” table with a tag ID and tag name, and I would simply change the name. Also, it’s not very efficient to store data in this manner, because you have data duplication all over the place. However, querying this nested data is not a problem, because CouchDB has views.

A view in CouchDB is yet another document adhering to a couple of conventions, first of all, its Document ID should start with “_design/” and the document should have a “views” field with a hashmap that maps view names to Javascript functions (click for larger version):



Now let’s have a look at the two views that are defined here. The first one is the simplest one: “latest_posts”. Here is the Javascript code:

function(doc) {
  if(doc.type == 'post') {
    map(doc.date, {'title': doc.title, 'author': doc.author,
                   'content': doc.content, 'tags': doc.tags,
                   'comment_count': doc.comments.length})
  }
}


The idea with views is simple. You provide CouchDB with a function in some language (Javascript out of the box, but other languages can easily be supported, there are means to write those functions in Python for instance). This function takes one argument: the document. The function decides whether or not this document will be in the view, and if so what the key will be and what shall be the contents of the entry for this document. They key can be used to sort and be filtered on. The content can be any JSON type, but typically it’s a hash map (as in this case). The function to call to tell CouchDB what to put in the view is called map. Now, if you have done some functional programming this will seem odd to you, because typically map is a function that applies some function to a list of values. In this context however map refers to acting as the map bit of the map/reduce algorithm using which CouchDB is implemented. Personally I’m not a big fan of map as a name, as users don’t really care about map/reduce, they care about what map does. In the Python support this function is using Python’s yield keyword, which, in my view, is more descriptive of what it does.

Now you might wonder, why use something like map or yield, why not simply use return? Well, the interesting thing is that this view function doesn’t have to return 0 or 1 view entry, it can return any number of view entries. In the “latest_posts” example this didn’t make much sense. But let’s have a look at the view function for “latest_comments”. As you will remember comments are not separate documents in my blog model, they are contained in a field of a post document. Now how would you retrieve a list of the latest comments? Obviously the answer is a view, and this is what it looks like:

function(doc) {
  if(doc.type == 'post') {
    for(var i = 0; i < doc.comments.length; i++) {
      var comment = doc.comments[i];
      map(comment.date, {'post_title': doc.title, 'post_id': doc._id,
                         'author': comment.author,
                         'content': comment.content});
    }
  }
}


What happens here is that for every comment in every post document, map is called returning information about the comment and using comment.date as the key so that it is possible to sort based on that.

Using these two views you already implemented quite a bit of the application. To create a post, simply add a document. For the front page simply use the “latest_posts” view, for the individual post page simply retrieve the document, which contains all the information you need. If you want a list of latest comments in the sidebar, you can use “latest_comments” view. Put a simple frontend to it, and there you go, a CouchDB powered blog (click for bigger version):