WebFS: the Case for RDF

One of the most powerful parts of WebFS, as I see it, will be its metadata capabilities. Metadata is data describing data. For music, metadata can be the artist, song title, album, ratings and so on. For Word documents it can be the author, date written, number of pages, number of words and so on. Traditionally, file systems support only a fixed set of this kind of data. In many operating systems these are file name, file size, last modification time, creation time, owner and permissions. Recently, however, file systems have been getting richer in the kinds of metadata they can store. This is driven mainly by the recent rise of the desktop search engine. Desktop search engines index the full text of Word documents, PowerPoint presentations and e-mails, but they also index each metadata field individually. Search results improve as the metadata of files improves. For example, it becomes possible to search for all songs written by Madonna between 1998 and 2002 that are longer than four minutes, a query that is impossible without proper metadata indexing.
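To make that last query concrete, here is a minimal Python sketch of what a search over indexed metadata might look like. The `Song` record, its field names and the library entries are all made up for illustration; they are not part of WebFS or of any real desktop search engine.

```python
from dataclasses import dataclass


@dataclass
class Song:
    # Hypothetical metadata fields, chosen only for this illustration.
    title: str
    artist: str
    year: int            # year the song was written
    length_seconds: int  # playing time in seconds


def find_songs(songs, artist, first_year, last_year, min_seconds):
    """Return songs by `artist` written within [first_year, last_year]
    whose playing time is strictly longer than `min_seconds`."""
    return [
        s for s in songs
        if s.artist == artist
        and first_year <= s.year <= last_year
        and s.length_seconds > min_seconds
    ]


# Placeholder entries, not real discography data.
library = [
    Song("Placeholder A", "Madonna", 1998, 250),
    Song("Placeholder B", "Madonna", 2000, 200),  # shorter than four minutes
    Song("Placeholder C", "Madonna", 2005, 300),  # outside the year range
    Song("Placeholder D", "Oasis", 1999, 292),    # different artist
]

matches = find_songs(library, "Madonna", 1998, 2002, 4 * 60)
print([s.title for s in matches])
```

The point is only that each condition in the query maps to one metadata field. Without those fields stored per file, there is nothing to filter on.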

However, how does one represent metadata properly? A simple way is to use name/value pairs. For example:

title: What's the Story
author: Oasis
bitrate: 192
length: 4:52

However, this has some issues. If applications want to create their own sets of metadata to associate with files, you can run into namespace issues. One application may use a metadata attribute called "length" for the number of pages, another for the length of a song in minutes. An XML format, which has support for multiple namespaces, would therefore be a better solution. One XML-based format in particular was designed for representing metadata and has proved very apt at it: RDF, the Resource Description Framework. RDF was developed as a foundation for the Semantic Web, an effort to represent data on the web in such a way that computers can understand it better and reason with it. Whether you are a Semantic Web believer or not, RDF is a pretty nice format for describing resources. Resources are identified by URIs. Essentially, an RDF document is nothing more than a list of triples, (resource, property, value), commonly serialized as XML.

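<?xml version="1.0" encoding="utf-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
   <rdf:Description rdf:about="http://www.zefhemel.com">
      <dc:title>Zef Hemel</dc:title>
      <dc:creator>Zef's mom and dad</dc:creator>
   </rdf:Description>
</rdf:RDF>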
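The triple model, and the way namespaces avoid the "length" clash, can be sketched in a few lines of Python. This is only an illustration of the idea, not an RDF library: the Dublin Core namespace URI is the real one, but the two example.org namespaces, the item URIs and the helper functions are hypothetical.

```python
# The Dublin Core namespace is real; the two example.org namespaces are
# hypothetical and exist only for this illustration.
DC = "http://purl.org/dc/elements/1.1/"
MUSIC = "http://example.org/music#"
DOCUMENT = "http://example.org/document#"

# An RDF graph reduced to its essence: a list of
# (resource, property, value) triples.
triples = [
    ("http://www.zefhemel.com", DC + "title", "Zef Hemel"),
    ("http://example.org/items/song.mp3", MUSIC + "length", "4:52"),
    ("http://example.org/items/report.doc", DOCUMENT + "length", "12"),
]


def values(triples, resource, prop):
    """Return every value recorded for the given resource and property."""
    return [v for (r, p, v) in triples if r == resource and p == prop]


def properties_named(triples, local_name):
    """Return the distinct full property URIs that end in `local_name`."""
    return sorted(
        {p for (_, p, _) in triples
         if p.rsplit("#", 1)[-1].rsplit("/", 1)[-1] == local_name}
    )


print(values(triples, "http://www.zefhemel.com", DC + "title"))
print(properties_named(triples, "length"))
```

Both applications can call their property "length". Because the full property name includes the namespace, the two never collide.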

There is an extension to RDF called RDFS (RDF Schema) which, essentially, adds a type system to RDF. You can define classes and their properties, define sub-classes and so on.

So how would we apply RDF(S) to WebFS? Before I attempt to answer that question let's first define some terminology.

Personally I think we should abstract away from the idea of dealing with "files" in a file system. I would prefer the term "item", or "data item". I would also like to use the term "container" for what a traditional filesystem calls a "folder" or "directory". The reason is that in WebFS it does not really matter how things are stored in the actual local file system, or whether they are stored on a filesystem at all (they may be stored in a database). It matters that they are data items. An e-mail is a data item, even if you store a thousand of them in one file -- the file is a low-level concept that I would like WebFS to abstract away from. My problem with the terms directory and folder is that they are very specific to documents. You put a document in a folder, but do you put music in a folder, or pictures? No, you put them in an album, or more generally: in something that contains something else -- a container.

Alright, back to the question about how to use RDF within WebFS.

I propose to do this in the following way. Every item on WebFS will have its own, unique URI. Every item has a set of metadata associated with it. This metadata is represented in RDF. WebFS itself defines a base RDFS class called "Item" and a sub-class of that, called "Container". Every data item will be of type Item or a sub-class of that. "Container" is a special kind of Item, namely one representing a set of other Items.

Then a whole bunch of Item sub-classes can be defined, some inheriting from Item directly, others from Container. Some examples could be: Music (inheriting from Item), Photo (inheriting from Item), PhotoAlbum (inheriting from Container) and so on. For these classes properties would be defined. We would also predefine some properties for the Item class, for instance: title, author, date created, permission, content type -- the general set of metadata that would be valid for any kind of data item. Then for a Music class we could add properties like: album, year, song writer and so on. An example WebFS RDFS class hierarchy:

Item
+-- Container
|   +-- PhotoAlbum
|   +-- MusicAlbum
+-- Music
+-- Video
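As a rough sketch of what this hierarchy buys us, the following Python models the sub-class links and the inheritance of properties. The class names come from the text above. The property identifiers (for instance `dateCreated`) and the helper functions are my own assumptions for illustration, not a defined WebFS vocabulary.

```python
# Direct sub-class links, child -> parent, mirroring the hierarchy above.
SUBCLASS_OF = {
    "Container": "Item",
    "PhotoAlbum": "Container",
    "MusicAlbum": "Container",
    "Music": "Item",
    "Video": "Item",
}

# Properties declared on each class (hypothetical identifiers).
PROPERTIES = {
    "Item": ["title", "author", "dateCreated", "permission", "contentType"],
    "Music": ["album", "year", "songWriter"],
}


def lineage(cls):
    """Return `cls` followed by all of its ancestors, nearest first."""
    chain = [cls]
    while chain[-1] in SUBCLASS_OF:
        chain.append(SUBCLASS_OF[chain[-1]])
    return chain


def is_a(cls, base):
    """True if `cls` is `base` or a (transitive) sub-class of it."""
    return base in lineage(cls)


def valid_properties(cls):
    """Own properties first, then everything inherited from ancestors."""
    props = []
    for c in lineage(cls):
        props.extend(PROPERTIES.get(c, []))
    return props


print(lineage("PhotoAlbum"))
print(is_a("Music", "Container"))
print(valid_properties("Music"))
```

A PhotoAlbum is a Container and therefore an Item, so anything that works on Items works on it too. Music picks up the general Item properties in addition to its own.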

We could standardize a set of common Item types (such as Music, Video, Photo, PhotoAlbum). However, applications should be able to define their own metadata types too. A synchronization application, for example, should be able to add properties to existing items (which RDF allows). An application called, let's say, tagger may allow its users to tag any kind of data item they own. RDFS is flexible enough to allow this.

An example RDF description of a data item (plus some made-up synchronization metadata):

<?xml version="1.0" encoding="utf-8"?>
<webfs:Image rdf:about="http://zuzia.local/~zef/webfs/tmpfs/iceskating.jpg">
   <sync:SyncRecord rdf:about="sync:http://zuzia.local/~zef/webfs/tmpfs/iceskating.jpg">
   </sync:SyncRecord>
</webfs:Image>

To make all this work there is one very important rule that every WebFS storage provider should adhere to: all metadata should be persisted. A WebFS storage provider should be able to persist an arbitrary set of metadata for each item, even if it does not use the metadata itself. The reason for this is to allow (client) applications to define metadata on items for their own usage, but it also means that the user could copy their photo albums from Flickr to, let's say, OmniDrive and back without any metadata being lost.
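A minimal sketch of that rule, assuming a toy in-memory provider: the `MemoryProvider` class, the `copy_item` helper and all URIs below are hypothetical stand-ins, not a WebFS API. What matters is that the provider stores every property it is handed, including ones from namespaces it has never heard of.

```python
class MemoryProvider:
    """Hypothetical storage provider that keeps all metadata verbatim."""

    def __init__(self):
        # uri -> (content bytes, {property URI: [values]})
        self._items = {}

    def put(self, uri, content, metadata):
        # Copy the metadata so later changes by the caller do not leak in.
        self._items[uri] = (content, {p: list(vs) for p, vs in metadata.items()})

    def get(self, uri):
        content, metadata = self._items[uri]
        return content, {p: list(vs) for p, vs in metadata.items()}


def copy_item(src, src_uri, dst, dst_uri):
    """Copy content and *all* metadata from one provider to another."""
    content, metadata = src.get(src_uri)
    dst.put(dst_uri, content, metadata)


photo_uri = "http://example.org/photos/iceskating.jpg"
metadata = {
    "http://example.org/webfs#title": ["Ice skating"],
    # A property from an application the providers know nothing about.
    "http://example.org/tagger#tag": ["winter", "family"],
}

provider_a = MemoryProvider()  # stands in for one service
provider_b = MemoryProvider()  # stands in for another

provider_a.put(photo_uri, b"placeholder image bytes", metadata)
copy_item(provider_a, photo_uri, provider_b, photo_uri)  # there ...
copy_item(provider_b, photo_uri, provider_a, photo_uri)  # ... and back

content, round_tripped = provider_a.get(photo_uri)
print(round_tripped == metadata)
```

If either provider dropped the properties it did not recognize, the tagger data would be gone after the round trip.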

Creating a metadata system with RDF is simple yet powerful. There are RDF libraries available for practically any platform. Once we establish this system we can see what the metadata system can be used for beyond what is described here. I can imagine using it for the permission system as well. We could define "canBeReadBy" and "canBeWrittenBy" properties on items to define who has read and write permission for a particular data item.
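As a sketch of how such permission properties might be checked, assuming permissions are stored as ordinary triples: the property names come from the paragraph above, while the namespace, the item and user URIs and the helper functions are made up for illustration.

```python
# Hypothetical namespace; only the two property names come from the text.
WEBFS = "http://example.org/webfs#"
CAN_BE_READ_BY = WEBFS + "canBeReadBy"
CAN_BE_WRITTEN_BY = WEBFS + "canBeWrittenBy"

ITEM = "http://example.org/items/iceskating.jpg"
ZEF = "http://example.org/users/zef"
GUEST = "http://example.org/users/guest"

# Permissions are just more (resource, property, value) triples.
triples = {
    (ITEM, CAN_BE_READ_BY, ZEF),
    (ITEM, CAN_BE_WRITTEN_BY, ZEF),
    (ITEM, CAN_BE_READ_BY, GUEST),
}


def can_read(triples, item, user):
    """True if a canBeReadBy triple grants `user` access to `item`."""
    return (item, CAN_BE_READ_BY, user) in triples


def can_write(triples, item, user):
    """True if a canBeWrittenBy triple grants `user` access to `item`."""
    return (item, CAN_BE_WRITTEN_BY, user) in triples


print(can_read(triples, ITEM, GUEST))
print(can_write(triples, ITEM, GUEST))
```

A real permission system would need more than this (groups, defaults, inheritance from containers), but the check itself stays a plain metadata lookup.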