WebFS: a Web of Data

Interesting trends in data storage have been taking place in the past few years, even if people often do not realize it. Whereas we have stored our data on local disks for the past thirty years, we are slowly moving more and more of it to remote storage on the Internet. Examples of this are storing pictures on Flickr, videos on YouTube, e-mail on Gmail and contact information on Plaxo. It will not stop there: soon people will do office work (word processing, spreadsheets and presentations) on services like Google Docs and Zoho. This trend has major implications for data security, privacy and reliability, but also for the ways we manage and manipulate our data. Whereas before all of our data sat in a tree-like directory structure, it now lives spread across the Internet, stored on servers all over the world. Although we gain a lot from this move, such as access to our data from any computer anywhere in the world, we currently also lose a lot: services offer different interfaces with different capabilities, and data lock-in is easy to run into, meaning your data is not mobile and you cannot move it from one service to another. The idea behind WebFS, the Web File System, is to define a uniform interface to data storage and to bring the data mobility and freedom of traditional storage to the emerging world of Internet storage.

Once there is a uniform way to access data, it finally becomes possible to also define uniform ways to manipulate it. Manipulation could happen by passing data through web processes: little web services that each do one small task very well, in the spirit of the Unix command-line toolset and its pipe-and-filter concept. Data is obtained from a data store, passed to a web process, then passed on to another one, and so on; eventually the manipulated data could end up on a data store again. As an example, imagine wanting to create a JPEG thumbnail of a RAW picture file. There are two web processes we could use for this: an image converter which converts RAW to JPEG, and then a thumbnail web process which creates a thumbnail from the JPEG image, as sketched below.
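
To make this concrete, here is a minimal sketch of such a chain in Python, assuming the data store and the two web processes are plain HTTP endpoints; the URLs, parameter names and the use of the requests library are illustrative assumptions, not a defined WebFS interface.

```python
# Hypothetical pipe-and-filter chain: data store -> RAW-to-JPEG converter ->
# thumbnailer -> data store. All URLs and parameter names are invented.
import requests

def invoke(process_url, content, params=None):
    """Send content to a web process and return the manipulated content."""
    response = requests.post(process_url, data=content, params=params or {})
    response.raise_for_status()
    return response.content

# 1. Fetch the RAW picture from a WebFS data store.
raw_image = requests.get("https://photos.example.com/webfs/items/IMG_0042").content

# 2. Pipe it through the converter, then through the thumbnailer.
jpeg_image = invoke("https://processes.example.com/raw-to-jpeg", raw_image)
thumbnail = invoke("https://processes.example.com/thumbnail", jpeg_image,
                   params={"width": 160, "height": 120})

# 3. Store the result back on a data store.
requests.put("https://photos.example.com/webfs/items/IMG_0042-thumb", data=thumbnail)
```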

With web processes we create little programs in the Unix tradition that do one small task well and can be chained to get a bigger task done. They are in essence web services, but they all conform to the same standard interface. This would make it easy not only for users to manipulate their data, but also for web developers to integrate these processes into their applications. If a company like Riya created a face-recognition web process, Flickr could pass its pictures through it to find out who is in those pictures, which could then be used to improve search results. Looking at other web application areas, such as online collaborative document editing, one could imagine an HTML-to-PDF converter web process that users and application developers can use. Or web processes that aid in migrating data between services by converting between the different data formats of online spreadsheet programs or calendar applications (such as Google Spreadsheets and EditGrid, or Google Calendar and 30 Boxes).

Before describing the architecture to make this work, let’s get a taste of what WebFS would enable people to do.

Vision of a WebFS-enabled world
Debbie no longer stores much of her data on her own computer; everything is stored on the Internet, and she only keeps a small cache of it locally so she can also work offline. This is useful, because now she can access her data from any device. On her mobile phone she sometimes listens to the music she stores on MP3 Tunes and watches videos she has put on YouTube. Her phone uses a video conversion web process to resize the YouTube videos to fit her screen. As she listens to her music, she can add tags to songs or rate them, metadata that is persisted on the server.

When Debbie sits down at her laptop she fires up her data manager. In this data manager she sees a directory-like structure of all her data. It is possible to search through it quickly, as specialized search engines have been built for this purpose. She can search for all pictures she took between 2005 and 2007 in her favorite city in the world: Paris. The pictures have GPS location metadata associated with them, automatically added by her camera. This metadata was enriched by the “picture locator” web process before she uploaded the pictures: the process reads a picture’s GPS coordinates and appends city and country information to the metadata.

Even though the data appears as one big searchable tree structure to Debbie, it is actually stored on many different services. The Pictures folder, for example, lives on Flickr, while the Family folder inside it links to the family album she keeps on Zooomr. The Documents folder comes from Google Docs & Spreadsheets. Debbie uses Google Docs extensively to collaboratively edit documents with her friends; sometimes, however, it is easier to edit a document in Zoho Writer. She right-clicks on the file, selects “Open in Zoho Writer”, edits the document and saves it. She wants to send the document to a friend, but she knows this friend prefers to receive all documents as PDF, so she invokes a web process that converts the document to PDF before it is sent.

Because Debbie is a bit scared that her photo collection will some day disappear, she creates a backup of all her Flickr photo albums every month. She has a folder called Backups, and all the items in this folder are stored on Omnidrive. She runs a little program called “synchronizer”, which compares the metadata of the items in the Backups folder and the Pictures folder to see if anything has changed since it was last run. For this it uses special synchronization metadata on each of the items. The program then copies all changed and added pictures over from Flickr to Omnidrive.

Architecture
There are three main components needed to make this idea work:

  1. Data stores that retrieve and store data (through a standardized interface)
  2. Web processes that take data plus some parameters as input and give the manipulated data as output (through a standardized interface)
  3. Applications that interface with data stores and web processes

Data stores
When we talk about data, we really talk about two things:

  • Content data: the data that represents the content (JPEG image data, Word document data)
  • Metadata: data about the data, such as title, author, creation date, size and tags, but also type-specific metadata like image width, image height, the location where a picture was taken, and so on

WebFS-enabled data stores therefore need four fundamental operations on data items (such as contacts, documents and pictures):

  1. Get data: retrieve the content data of the data item from the data store
  2. Get metadata: retrieve the metadata of the data item from the data store
  3. Put data: store the content data of the data item in the data store
  4. Put metadata: store the metadata of the data item in the data store

Each data item has its own URL on which these operations are performed. The content data can have any form, such as JPEG for pictures, HTML for webpages and RTF for Word documents. Not every data store has to accept every type of data item: a picture service, for example, could accept only photos; forcing it to also store Word documents would not make much sense. A service like Omnidrive or Amazon S3, on the other hand, would accept any kind of data.
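
As a sketch of what this could look like from a client’s point of view, assuming metadata is reachable at a hypothetical “<item-url>/metadata” address (a convention made up here for illustration, not something WebFS prescribes):

```python
# Minimal client sketch for the four fundamental operations. The
# "<item-url>/metadata" convention and JSON metadata are assumptions
# made purely for illustration.
import requests

class WebFSItem:
    def __init__(self, url):
        self.url = url  # every data item is identified by its own URL

    def get_data(self):
        """Get data: retrieve the content data of the item."""
        return requests.get(self.url).content

    def get_metadata(self):
        """Get metadata: retrieve the metadata of the item."""
        return requests.get(self.url + "/metadata").json()

    def put_data(self, content, content_type="application/octet-stream"):
        """Put data: store the content data of the item."""
        requests.put(self.url, data=content,
                     headers={"Content-Type": content_type}).raise_for_status()

    def put_metadata(self, metadata):
        """Put metadata: store the metadata of the item."""
        requests.put(self.url + "/metadata", json=metadata).raise_for_status()
```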

There is one special kind of data item: the folder. A folder is fundamentally just another data item, but its content is a list of links (URLs) to the data items contained within it. A photo album (which would be a sub-class of a folder) contains a list of links to the pictures in that album. Because a folder contains a list of URLs, which could link to anywhere on the web, there is no hierarchy intrinsic to the URLs themselves. So http://someuri.com/folder could be a folder containing http://someuri.com/folder2. Judging by the URL structure these two folders seem to be siblings, but as URLs are purely identifiers in WebFS, http://someuri.com/folder is in this case the parent of http://someuri.com/folder2.
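
How a folder’s list of links is encoded is still an open question; purely as an illustration, it could be something as simple as the following (the JSON shape and the service URLs are invented):

```python
# One possible, purely illustrative content for a folder data item: a list
# of links to other items. The URL paths imply no hierarchy; containment
# comes only from these links.
import json

folder_content = json.dumps({
    "type": "folder",
    "title": "Pictures",
    "items": [
        "http://someuri.com/folder2",               # a child folder, despite the sibling-looking URL
        "https://flickr.example.com/photos/123",    # a photo item on one service
        "https://zooomr.example.com/albums/family", # an album kept on another service
    ],
}, indent=2)
```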

For metadata it is desirable to look at technologies from the semantic web. One of the semantic web languages, such as RDF (Resource Description Framework) or OWL (Web Ontology Language), seems an obvious choice for providing semantic metadata about data items; both have schema languages to predefine metadata sets. Describing metadata in a semantic way is useful because it allows reasoners to reason about it. For example, if you are searching for pictures taken in Italy and there are pictures tagged with “Rome” and “Pisa”, and somewhere on the semantic web it is stated that Rome and Pisa are cities in Italy, it can be inferred that these pictures are indeed from Italy. This allows for very interesting new ways of searching data, which will only get more interesting as semantic web research evolves.
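
As a small illustration of what such metadata could look like, here is a sketch using the rdflib library; the “webfs” vocabulary namespace is invented, and any agreed RDF Schema or OWL vocabulary could play that role instead.

```python
# Sketch of semantic metadata for a picture, expressed as RDF triples.
# The vocabulary namespace and item URL are made up for illustration.
from rdflib import Graph, Literal, Namespace, URIRef

WEBFS = Namespace("http://example.com/webfs/terms#")
picture = URIRef("https://photos.example.com/webfs/items/IMG_0042")

graph = Graph()
graph.add((picture, WEBFS.title, Literal("Leaning tower at dusk")))
graph.add((picture, WEBFS.tag, Literal("Pisa")))
graph.add((picture, WEBFS.takenIn, URIRef("http://dbpedia.org/resource/Pisa")))

# A reasoner that also knows "Pisa is a city in Italy" could now answer a
# search for "pictures taken in Italy", even though Italy is never stated here.
print(graph.serialize(format="turtle"))
```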

Every data store has to be able to persist any kind of metadata the user desires. It can choose for itself how to do this; some metadata will be stored inside the file, some will have to be stored separately. Jon Udell has a nice discussion on this issue. The ability to fully persist any kind of metadata allows lossless backups: it would be possible, for instance, to fully back up a photo album, including its tags and comments, and restore it without any loss of data.

Web Processes
Web processes have only a single operation: invoke. What is passed to a web process is the following:

  1. Parameters: these can be compared to command line parameters, supplied by the user
  2. Metadata: the metadata of the data item
  3. Content data: the content of the data item (analogous to the standard input in Unix programs)

A web process then has two outputs:

  1. Metadata: the (possibly manipulated) metadata of the data item
  2. Content data: the (possibly manipulated) content of the data item (analogous to the standard output in Unix programs)

An invocation therefore looks as follows:
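
Since no wire format has been pinned down yet, the contract can be sketched as a plain function; the Python form, names and parameters below are illustrative only, not a fixed WebFS interface.

```python
# The contract of a web process, sketched as a function: parameters, metadata
# and content go in; (possibly manipulated) metadata and content come out.
def invoke(parameters: dict, metadata: dict, content: bytes) -> tuple[dict, bytes]:
    # Identity process: a placeholder for real work (convert, resize, tag, ...).
    return metadata, content

# Hypothetical call to a thumbnailer-style process:
new_metadata, new_content = invoke(
    parameters={"width": 160, "height": 120},       # like command-line arguments
    metadata={"title": "Leaning tower at dusk"},    # metadata of the data item
    content=b"...jpeg bytes from a data store...",  # analogous to standard input
)
```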

The big issue to be resolved with web processes is privacy. Users will often be sending private, maybe even confidential, data to these processes; how can they be sure that the processes will not store this data themselves and hand it to third parties? This is an important issue that has to be resolved, but in practice it is likely to come down to trust: only use web processes when you know who built them and what their privacy policy is.

Applications
Because WebFS makes data storage on the Internet transparent, it also becomes completely unimportant where data is stored. Applications and storage can be separated, with WebFS as the interface between them, much like local file storage works today: the operating system manages storage, and applications use that storage through operating system APIs.

Today, web applications usually store the user’s data themselves, but with WebFS there is no reason to do that anymore. The earlier example of editing a document stored on Google Docs in Zoho Writer could apply to any other application. If a user has an Omnidrive account, which can store any kind of data, a web application could simply use Omnidrive’s WebFS interface to store the user’s data on his or her Omnidrive. Something like this is already happening with Zoho and Omnidrive, but through Omnidrive’s proprietary API, and photo editing services such as Preloadr and Picture2Life interface with sites like Flickr to store and retrieve photos. A web application in this scenario simply becomes a front-end. Web applications would then compete on their feature set and ease of use, rather than on the fact that users’ data is locked into their service and users therefore cannot switch anymore. This is much healthier for both users and application vendors.

Companies whose core business is storing data and building web interfaces to that data (such as Omnidrive, Xdrive and others) can then also integrate other WebFS data stores into their product. I could use Omnidrive’s built-in MP3 player to play music I currently have stored on MP3 Tunes or Amazon S3, and I could move data between different WebFS stores. These applications would function as the file managers of the Internet.
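
Moving an item between two stores then needs no store-specific code at all; a short sketch, reusing the illustrative “/metadata” convention from the Data stores section (URLs invented):

```python
# Copy one item, content and metadata, from one WebFS store to another.
import requests

src = "https://mp3tunes.example.com/webfs/items/song-17"
dst = "https://storage.example.com/webfs/music/song-17"

requests.put(dst, data=requests.get(src).content).raise_for_status()
requests.put(dst + "/metadata", json=requests.get(src + "/metadata").json()).raise_for_status()
```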

Where to move from here
WebFS is not a particular technology or standard at this point; currently it is an architecture with some implementation ideas, open to discussion. A few days ago I got an e-mail from the CEO of a major web storage company in response to my previous post on WebFS. He is working on gathering support among companies for an open standard for data storage and hoped I could help out. My previous post on WebFS was rather brief, so I thought it would be a good idea to first outline my vision of how this would work and what it would enable.

Even if web application vendors do not start supporting WebFS immediately, wrappers can be implemented for them. It would not be very difficult to create a WebFS wrapper around the Flickr API, for example, or around Amazon S3.
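
As an illustration of the wrapper idea, here is a rough sketch of mapping the four operations onto Amazon S3 using the boto3 library; S3’s user-defined object metadata stands in for WebFS metadata here, which is a simplification, and the bucket and key names are invented.

```python
# Sketch of a WebFS-style wrapper around Amazon S3: object body = content
# data, user-defined object metadata = (a limited form of) item metadata.
import boto3

class S3WebFSWrapper:
    def __init__(self, bucket):
        self.s3 = boto3.client("s3")
        self.bucket = bucket

    def get_data(self, key):
        return self.s3.get_object(Bucket=self.bucket, Key=key)["Body"].read()

    def get_metadata(self, key):
        return self.s3.head_object(Bucket=self.bucket, Key=key)["Metadata"]

    def put_data(self, key, content):
        # Note: overwriting content this way also resets S3 object metadata;
        # a fuller wrapper would read it first and re-apply it.
        self.s3.put_object(Bucket=self.bucket, Key=key, Body=content)

    def put_metadata(self, key, metadata):
        # S3 has no separate "put metadata" call, so copy the object onto
        # itself with replaced metadata.
        self.s3.copy_object(Bucket=self.bucket, Key=key,
                            CopySource={"Bucket": self.bucket, "Key": key},
                            Metadata=metadata, MetadataDirective="REPLACE")

# Example use (hypothetical bucket and key):
# wrapper = S3WebFSWrapper("debbies-backups")
# wrapper.put_data("pictures/IMG_0042.jpg", open("IMG_0042.jpg", "rb").read())
```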

Conclusion
The idea behind WebFS is simple yet powerful: it brings uniform storage and the pipes-and-filters model to the Internet. This offers great advantages to consumers, because they regain data mobility and the freedom to do whatever they want with their data. It can also benefit web application developers, because they can choose to stop worrying about data storage and focus purely on the application itself. For WebFS to work, standards will have to be created and agreed upon, but when that happens it will be a great step forward for the Internet.