A Better Diff/Merge

I like version control systems. Ever since I got a VPS to run my own server stuff on, I’ve been using “Subversion”:http://subversion.tigris.org to store many of my files. This is great not only for when I screw things up and want to retrieve an older version of a file, but also for synchronisation between different computers. Version control pops up in many pieces of software, wikis for example. Version control in wikis works great, and do you know why? Because wiki pages are plain text documents.

Two of the most useful features in version control systems are diff and merge. Diff (for difference) compares two versions of the same file (usually an old one and a new one) and shows you the differences. Merge can then apply the output of the diff to the old version, which results in the new version (a diff is like a delta). This becomes really useful when two people are working on the same file at the same time, but in different parts of it: one is editing one section, somebody else is editing another. When both check in their changes, the version control system compares (diffs) each version with the current one in the repository, and if the changes don’t conflict it can merge them both in, so nobody’s work is lost.
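To make the "a diff is like a delta" idea concrete, here is a minimal sketch using Python's standard @difflib@ module. It is not how Subversion works internally; the file contents and names (@old.txt@, @new.txt@) are made up for illustration.

```python
import difflib

# Two versions of the same (made-up) text file, as lists of lines.
old = ["apples\n", "bananas\n", "cherries\n"]
new = ["apples\n", "blueberries\n", "cherries\n", "dates\n"]

# Diff: show the differences in the familiar unified format.
for line in difflib.unified_diff(old, new, fromfile="old.txt", tofile="new.txt"):
    print(line, end="")

# A delta that keeps both sides: ndiff marks every line as
# unchanged, removed ("- ") or added ("+ ").
delta = list(difflib.ndiff(old, new))

# "Merge" in the simplest sense: rebuild either version from the delta.
rebuilt_new = list(difflib.restore(delta, 2))
rebuilt_old = list(difflib.restore(delta, 1))
```

A real version control system goes one step further and combines two such deltas against a common ancestor, but the building block is the same line-by-line comparison.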

But this only works with text files. If you run a diff on a Word document or a JPEG image, it won’t work. Why not? Because the diff and merge tools compare files on a line-by-line basis. Even if they worked on a byte-by-byte basis it wouldn’t work right, because changing one pixel in a JPEG image can change the whole file around.
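The byte-by-byte point can be tried out with a small experiment. This is only a sketch: it uses @zlib@ compression as a stand-in for an image codec (JPEG works differently), and the "pixel data" is a synthetic gradient I made up. It counts how many byte positions differ before and after compression when a single input byte is changed.

```python
import zlib

# Synthetic stand-in for pixel data: 64x64 bytes of a simple gradient.
pixels_old = bytes((x * 7 + y * 13) % 256 for y in range(64) for x in range(64))

# Change exactly one "pixel" near the start of the data.
changed = bytearray(pixels_old)
changed[10] ^= 0xFF
pixels_new = bytes(changed)


def differing_positions(a: bytes, b: bytes) -> int:
    """Count positions where a and b differ, plus any difference in length."""
    common = min(len(a), len(b))
    return sum(1 for i in range(common) if a[i] != b[i]) + abs(len(a) - len(b))


raw_changes = differing_positions(pixels_old, pixels_new)

# zlib is used here only as an example of a compressed format.
packed_old = zlib.compress(pixels_old)
packed_new = zlib.compress(pixels_new)
packed_changes = differing_positions(packed_old, packed_new)

print(f"raw:        {raw_changes} of {len(pixels_old)} bytes differ")
print(f"compressed: {packed_changes} of {len(packed_old)} bytes differ")
```

In the raw data exactly one byte differs. In the compressed data the change can ripple through a much larger share of the file, which is why a generic byte-level diff says little about what actually changed in the picture.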

This is a problem also raised by “Tjaard”:http://www.tjaard.nl/2005/05/21/smart-version-controlling-why-diff-is-just-not-enough. He argues that version control systems should become more file-type specific. Right now there are basically two kinds of files: text files and binary files. That’s it. A while ago I went to a graduation talk by somebody at our university who researched diffing and merging UML diagrams. That’s the kind of stuff I’m talking about. A version control system should have plug-ins for different file types: for Word files, for JPEG files, for UML diagrams, for XML files.

Tjaard asks whether this would cause performance problems. All I can say is: screw that. Even if you need a bigger server for version control, it saves so many (wo)man-hours and so much productivity that it’s totally worth it.