For the past three months or so I have been working on my dissertation full time. I think I mentioned before that it is about context-aware semantic service matching on ad-hoc networks, and more specifically, about developing middleware (APIs) that lets developers do this more easily. I’m not going to talk about the project much (I don’t even know what I am allowed to say, as we might be publishing on this). But after working with one particular technology for a few months, I feel I should be able to explain its significance. That technology is RDF, and understanding what it does and why it matters has been the biggest challenge I have encountered so far in this research. I still cannot say that I fully appreciate its power, but I’ll try to give you at least a feel for why it matters.
The web is great. The web works. The web gives us loads and loads of information we’re interested in through tables, images and plain text. Pages are interlinked which allows us to easily jump from one page to the other. Fantastic.
But now let’s say you were recently hired by a software company that runs a big recruitment site. They list jobs and try to find good people for those jobs. Your assignment is to write a piece of software that spiders the web to find people who fit a particular job and to create profiles of them. The information we are interested in includes personal details like name, address, and country, but also work experience and the other things you usually put on a resume.
How would you do this?
Even as a normal web surfer this is already a challenge. I mean, how do you find a random person who fits a profile? Your best bet is to do a Google search on the job area and hope you’ll find some individuals. After that it’s not so hard anymore: personal websites and blogs usually list some personal information, a resume (in HTML or PDF format), and so on.
But how would you let a computer do this?
To be honest, beats me. A computer can only retrieve web pages and look at the HTML code, which doesn’t say that much. You can make some good guesses, but the information you can extract from a free-form HTML page is always limited.
Why is it so hard?
The answer is the lack of semantics.
n. (used with a sing. or pl. verb)
- Linguistics The study or science of meaning in language.
- Linguistics The study of relationships between signs and symbols and what they represent. Also called semasiology.
- The meaning or the interpretation of a word, sentence, or other language form: We’re basically agreed; let’s not quibble over semantics.
(X)HTML’s semantic power is very limited. There are some tags, like h1, h2, …, address, strong, and em, that add a little bit of semantic information, but it’s not nearly enough. Not even close.
Let’s have a look at my very own about page. There is quite a bit of information on that page that may be of interest to the application you were asked to develop. My full name is there, gender, date of birth, occupation and some contact details. There’s also a link to a (somewhat outdated) CV. But can a computer understand this? Maybe a bit; I structured this information pretty clearly, and it’s quite possible to construct a parser that extracts the interesting information from this particular page. But we don’t care about me in particular; it has to be a generic solution. We’re not going to construct parsers for every possible way of writing a personal website; it would be more efficient to input all the data manually.
No, looking at the HTML code is pretty much hopeless. As mentioned, we need semantic information: statements about a person. For example, information like this would be much more helpful:
http://www.zefhemel.com hasName “Zef Hemel”
http://www.zefhemel.com hasGender http://someuri.org/genders#Male
http://www.zefhemel.com hasDateOfBirth “1983–06–22”
http://www.zefhemel.com hasOccupation http://www.cs.tcd.ie/courses/mscnds
http://www.zefhemel.com hasEmail “email@example.com”
You get the idea. If instead of HTML we got a series of statements like this, that would be much more helpful. Essentially this kind of information is really simple; it is just a bunch of triples in the form:
subject predicate object
Or, less formally:
subject property value
Which is very much like writing object-oriented code:
subject.property = value
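As a sketch, here is how such triples could be represented in plain Python. The URIs and property names are the hypothetical ones from the statements above, not part of any official vocabulary:

```python
# A triple is simply a (subject, predicate, object) tuple.
# The URIs and property names are the hypothetical ones used above.
person = "http://www.zefhemel.com"

triples = [
    (person, "hasName", "Zef Hemel"),
    (person, "hasGender", "http://someuri.org/genders#Male"),
    (person, "hasDateOfBirth", "1983-06-22"),
]

# The object-oriented analogy: subject.property = value
for s, p, o in triples:
    print(f"{s}.{p} = {o!r}")
```

A real RDF library would of course offer much more, but at its core the data model is just this list of three-part statements.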
If only we had information like this, that would be great. And guess what? This is pretty much what RDF is. There are some small technicalities, which I’ll quickly explain, but essentially this is it. In RDF, subjects and predicates are all URIs (Uniform Resource Identifiers). URIs are different from URLs in the sense that they don’t necessarily identify Locations but are simply Identifiers, i.e. the “address” you supply in the URI does not really have to exist as long as it is identifying (unique). The object of each triple can be either a URI (as in the hasOccupation triple) or a literal value (a number, string, date and so on). So in RDF a triple really looks like this: a URI, a URI, and then either a URI or a literal.
RDF has different so-called serializations: ways of writing it down. The most common one is RDF/XML. WordPress keeps messing up any HTML I insert here, so I’ll link to a brief example instead.
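For reference, a minimal RDF/XML rendering of the earlier statements might look roughly like this (the `ex:` namespace and its property names are illustrative placeholders, not an official vocabulary):

```xml
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:ex="http://someuri.org#">
  <rdf:Description rdf:about="http://www.zefhemel.com">
    <ex:hasName>Zef Hemel</ex:hasName>
    <ex:hasGender rdf:resource="http://someuri.org/genders#Male"/>
    <ex:hasDateOfBirth>1983-06-22</ex:hasDateOfBirth>
  </rdf:Description>
</rdf:RDF>
```

Note the pattern: the `rdf:about` attribute names the subject, each child element is a predicate, and the object is either element text (a literal) or an `rdf:resource` attribute (a URI).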
The question is: who defines the predicates/properties you can use (hasName, hasGender and so on)? The answer is that you do; you have complete freedom here. In one way that’s very nice; in another, it causes some trouble.
If an application retrieves the above RDF file from somewhere on the web, it can query it. One can ask “give me all objects where the subject is http://www.zefhemel.com and the predicate is http://someuri.org#hasName” and it would return “Zef Hemel”, so that’s handy. However, who says that somebody else on another website used the same set of predicates? Maybe they didn’t use http://someuri.org#hasName but http://myuri.org#name. How can a computer know they mean the same thing? That is a problem, but it can be solved with inference rules. Somewhere on the web it should be stated that http://someuri.org#hasName and http://myuri.org#name are the same thing and therefore give you the same information.
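Both the query and the “same predicate” problem can be sketched in a few lines of Python. The predicate URIs are the hypothetical ones from above, and the equivalence table stands in for real inference rules published somewhere on the web:

```python
triples = [
    ("http://www.zefhemel.com", "http://someuri.org#hasName", "Zef Hemel"),
    ("http://example.org/jane", "http://myuri.org#name", "Jane Doe"),
]

# A stand-in for the "these predicates mean the same thing" rule:
# a lookup table mapping each known alias to a canonical predicate URI.
same_as = {"http://myuri.org#name": "http://someuri.org#hasName"}

def canonical(predicate):
    """Map a predicate to its canonical form, if an equivalence is known."""
    return same_as.get(predicate, predicate)

def objects(subject, predicate, triples):
    """All objects where the subject and (canonicalized) predicate match."""
    predicate = canonical(predicate)
    return [o for (s, p, o) in triples
            if s == subject and canonical(p) == predicate]

print(objects("http://www.zefhemel.com", "http://someuri.org#hasName", triples))
# ['Zef Hemel']
print(objects("http://example.org/jane", "http://someuri.org#hasName", triples))
# ['Jane Doe']
```

The second query finds Jane’s name even though her site used a different predicate, because the equivalence is applied before matching. Real systems express this with statements like owl:sameAs rather than a hard-coded dictionary.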
Inference rules can be used for many other things; they can infer new statements that weren’t explicit before. For example, let’s say that the semantic version of my resume says:
http://www.zefhemel.com hadJob #RuGJob1
http://www.zefhemel.com hadJob #OtherJob
#RuGJob1 hasName “System Administration”
#RuGJob1 startYear 2002
#RuGJob1 endYear 2005
#OtherJob hasName “Writing stuff”
#OtherJob startYear 2003
The fact that no endYear is specified means the job hasn’t ended yet; this person is still doing this job. So one could construct a rule like this:
?p hadJob ?job, ?job endYear ?ey, !bound(?ey) -> ?p hasJob ?job
(you can read the commas as logical ANDs here).
This rule says that if somebody had a job where the endYear is not specified, this person still has that job. We extracted new information by using a rule.
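As a sketch of what an inference engine does with this rule, here is the same idea in plain Python, using the triples from the resume above (the `#RuGJob1`-style identifiers are local names, and the triple representation is the same hypothetical one as before):

```python
triples = [
    ("http://www.zefhemel.com", "hadJob", "#RuGJob1"),
    ("http://www.zefhemel.com", "hadJob", "#OtherJob"),
    ("#RuGJob1", "hasName", "System Administration"),
    ("#RuGJob1", "startYear", 2002),
    ("#RuGJob1", "endYear", 2005),
    ("#OtherJob", "hasName", "Writing stuff"),
    ("#OtherJob", "startYear", 2003),
]

def apply_has_job_rule(triples):
    """?p hadJob ?job, ?job endYear ?ey, !bound(?ey) -> ?p hasJob ?job"""
    inferred = []
    for person, predicate, job in triples:
        if predicate != "hadJob":
            continue
        # !bound(?ey): check whether any endYear was stated for this job
        has_end_year = any(s == job and p == "endYear" for s, p, _ in triples)
        if not has_end_year:
            inferred.append((person, "hasJob", job))
    return inferred

print(apply_has_job_rule(triples))
# [('http://www.zefhemel.com', 'hasJob', '#OtherJob')]
```

Only #OtherJob produces a new hasJob statement, because #RuGJob1 has an endYear. A real rule engine would keep applying rules like this until no new triples can be derived.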
One can imagine that when a degree is mentioned (like the http://www.cs.tcd.ie/courses/mscnds one), this URI is retrieved and checked for RDF information. It could, for example, contain information about the skills of somebody who completed this degree. By fetching related RDF resources, a lot of useful information could be extracted.
This is called the semantic web.
This looks like a utopian idea, doesn’t it? Well, it is. There are some problems with the semantic web, the biggest one being the amount of semantic data available today. For this to work, a lot more data needs to be published as RDF, and so far it’s not catching on that much. That’s a big issue. Yesterday I found a website called rdfdata.org that links to sources of RDF data, some quite interesting, like a semantic Wikipedia, but we need much, much more.
Tim Berners-Lee, who invented the web and also conceived the semantic web, has been fighting for it to catch on for many years, but with little visible result (you really have to look for places where it’s applied).
Maybe you can think about how RDF could be applied in your own applications.