Because good research needs good data

Semantic Web of Linked Data for Research?

Chris Rusbridge | 24 July 2009

In the beginning was the World Wide Web. Then we were going to have the Semantic Web. (Then we had Web 2.0, but that’s another story.) But maybe the Semantic Web wasn’t semantic enough for some, so they changed the name to Linked Data, and it began to take off a little more. Now there’s an argument on whether all linked data are Linked Data!

The debate started with Andy Powell asking on Twitter what name we should use when all the conditions for Linked Data are met except for one, which was the requirement that data be expressed in standards, specifically RDF (see Andy's summary). Tim Berners-Lee had suggested there were four principles for Linked Data:

  1. Use URIs as names for things.
  2. Use HTTP URIs so that people can look up those names.
  3. When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL).
  4. Include links to other URIs, so that they can discover more things.
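
As a rough illustration of how the four principles fit together, here is a minimal Python sketch. It does not touch the network: the `WEB` dictionary is a toy stand-in for HTTP lookups, every `example.org` URI is made up, and the function names `look_up` and `discover` are my own assumptions rather than anything from a real library. Only the `foaf:name` and `foaf:knows` property names come from the FOAF vocabulary.

```python
# Toy in-memory stand-in for the Web. Every URI here is hypothetical.
# Principle 1: things are named with URIs.
ALICE = "http://example.org/people/alice"
BOB = "http://example.org/people/bob"

WEB = {
    ALICE: [
        (ALICE, "foaf:name", "Alice"),
        (ALICE, "foaf:knows", BOB),  # Principle 4: a link to another URI
    ],
    BOB: [
        (BOB, "foaf:name", "Bob"),
    ],
}


def look_up(uri):
    """Stand-in for an HTTP GET returning RDF-style triples (principles 2 and 3)."""
    return WEB.get(uri, [])


def discover(start):
    """Follow links to other URIs to find more things (principle 4)."""
    seen, queue = [], [start]
    while queue:
        uri = queue.pop(0)
        if uri in seen:
            continue
        seen.append(uri)
        for _subject, _predicate, obj in look_up(uri):
            # Treat any object that looks like an HTTP URI as a link to follow.
            if isinstance(obj, str) and obj.startswith("http://"):
                queue.append(obj)
    return seen


found = discover(ALICE)
```

Starting from the one URI for Alice, `discover` also reaches Bob, purely by following the link in the data. The argument in the debate is about what `look_up` has to return for this to count as Linked Data.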

There were quite strong divisions; one group said roughly: “Linked Data is a brand and a definition; live with it”, while the other group said something like “Linked Data can afford to be inclusive, and will benefit from that” (both of these are extreme simplifications). I’ve read all the remarks and they’re pretty convincing; I mostly agree with them (not much help to you, gentle reader!). Paul Walk's summary is quite balanced. However, I particularly liked a comment made on someone else’s blog post by Dan Brickley, who should know about RDF (quoted by Andy in the post mentioned above):

“I have no problem whatsoever with non-RDF forms of data in “the data Web”. This is natural, normal and healthy. Statistical information, geographic information, data-annotated SVG images, audio samples, JSON feeds, Atom, whatever.

We don’t need all this to be in RDF. Often it’ll be nice to have extracts and summaries in RDF, and we can get that via GRDDL or other methods. And we’ll also have metadata about that data, again in RDF; using SKOS for indicating subject areas, FOAF++ for provenance, etc.

The non-RDF bits of the data Web are – roughly – going to be the leaves on the tree. The bit that links it all together will be, as you say, the typed links, loose structuring and so on that come with RDF. This is also roughly analogous to the HTML Web: you find JPEGs, WAVs, flash files and so on linked in from the HTML Web, but the thing that hangs it all together isn’t flash or audio files, it’s the linky extensible format: HTML. For data, we’ll see more RDF than HTML (or RDFa bridging the two). But we needn’t panic if people put non-RDF data up online…. it’s still better than nothing. And as the LOD scene has shown, it can often easily be processed and republished by others. People worry too much! :)”

I think this makes lots of sense for research data. I’ve been wondering for some time how RDF fits into the world of research data. I asked the NERC Data Managers at their meeting earlier this year, and the general consensus appeared to be that RDF was good for the metadata, but not the actual research data. This seems reasonable and is consistent with Dan’s view above.

But it does rather raise the question of exactly what kinds of data RDF IS suitable for. It begins to look as if it is good for isolated facts, simple relationships and descriptive data. While RDF can probably encode most of what you would put in databases or scientific datasets, doing so would generally be very difficult, and there would be a massive explosion of triples if one tried.
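
To give a feel for the "explosion of triples", here is a minimal Python sketch that naively turns a small table into subject-predicate-object triples, one per cell plus one type statement per row. The table contents, the `http://example.org/obs/` namespace and the function name `table_to_triples` are all invented for illustration, and this is only one simple mapping among many possible ones.

```python
# Hypothetical namespace and observations; none of this is real data.
BASE = "http://example.org/obs/"

rows = [
    {"station": "A1", "date": "2009-07-01", "temp_c": 14.2, "rain_mm": 0.0},
    {"station": "A1", "date": "2009-07-02", "temp_c": 15.1, "rain_mm": 3.4},
    {"station": "B7", "date": "2009-07-01", "temp_c": 12.8, "rain_mm": 1.1},
]


def table_to_triples(rows, base=BASE):
    """Naively map each table cell to one (subject, predicate, object) triple."""
    triples = []
    for i, row in enumerate(rows):
        subject = f"{base}row/{i}"
        # One extra triple per row to say what kind of thing the row is.
        triples.append((subject, "rdf:type", f"{base}Observation"))
        for column, value in row.items():
            triples.append((subject, f"{base}col/{column}", value))
    return triples


triples = table_to_triples(rows)
print(len(triples))  # 3 rows x (1 type triple + 4 cells) = 15
```

Under this mapping a table of n rows and m columns becomes n × (m + 1) triples, so three short rows already need fifteen statements, and a dataset with millions of observations would need correspondingly more.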

To answer Andy’s original question (what name…), although I was taken with the idea of linked data, it’s clearly too easy to confuse with Linked Data. So I think I’d go with Paul Walk’s suggestion of Web of Data, or interchangeably Dan Brickley's data Web. If we can weave research data into a Web of Data, we’ll be doing well!