Metadata mapping for the pilot UK Research Data Registry

Having set up the software that will power our pilot UK Research Data Registry, we are now writing crosswalks that will let us import metadata records from various UK data centres and institutional data repositories. We have mapped from five different metadata schemes to the registry's own...

Alex Ball | 24 January 2014

Back in October 2013, we told you about a project we are involved in to set up a pilot UK Research Data Registry. In short, we are piloting a service which would allow researchers to search for data across multiple UK data centres and institutional data repositories in one go. A lot of groundwork has gone on since then to get the platform ready, and we are now at the point where we can start to make things happen.

Our pilot registry uses the same software that powers Research Data Australia. Internally it uses a metadata scheme called RIF-CS to store information about datasets, repositories, researchers, funders, projects and so on. Meanwhile, the data centres around the UK all use metadata schemes tailored to their own needs and those of their user communities. Universities are setting up their own data repositories, and the metadata schemes they are using are still in flux for the most part. So how do we get the information on datasets out from these various sources and combine it so it can be searched, browsed and displayed in a unified manner?

The approach we are taking is to harvest metadata in whatever form the data centres and repositories can already provide, and convert it ourselves to RIF-CS using a crosswalk. While the registry software comes with this capability built in, it does not come with many crosswalks, so we are writing them ourselves. This is a four-step process:

  1. Matching. We compare the source metadata scheme with RIF-CS and work out which elements or attributes are saying the same thing in both schemes.
  2. Mapping. In some cases the information will be stored differently in the two schemes. For example, RIF-CS uses a controlled vocabulary to describe the subject classification scheme in use, while the DataCite metadata scheme allows subject classification schemes to be identified by a URI. In this case we need to find a way of transforming the information from one representation to the other.
  3. Coding. We turn our mapping into code that the registry software can run.
  4. Testing and improving. Different people and systems can interpret standards in different ways, so the only way to know if a crosswalk is working correctly is to try it out and see if the output remains true to the input.
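
To make the mapping step a little more concrete, here is a minimal Python sketch of the subject example above: a DataCite-style subject that identifies its classification scheme by URI is converted to a RIF-CS-style subject that names the scheme with a controlled term. This is an illustration only, not the crosswalk code used by the registry. The lookup table, the function names and the dictionary layout are all assumptions made for the example, and the type terms shown should be checked against the actual RIF-CS vocabulary before being relied on.

```python
from urllib.parse import urlparse

# Hypothetical lookup table from scheme URI to RIF-CS subject type term.
# The entries are illustrative assumptions, not an authoritative list.
SCHEME_URI_TO_TYPE = {
    "http://id.loc.gov/authorities/subjects": "lcsh",
    "http://purl.org/au-research/vocabulary/anzsrc-for/2008": "anzsrc-for",
}

# Assumed fallback term for schemes we cannot match.
FALLBACK_TYPE = "local"


def normalise_uri(uri):
    """Lower-case the scheme and host and drop trailing slashes,
    so that trivial variations of the same URI still match."""
    parts = urlparse(uri.strip())
    path = parts.path.rstrip("/")
    return f"{parts.scheme.lower()}://{parts.netloc.lower()}{path}"


def map_subject(subject):
    """Convert a DataCite-style subject (a dict with 'value' and an
    optional 'schemeURI') into a RIF-CS-style dict with 'type' and 'value'."""
    scheme_uri = subject.get("schemeURI")
    subject_type = FALLBACK_TYPE
    if scheme_uri:
        subject_type = SCHEME_URI_TO_TYPE.get(
            normalise_uri(scheme_uri), FALLBACK_TYPE
        )
    return {"type": subject_type, "value": subject["value"]}


if __name__ == "__main__":
    example = {
        "value": "Oceanography",
        "schemeURI": "http://id.loc.gov/authorities/subjects/",
    }
    print(map_subject(example))
```

Falling back to a generic term rather than raising an error is a design choice for the sketch: it keeps a harvest running when a repository uses a scheme we have not yet matched, at the cost of losing some precision that the testing stage would then need to catch.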

So far we have performed the matching stage and part of the mapping stage for five metadata schemes, in consultation with the data archives and repositories that use them:

  • DDI, as used by the UK Data Archive;
  • the profile of ISO 19115 used by the NERC Data Catalogue Service;
  • the DataCite metadata schema;
  • the EPrints ReCollect metadata profile, originally developed by the Research Data @ Essex project;
  • the OAI-PMH Dublin Core metadata profile (i.e. the profile supported by all OAI-PMH endpoints).

In February 2014, we will complete the mappings and encode them as crosswalks, after which we can start harvesting records. As someone who has promoted interoperability for many years, I am finding the prospect of this actual interoperation rather exciting.

You can keep in touch with our progress on this blog using the tag 'research data registry', or visit /projects/research-data-registry-pilot