Working with DataCite

11 September, 2012

What is DataCite?

“You could say that DataCite is to data sets what CrossRef is to publications,” explained DataCite Technical Lead, Ed Zukowski.

It’s not a perfect analogy, but it does give an idea of DataCite’s primary focus: it is a registration agency.

DataCite was founded as an organisation in 2009 by a group of international partners, which included the British Library. It aims to establish easier access to research data and build the profile of research data as legitimate contributions in the scholarly record. This is achieved through advocacy and by allowing researchers and data centres to assign persistent identifiers to datasets through their local DataCite Member.

DataCite supports the minting of persistent identifiers (DOIs), which are part of the larger Handle global resolution system, and the registration of associated metadata.

A Technical Introduction

Yesterday (Monday 10 September) I attended a practical workshop aimed at those considering incorporating DataCite services into their repository. It was held at the British Library Conference Centre, St. Pancras, London NW1 2DB.

The day was led by DataCite Technical Lead, Ed Zukowski, and provided both an overview of the DataCite technical infrastructure and opportunities to try out available services in a series of hands-on sessions.

Hands-on Exercises

The first part of the workshop took a look at the DataCite Metadata Store (MDS) and allowed us to create a metadata record and then register it.

This was a lot less tricky than I’d imagined and it wasn’t long before I had a few metadata records up in the sandbox test area. Those with more sophisticated programming skills had time to have a go at using the API and language/tool of their choice for upload.

My colleague, Andrew McHugh, recommended HTTPie as one possible option for doing this. It is clear that if you are responsible for a repository service you would need some kind of integration software to upload a large number of DOIs.
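For illustration, here is a rough sketch of what such a scripted upload might look like against the MDS REST API, in Python rather than HTTPie. The sandbox endpoint, datacentre credentials, test prefix and metadata values are all placeholders, and the XML would need to follow whichever version of the DataCite Metadata Schema your member supports.

```python
# A minimal sketch of registering a dataset DOI against the DataCite MDS
# REST API. Endpoint, credentials, prefix and metadata below are
# placeholders -- your local DataCite member supplies the real ones.
import requests

MDS = "https://mds.test.datacite.org"        # sandbox endpoint (assumed)
AUTH = ("MY.DATACENTRE", "my-password")      # datacentre symbol + password

doi = "10.5072/example-dataset-001"          # 10.5072 is a test prefix
metadata_xml = """<?xml version="1.0" encoding="UTF-8"?>
<resource xmlns="http://datacite.org/schema/kernel-3">
  <identifier identifierType="DOI">10.5072/example-dataset-001</identifier>
  <creators><creator><creatorName>Doe, Jane</creatorName></creator></creators>
  <titles><title>Example dataset</title></titles>
  <publisher>Example Data Centre</publisher>
  <publicationYear>2012</publicationYear>
</resource>"""

# Step 1: upload the metadata record (the DOI is taken from the XML).
r = requests.post(MDS + "/metadata", data=metadata_xml.encode("utf-8"),
                  headers={"Content-Type": "application/xml;charset=UTF-8"},
                  auth=AUTH)
print(r.status_code, r.text)

# Step 2: mint the DOI by pointing it at the dataset's landing page.
body = "doi=" + doi + "\nurl=https://example.org/datasets/001"
r = requests.put(MDS + "/doi/" + doi, data=body.encode("utf-8"),
                 headers={"Content-Type": "text/plain;charset=UTF-8"},
                 auth=AUTH)
print(r.status_code, r.text)
```

Wrapping the two calls in a loop over a repository's records is essentially all the "integration software" amounts to, plus error handling and logging.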

The second lot of exercises had us using the DataCite Metadata Search, a service that allows people to search the metadata of datasets registered with DataCite. The DataCite search API is based on Apache Solr. The easiest way to obtain a specific API call is to use the Metadata Search user interface and select the API query for the data type you require. Custom API queries can also be built using the common Solr search parameters.
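As a hedged example, the query below passes a few of those common Solr parameters to the search API from Python. The base URL and field names are assumptions, and the response is read in Solr's default JSON layout, so the exact call you need is best copied from the Metadata Search interface itself.

```python
# A sketch of querying the DataCite Metadata Search API with standard Solr
# parameters. Base URL, fields and response layout are assumptions.
import requests

SEARCH = "https://search.datacite.org/api"     # assumed search endpoint

params = {
    "q": "climate",                            # free-text Solr query
    "fq": "resourceTypeGeneral:Dataset",       # filter query: datasets only
    "fl": "doi,title,publicationYear",         # fields to return
    "rows": 10,                                # page size
    "wt": "json",                              # response writer: JSON
}

resp = requests.get(SEARCH, params=params)
for doc in resp.json()["response"]["docs"]:    # default Solr JSON structure
    print(doc.get("doi"), doc.get("title"))
```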

After this there was only time for us to concentrate on our own area of interest, so I took a look at the Content Resolution service. It exposes metadata stored in the DataCite Metadata Store (MDS) in multiple formats (RDF/XML, RDF Turtle, JSON, RIS, BibTeX and more). Data centres that participate in DataCite can define their own formats.
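A minimal sketch of how that content resolution might look from a script is below: the same DOI is requested in several formats simply by varying the HTTP Accept header. The resolver URL is an assumption and the DOI is a placeholder you would swap for a registered one.

```python
# A sketch of DataCite content resolution via HTTP content negotiation.
# The resolver URL and the example DOI are placeholders.
import requests

RESOLVER = "https://data.datacite.org"         # content resolver (assumed)
doi = "10.5072/example-dataset-001"            # substitute a registered DOI

for accept in ("application/rdf+xml",          # RDF/XML
               "text/turtle",                  # RDF Turtle
               "application/x-bibtex",         # BibTeX
               "application/x-research-info-systems"):  # RIS
    r = requests.get(RESOLVER + "/" + doi, headers={"Accept": accept})
    print("==", accept, "==")
    print(r.text[:200])                        # start of each representation
```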

DataCite also offers other services I didn’t get time to look at: an OAI Provider, which exposes DataCite metadata using the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH); the Metadata Schema, a list of core metadata properties used to identify data; and more.
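Since OAI-PMH is a standard protocol, a harvest of DataCite records might look roughly like the sketch below. The base URL and metadataPrefix are assumptions rather than anything covered on the day, and any existing OAI-PMH client library would do the same job with less code.

```python
# A rough sketch of harvesting record headers from an OAI-PMH endpoint.
# The endpoint and metadataPrefix shown are assumptions.
import requests
import xml.etree.ElementTree as ET

OAI = "https://oai.datacite.org/oai"           # assumed OAI-PMH endpoint
NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

resp = requests.get(OAI, params={"verb": "ListRecords",
                                 "metadataPrefix": "oai_dc"})
root = ET.fromstring(resp.content)

# Print the OAI identifier of each record in the first page of results.
for header in root.findall(".//oai:header", NS):
    print(header.findtext("oai:identifier", namespaces=NS))
```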

To Sum Up

The day was enjoyable and informative. I’d been worried that it would be too technical for me, but everyone helped each other and we worked at our own speed. All the DataCite source code is held on GitHub, so if you are a programmer that might be a good place to start.

There was lots of useful discussion during the day too.

For example, a particular area of interest for one delegate was version control. DataCite currently holds different versions of dataset metadata, but only one URL is live for each DOI, which can sometimes cause problems.

It was noted that the UK Data Archive has a process where they manually change the suffix (e.g. v1) and create separate DOIs for each version; each links to a landing page which in turn links to all versions.

The workshop is one in a series of DataCite workshops, which form part of the dataset programme. The next full-length workshop, 'Managing sensitive data', will be held on Monday 29 October.
