
Data citations: versions and granularity

On 3 December 2012, the British Library ran a DataCite workshop entitled 'What to cite: Versioning and granularity of research data for effective citation'. I went along to hear how data centres, institutional data archives, publishers and researchers are making different versions of...

Alex Ball | 11 December 2012

Citing datasets as you would a journal article or a conference paper is nowhere near as straightforward as it sounds. That's why we at the DCC have produced a briefing paper and how-to guide on the subject, and why the British Library is running a series of DataCite workshops.

On 3 December, I attended workshop no. 3 in this series: 'What to cite: Versioning and granularity of research data for effective citation'. (I hesitate to call it the third one: counting the additional workshop on using DataCite services, it was actually the fourth.) Elizabeth Newbold (BL) noted in her introduction that the issues seem to be getting more complex as the series progresses, and judging by the depth of the discussions at this one I can quite believe it.

The workshop was organised into two parts: four talks giving perspectives from different stakeholders, followed by group exercises and discussions.

Kicking things off, Roy Lowry (British Oceanographic Data Centre) gave the discipline-specific data centre view. Oceanography is one of those disciplines where the data are sparse and costly to collect, so they are routinely shared. It is also characterised by dynamic data: datasets are constantly expanding as new sensor data come in, and if systematic errors are detected the data are corrected. Until now, the data centre has only needed to serve up the latest version of the data. The idea of a data citation, though, implies that a reader of a paper ought to be able to retrieve the version of the data used by the author. The BODC is therefore having to radically overhaul how it manages data in order to support version snapshots. It is also considering providing access to both the submitted version of the data and the final normalised form. This is to support authors who need to refer to their newly submitted data, and don't want publication of their journal paper held up by the BODC's value-adding activities.

Next, Neil Jefferies (Bodleian Library, University of Oxford) gave the view from an institutional data repository. Neil's starting point could hardly have been more different from Roy's: the Library was used only to static holdings, but now must learn to deal with dynamic data; it also has to deal with data of such variety that it cannot hope to normalise them. Nevertheless, the workflow he described was similar: very little intervention occurs before deposited data are made available on the system with an identifier. Metadata are added and improved later on. Versioning is of course very important, and Oxford's system records not only the provenance of individual datasets but also (through the concept of aggregation) the links between related datasets. On the matter of granularity, it is possible to construct URIs for any file held by the system, but as a rule of thumb only the identifiers for complete sets are 'promoted' to DOIs. The thought is that each citable entity should make sense independently of any wider set or collection to which it might belong.
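To make the shape of that model concrete, here is a minimal sketch in Python. The names and structures are my own illustration, not the Bodleian system's actual schema or API: every file gets a constructible URI, versions are linked by provenance, related datasets are grouped into aggregations, and only the identifier of a complete set is promoted to a DOI.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class DataFile:
    uri: str  # every file held by the system has an addressable URI

@dataclass
class Dataset:
    uri: str
    files: List[DataFile] = field(default_factory=list)
    previous_version: Optional["Dataset"] = None  # provenance chain between versions
    doi: Optional[str] = None                     # set only if the identifier is 'promoted'

@dataclass
class Aggregation:
    uri: str
    members: List[Dataset] = field(default_factory=list)  # links between related datasets

def promote_to_doi(dataset: Dataset, doi: str) -> None:
    """Promote a complete, independently meaningful set to a citable DOI."""
    dataset.doi = doi
```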

If these first two talks had been about forcing datasets to conform to a static publication model, the talk from Rebecca Lawrence (Faculty of 1000) turned the whole idea on its head. F1000 Research is a journal that has adapted its publication model to mimic the kind of dynamism exhibited by data. Papers are published almost immediately, following some initial checks. Papers are then reviewed in public by both nominated reviewers and anyone else who is interested, and authors have the chance to submit new versions based on the comments. This approach threw up some issues that also apply to data. When services such as Scopus measure the impact of papers, should the citations for different versions be combined or kept separate? They decided the citations should be combined, even if the authorship changed between versions. Should new versions be cited by the original publication year? They decided yes, so it remains trivial to determine the order in which papers were published. Should the addition of a new review trigger a new DOI? They decided not, as such a change would not be substantive.

The final talk was from Simon Coles (National Crystallography Centre, University of Southampton), putting forward the researcher perspective. He pointed out that while most publishers host supplementary data in support of the articles they publish, some are going off the idea, so researchers should consider carefully the other options for making their data available. Two alternatives he introduced were eCrystals, a crystallography data repository based on EPrints, and LabTrove, an electronic laboratory notebook system based on a blogging paradigm. LabTrove records experiments as a series of blog posts and file attachments, allowing for highly granular data retrieval: one can get everything, or just one file, or just the methodology posts, or all the data relating to a particular experimental stage. The LabTrove developers are currently tackling whether and how DOIs should be assigned in each of these cases.
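As a rough illustration of that retrieval granularity, here is a sketch using hypothetical Python structures of my own devising, not LabTrove's real data model or API: an experiment is a series of typed posts with file attachments, so one can pull out just the methodology posts, or just the files relating to one experimental stage.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Post:
    post_id: str
    kind: str                     # e.g. "methodology", "observation", "analysis"
    stage: str                    # experimental stage the post relates to
    attachments: List[str] = field(default_factory=list)  # attached data files

def methodology_posts(posts: List[Post]) -> List[Post]:
    """Just the methodology posts."""
    return [p for p in posts if p.kind == "methodology"]

def files_for_stage(posts: List[Post], stage: str) -> List[str]:
    """All the data files relating to a particular experimental stage."""
    return [f for p in posts if p.stage == stage for f in p.attachments]
```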

The talks were followed by two exercises. In the first we voted on which changes to a dataset should trigger a new DOI, and discussed the results. Generally speaking, people felt new DOIs were needed if and only if the changes could affect the conclusions drawn from the data, though of course there were edge cases that prompted debate.
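As a toy encoding of that rule of thumb (the categories below are my own illustration, not an agreed list from the workshop), the decision reduces to asking whether a given kind of change could affect the conclusions:

```python
# Kinds of change judged likely to affect conclusions drawn from the data
# (illustrative categories only).
SUBSTANTIVE_CHANGES = {
    "values corrected",
    "records added",
    "records removed",
    "processing method changed",
}

def needs_new_doi(change: str) -> bool:
    """Apply the rule of thumb: mint a new DOI only for conclusion-affecting changes."""
    return change in SUBSTANTIVE_CHANGES

# Under this rule, improving descriptive metadata would not trigger a new DOI.
assert not needs_new_doi("metadata improved")
assert needs_new_doi("values corrected")
```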

In the second exercise, we split into three groups, each discussing a different question. My group tried to decide whether DOIs would be an appropriate way of making other research outputs citable. It took a while for anyone to say it out loud, but I'd say we were guided by the notion that DOIs should be a guarantee of persistence, and that they should point to digital objects.

One of the other groups tried listing the issues to consider when writing a policy on granular data identification. They came up with quite a few, but I think the biggest was that different boundaries seem natural for different communities, so it is important to be flexible.

The remaining group was asked about the unique challenges of attributing credit for datasets. They found that many of the challenges applied to journal papers as well; in fact, the thing that really distinguished datasets was the scale at which one might have to give attribution. They did make a point that is sometimes overlooked: that it would be useful for datasets themselves to cite their precursor datasets.

This was certainly a worthwhile workshop, and I got a lot out of it. It gave an airing to issues that are challenging on a theoretical as well as a practical level. The presentations have been made available from the workshop archive. The last workshop in the series will be held in February or March 2013, so do look out for it.