Making Data Count

17 April, 2013

For some time now, the digital curation community has argued that datasets deserve to be considered first class research outputs. Now at last there is an appetite within the academic mainstream to make it happen, but there are still a lot of implications to work through. The importance of a research activity is judged by the impact of its outputs, so one thing we need to work out is how datasets fit into research assessment procedures. This was the topic of a workshop held by the Knowledge Exchange – a collaboration between Jisc and its peers in Finland, Denmark, Germany and the Netherlands – on 11-12 April 2013 in Berlin.

The workshop started with a series of talks from invited speakers. Simon Hodson (Jisc) began with a bit of historical context, demonstrating that many of the concerns that researchers express about data sharing have been circulating for at least 300 years. John Flamsteed, Astronomer Royal, did not release the data that led him to believe that a comet sighted in 1681 was the one from 1680 on its return journey. Reportedly, he regarded the data as his own property rather than a public good (because he had paid personally for much of his equipment), and was concerned that it was incomplete and insufficiently verified. Furthermore, he was angry that Newton had repeated his theory without attribution. Many researchers today have similar concerns, but perhaps could be reassured about data sharing if the kinds of processes and cultural norms associated with papers could be applied to data as well:

  • accepted quality standards and checks;
  • due credit given via data citations, and due rewards calculated using impact metrics;
  • codes of conduct governing how data may be shared and reused.

Katrien Maes outlined the thoughts of the League of European Research Universities on research assessment and open scholarship, concluding with an action plan for universities. She suggested they should

  • develop a data management strategy and plan,
  • stimulate data sharing among their researchers,
  • provide training on data management and digital curation for students and research staff,
  • recognise shared data when reviewing performance and assigning rewards, and
  • openly communicate the benefits of data sharing and collaboration.

Geoffrey Boulton (Royal Society and University of Edinburgh) criticised the way in which data has become divorced from research conclusions in papers. Indeed, he argued that publishing papers without the underlying data should be considered malpractice. What is needed is for papers and data to be available online and reintegrated. In particular, data should be made intelligently open; that is, accessible, intelligible, assessable and reusable. Geoffrey also suggested actions for various stakeholders: funders should contribute towards the cost of making data intelligently open, for example, and publishers should mandate the deposit of underlying data in open repositories.

Denis Huschka explained the benefits of data sharing before describing the work of the German Data Forum (RatSWD) in the area. He argued that as science is international, international solutions are needed, and that a collaborative approach is key even if one approach does not suit all circumstances.

The Knowledge Exchange recently published a report, The Value of Research Data, which explores possible data metrics and associated reward systems. Rodrigo Costas (Centre for Science and Technology Studies, Leiden University) gave a summary of the report and the research behind it. One of the useful points he made was about the data sharing vicious circle: scholars don't share because there are no rewards to make it worthwhile; as stocks of shared data are low, there is little reuse, so metrics do not give smooth, stable results; and without smooth, stable metrics one cannot fairly apportion rewards for data sharing. The report suggests that this cycle should be broken by encouraging formal citation of datasets, and by institutions looking favourably on data sharing activities when considering recruitment, tenure and promotion. The report leaves open the following questions:

  • Is the publication–citation model the right one to use for data metrics, or would another one better reflect the value and impact of data?
  • What weight should be given to data citations, relative to citations of papers?
  • What should be the incentives for data sharing when grants are received and when publications are accepted?
  • Do data metrics have a role in selecting data for long term preservation?

We then had a discussion led by a panel including Ortwin Dally (German Archaeological Institute), Ross Mounce (University of Bath and Open Knowledge Foundation), Riitta Mustonen (NordForsk), and Joachim Wambsganss (Heidelberg University). Here are some points I picked up from it:

  • On the matter of big versus small data repositories, there are challenges to overcome at both ends of the spectrum. Big repositories need to provide good tools for filtering data, and to manage multiple user communities. Small repositories, meanwhile, need to work hard on interoperability so their holdings are visible to cross-search engines.
  • There is no magic formula for successful data sharing in a community: human nature and accidents of history play their part. TreeBASE was set up by botanists, and is used extensively by botanists but not by, say, zoologists whose data would be equally at home there.
  • Perhaps funders should make it a policy to favour bids that promise excellent data sharing but merely good research, over bids that promise excellent research but without sharing data.
  • Both carrots and sticks are needed to encourage data sharing. There is little point requiring data sharing if compliance is not checked: perhaps there should be a hotline that potential reusers could ring if they can't find data that should be available.

On the second day, there were two more talks representing additional perspectives. Bill Michener (DataONE) sent a video message in which he set out a three-point action plan for encouraging data sharing:

  1. give researchers due credit for sharing data, leaning on identifier schemes like DOIs and ORCIDs, and services like ImpactStory;
  2. give researchers the tools to make data sharing easy;
  3. effect cultural change through education, training and funding policies.

Jan Brase (DataCite) demonstrated some of the lesser-known DataCite services, including the metadata search service and resolution statistics service.

The rest of the workshop was taken up by five parallel groups, each tackling a different issue. I was in the group looking at impact metrics for published data. We recommended immediately collecting and collating as much statistical evidence as we can on how datasets are being used (downloads, DOI resolutions, citations, etc.) and, in parallel, studying what these statistics actually imply in practice. In the longer term, we need to understand how to take account of disciplinary differences when applying and interpreting the metrics.
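
To give a feel for what collating such evidence might involve, here is a minimal sketch in Python. It is not something the group produced: the record layout, the identifiers and all of the numbers are invented for illustration, and a real exercise would draw on repository logs, resolution statistics and citation indexes instead.

```python
from collections import defaultdict

# Hypothetical usage records: (dataset identifier, kind of use, count).
# Identifiers and figures are made up purely for illustration.
events = [
    ("doi:10.1234/example-a", "download", 120),
    ("doi:10.1234/example-a", "doi_resolution", 45),
    ("doi:10.1234/example-a", "citation", 3),
    ("doi:10.1234/example-b", "download", 15),
    ("doi:10.1234/example-b", "citation", 1),
    ("doi:10.1234/example-a", "download", 30),
]


def collate(records):
    """Sum the counts for each kind of use, grouped by dataset."""
    totals = defaultdict(lambda: defaultdict(int))
    for dataset, kind, count in records:
        totals[dataset][kind] += count
    # Convert the nested defaultdicts to plain dicts for easier inspection.
    return {dataset: dict(kinds) for dataset, kinds in totals.items()}


summary = collate(events)
for dataset, counts in sorted(summary.items()):
    print(dataset, counts)
```

Keeping the different kinds of use separate, rather than folding them into one score, leaves room for the second half of the recommendation: working out what each statistic actually implies before combining or comparing them.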

We were perhaps more enthusiastic about metrics than the group considering the research assessment procedures of funders and universities. They felt that metrics could not replace, but only supplement, peer review of datasets. They also agreed that universities should bring up data sharing in job interviews and appraisals, and appoint more data management specialists. Funders should implement clear, sensible data sharing policies and mandates, and encourage data reuse via their data management plan requirements.

Another group discussed codes of conduct for sharing data. They recommended such codes be written by learned societies and publishers, with funders providing co-ordination. Ideally the codes should be specific on implementation issues (such as criteria and standards to use) and be collected into a catalogue for researchers to consult.

The next steps to improve linkage between data and other research outputs, a fourth group argued, should be to standardise both data citation practice and contextual metadata. Linking should be extensive and reciprocal, and amenable to expression as Linked Open Data.

The remaining group was interested in quality assurance processes. Their immediate recommendation was to find domains where data peer review is currently working well, and encode those processes as templates that other domains might adapt.

I was pleased with the practical focus of this workshop. It would have been all too easy to spend the time on old ground, rehearsing the benefits of data sharing to one another and listing the barriers. Instead, we came up with a range of practical recommendations and aspirations for the members of the Knowledge Exchange to consider. I hope the wider community also takes notice.

Presentations from the workshop are available from the event webpage.