IDCC13 Data Publication: generating trust around data sharing

23 January, 2013

Publishing data is hardly a new concept; the Journal of Physical and Chemical Reference Data, for example, is in its 42nd volume. But reference datasets are of course an exception to the rule: maps of the known disciplinary terrain rather than the stuff normally encountered when exploring it. Recently, though, the idea of a ‘data journal’ has gained a lot of traction as a vehicle for publishing research data. One of the issues this session was concerned with was the potential for this and other publication models to make data sharing work more effectively for everyone with an interest in its creation and reuse.

A data paper (or data article) takes data that has been deposited in a repository and expands on the ‘why, when and how’ of its collection and processing, leaving the account of the analysis and conclusions to a conventional article, perhaps written at a different time and by different authors. Data journals are not the only model around; ‘enhanced publications’ take the rather different tack of integrating underlying datasets into the online article. By developing services around vocabularies and visualisation, enhanced publications should help the data give more voice to the claims made, and allow others to contest them.

Both of these strands of thought were in evidence during the Data publication session on day 2 of IDCC13, and in a post-conference workshop that I organised with the PREPARDE project (more on that later). In both cases the discussion often returned to the question of ‘trust’ – in the relations between data producers, users, and the various intermediaries involved, mainly repositories and publishers.

Dutch organisation DANS (Data Archiving and Networked Services) has done leading work in the enhanced publication area. Maarten Hoogerwerf took us through results from their work in the EU OpenAIREplus project to support the linking of data and other contextual information to publications. As he reminded us, there is a lot of diversity across research fields in the kinds of data and enrichment that are both needed and realistically implementable. Different stakeholders also frame their needs in different ways: libraries approach data publication as a metadata linking issue, publishers as a matter of managing the ‘supplementary information’ they allow authors to submit with their papers, and researchers tend to see it as a way to link articles to data visualisations.

In OpenAIREplus, DANS and other partners built demonstrator projects in the life sciences, social sciences and humanities, and then reflected on how common needs could best be supported. Disciplinary knowledge was represented through relevant repositories, for example, in the UK, the British Atmospheric Data Centre (BADC) and UK PubMed Central. Each demonstrator aimed to make disciplinary datasets available through a generic portal, which harvested descriptions of publications, datasets, research projects and researchers from various sources, including the disciplinary repositories, linked them together and provided navigation between them.
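
To make that linking step concrete, here is a minimal sketch of the pattern: harvested records are indexed by identifier and cross-linked, so a portal can navigate from any object to its neighbours. This is not OpenAIREplus code; the record fields and identifiers are invented for illustration.

```python
from collections import defaultdict

# Hypothetical harvested records; in OpenAIREplus these would come from
# the disciplinary repositories' metadata feeds. All identifiers and
# fields here are invented for illustration.
records = [
    {"id": "doi:10.1234/article-1", "type": "publication",
     "relations": ["doi:10.5072/dataset-9", "orcid:0000-0002-1825-0097"]},
    {"id": "doi:10.5072/dataset-9", "type": "dataset",
     "relations": ["orcid:0000-0002-1825-0097"]},
    {"id": "orcid:0000-0002-1825-0097", "type": "researcher",
     "relations": []},
]

# Index records by identifier and make every link bidirectional, so the
# portal can navigate from any object to everything connected to it.
index = {r["id"]: r for r in records}
links = defaultdict(set)
for r in records:
    for target in r["relations"]:
        links[r["id"]].add(target)
        links[target].add(r["id"])

def neighbours(obj_id):
    """Objects reachable in one hop, with their types, for navigation."""
    return [(index[t]["type"], t) for t in sorted(links[obj_id]) if t in index]

print(neighbours("doi:10.5072/dataset-9"))
# -> [('publication', 'doi:10.1234/article-1'),
#     ('researcher', 'orcid:0000-0002-1825-0097')]
```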

Hoogerwerf drew the interesting, though predictable, conclusion that management of these resources is best delegated to the original data sources (the disciplinary or sub-disciplinary repositories): the most useful relationships between objects are those captured by the most knowledgeable stakeholders. There is simply too much variety in the content and user requirements to handle advanced visualisations of these relationships in a cross-disciplinary portal.

Instead, cross-disciplinary efforts should concentrate on supporting resource discovery. Towards that, there is a need to distribute effort clearly, according to what sounded like the fine European principle of ‘subsidiarity’, i.e. with responsibility delegated to the disciplinary level most capable of dealing with it in a trustworthy manner. Alongside that, Hoogerwerf aligned enhanced publication with the goals of stable and globally unique identifier schemes like ORCID, vocabularies for objects and the relationships between them, and clarity on the data granularity required to meet user needs. Neither these conclusions nor the one that “more examples are needed” sounded revolutionary, but this was about infrastructure-building, which of course needs piecemeal evolution.
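
As a rough illustration of what identifier schemes and relationship vocabularies buy you, here is a sketch of a dataset description asserting typed links at two levels of granularity. It is loosely modelled on DataCite-style related identifiers; the field names and values are my own invention, not a DANS or OpenAIREplus schema.

```python
# A dataset description with typed relations, loosely modelled on the
# DataCite kernel's relatedIdentifier/relationType pairs. The identifiers
# and values are invented for illustration.
dataset = {
    "identifier": "doi:10.5072/climate-obs-2012",   # collection-level DOI
    "creator": {"name": "A. Researcher",
                "nameIdentifier": "orcid:0000-0002-1825-0097"},
    "relatedIdentifiers": [
        # The data paper documenting how the data were collected.
        {"relationType": "IsDescribedBy",
         "relatedIdentifier": "doi:10.1234/gdj.example"},
        # A finer-grained part, so users can cite exactly what they reuse.
        {"relationType": "HasPart",
         "relatedIdentifier": "doi:10.5072/climate-obs-2012/station-07"},
    ],
}

# A controlled relationType vocabulary lets a portal answer questions like
# "which paper documents this dataset?" without parsing free text.
papers = [r["relatedIdentifier"]
          for r in dataset["relatedIdentifiers"]
          if r["relationType"] == "IsDescribedBy"]
print(papers)  # ['doi:10.1234/gdj.example']
```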

DANS colleague Peter Doorn took a broader look at trust, placing it in the context of three recent cases of scientific fraud in the Netherlands, most notably in the field of psychology. He pointed out that data integrity, authenticity and quality are common concerns in the curation community. We have standards that aim to ensure they are maintained, but who is responsible for what? Repositories may take responsibility for authenticity and provenance, as far as they can track these, but they rarely take deliberate scientific fraud into consideration, Doorn argued.

The Schuyt report, from the three Dutch universities most directly affected by the fraud, presented recommendations from their joint investigation. The investigation spanned both outright fraud (such as fabricating data) and ‘sloppy science’, specifically in psychology but with wider ramifications. The report went further than calling for data to be published as a measure to improve transparency: it also stated that research assessment committees should take data management more seriously, and not simply accept journals’ peer review conclusions. Doorn also pointed out the report’s call for investment in training for researchers as a way to promote research integrity.

Two initiatives that Doorn linked to research integrity were the Data Seal of Approval and the Journal of Open Psychology Data from Ubiquity Press (also known for the JISC funded PRIME project). Both endeavours mesh well with the fraud report’s call for the psychology field and its journals to make better use of archives and repositories. The Data Seal of Approval (DSA) is a trust mark awarded to such repositories; to satisfy its evaluation criteria they need to ensure they obtain enough information from data producers to enable peer review. The Journal of Open Psychology Data offers researchers the opportunity to describe, in a peer reviewed data paper, how a dataset may be useful to other researchers and research users, once it has been deposited in DANS or another repository that satisfies Ubiquity’s criteria.

The question remains, though: how can we be sure that enough checks on data quality are in place in the peer review processes carried out by journals or within research groups? Disciplinary cultures must play a part in this, and during the discussion Nicholas Weber from the University of Illinois suggested that ‘open and closed evidential cultures’ (I think referring to Harry Collins’ work) could offer insights into how that happens.

Sarah Callaghan took up the theme of data peer review, a focus of the JISC funded PREPARDE project. She pointed out that journals have published data throughout their 348-year history. In the last few decades, publishers have responded to the growth of digital data by allowing authors to deposit supplementary materials. Recently this has become a burden, and ‘supp info’ shares with departmental websites and the like a lack of discoverability and persistence compared with deposit in a recognised repository.

The problem PREPARDE is concerned with stems from this lack of persistence: unless data is managed and made discoverable on an equal footing with articles, the scientific record risks becoming a trail of broken links. One response is the Dryad or PANGAEA model, in which publishers point authors to a third-party repository for depositing their underlying data, and both parties then maintain their respective parts. Another is the ‘overlay journal’, in which authors produce data papers as described at the beginning of this article. This approach was pioneered in the JISC funded RIOJA and CLADDIER projects around 2005-7 and is now exemplified by the Geoscience Data Journal and Ubiquity Press’s offerings.

The result is potentially three kinds of research output (dataset, data paper and journal article), each of which can be cited and earn credit for its contributors. As Callaghan pointed out, one of the key selling points of this model is that a data article may include in its author list people involved in data management or processing who would not otherwise be credited in a journal article.

Another key question is whether data articles fit more easily into the journal peer review system than datasets themselves do. If data articles afford better scientific review of data quality, then they could potentially attract more kudos than the underlying datasets. But that arrangement also depends on trusted relationships between the actors involved: researchers and repositories may benefit from greater visibility, while publishers can leverage the effort repositories spend on technical quality assurance.

The Geoscience Data Journal, recently launched by Wiley-Blackwell and the Royal Meteorological Society, is a testing ground for PREPARDE to address questions around what ‘trustworthy’ means in this context, how peer review should work, and how publisher and repository workflows can be effectively integrated.

Both Sarah Callaghan and Peter Doorn took up the theme of trust in the post-conference workshop on Data Publishing. A report on that will follow here soon.