Data Publishing - Libraries & repositories presentations at IDCC

Key points from the three talks in this session (Parallel B, session 1), which each considered the changing relationships between domain repositories and other actors - university libraries and publishers.

A Whyte | 19 March 2014

Towards a symbiotic relationship between Academic Libraries & Disciplinary Data Repositories: A Dryad & University of Michigan Case Study – Jennifer Green

The session began with some speculative thinking presented by Jen Green, exploring potential partnership roles between institutional and domain data repositories. This arose from University of Michigan Library’s deliberations on how to get the most out of local infrastructure investments, by leveraging researchers’ preference for aligning with repositories in their own domain. Michigan took Dryad as a case study on the basis that it could offer distinctive opportunities to academic libraries.  Green also referred to last year's ICPSR and Sloan Foundation report, and argued that domain repositories more generally offer certain advantages to academic libraries, which should consider them as potential collaborators rather than simply as a competing choice for researchers.

Green gave a helpful fly-by overview of Dryad, though there must be few in the IDCC community not already aware of the basics. Emerging from a collaboration between Duke University and University of North Carolina at Chapel Hill, the Dryad repository holds datasets relating to published articles, initially from journals in ecology and evolutionary biology and more recently broadening out to other science and medical domains.

A large part of the Dryad model is that it integrates data deposition with journal article submission, aligning with journal mandates and endeavouring to make the deposit process as painless as possible for researchers. It has gained the support of both individual researchers and journals, growing into a membership organisation drawn from publishers, professional associations, and learned societies. Green argued that this also offers opportunities for academic library membership, for example:

  • Gaining a stake in shaping the repository’s development
  • Covering the cost of Dryad vouchers so that selected target groups, e.g. younger researchers, can deposit their data
  • Performing local advocacy, advisory and outreach to remotely support ingest to Dryad, through internships and graduate student
  • Harvesting selected data from the domain repository as well as pushing data to it

Green proposed that the benefits to the institution would include greater visibility for research outputs, as already enjoyed through established relationships with social science archives (e.g. ICPSR). There may also be drawbacks; for example, Dryad cannot currently be searched by institutional affiliation, and its lightweight curation model may not offer enough support for some domains’ more specialist requirements. All the same, her main message was that, provided we can make domain and institutional repositories work together effectively, Dryad and others offer great potential for academic library service development, and more effective data discovery and reuse.

Library & Researcher Collaborative Data Publishing: the Southern Voting Project – Victoria Mitchell, University of Oregon

Victoria Mitchell gave another example of how small-scale collaborative projects can offer templates for tackling the ‘long-tail’ of small-science research data, this time in the social sciences. The project used data publication as a focus for helping researchers over the barrier to selecting data to share and making it shareable.

University of Oregon Library staff have led a collaboration with research services and IT to set up a digital scholarship centre, establishing several data librarian posts to cover Open Access publishing, institutional repository, digital collections and interactive media development. In the smaller Southern Voting Project described by Mitchell, these staff worked with Political Science researchers (a junior faculty member plus two grad students) to set up a web-based resource using the institutional repository as the back-end store for data.

Mitchell recounted how the project had integrated standard or freely available tools to help meet a specific research use case, and then reflected on what could be scalable to meet researchers’ needs generally. The use case here was an online resource to help political scientists analyse the relationships between US State and Congressional-level policies and changes in African-American voting following the 1965 Voting Rights Act. Voting records and demographic data were collected from state and university archives, and integrated using a variety of tools: the standard campus content-management application (WordPress), a free web-based mapping tool (GeoCommons), plus standard software for managing stats data (Excel and Stata). Student RAs were trained in depositing the resulting data and codebooks into University of Oregon’s institutional repository.

The project concluded that the expertise offered to researchers on metadata and tool support was scalable to some extent, and identified several difficulties that needed to be overcome: using GeoCommons to work with datasets embargoed until publication of the lead researcher’s paper, and (more predictably) issues around copyright and data licensing. The resource was successfully produced despite these problems, and Mitchell talked of plans to solicit peer review of the data, which is now available online.

Guidelines on recommending data repositories as partners in data publication - Jonathan Tedds, University of Leicester & Sarah Callaghan, British Atmospheric Data Centre

Last but not least, Jonathan Tedds presented guidelines that emerged from the Jisc project PREPARDE - Peer REview for Publication & Accreditation of Research Data in the Earth sciences. The guidelines (in which I should declare an interest as co-author) are aimed at journal editors and others looking to recommend a data repository for depositing data underlying articles submitted for publication.

Tedds aligned the rationale for these guidelines with the policy trends favouring openly sharing data, especially data that would enable published findings to be replicated. Echoing Geoffrey Boulton in the Royal Society’s Science as an Open Enterprise report, he noted its call for scientists to “make a first step to intelligent openness” by “making the data underpinning a journal article concurrently available in an accessible database” or risk being accused of malpractice. Tedds illustrated the scale of the problem identified in that report with Timothy Vines’ recent analysis of Zoology papers. Last year Vines and colleagues tried to track the data and found that only 37% of data reported from 2011 papers was still findable and retrievable, and only 7% from 1991.

Tedds went on to contrast the ‘crisis of reproducibility’ in science generally with the situation in Astronomy. Although much data is still not shared in that field, much is being reused; for example, papers based upon reuse of archived observations from the Hubble telescope now exceed those based on the use described in the original proposal. More generally, data journals such as the Geoscience Data Journal offer researchers opportunities to share data, describe its reusability, and get credit when reuse happens. Journals like that and Open Health Data are partnered with repositories they rely on to manage and preserve data.

The critical importance of maintaining the link between articles and datasets was the driver for PREPARDE to come up with its guidelines, which are about how journals can trust repositories to maintain that link. This is well-trodden ground, as Tedds pointed out, with a range of Trusted Repository accreditation schemes now established, ranging from the Data Seal of Approval and ICSU World Data System scheme through to TRAC and the ISO 16363 standard. Although these cover a wide range of data management capabilities, Tedds claimed they miss certain specific things required for this form of data publication.

The PREPARDE Guidelines are shorter than this blog article, and highlight five general principles. They assert that, for data publication, a repository must be actively managed in order to:
1. Enable access to the dataset
2. Ensure dataset persistence
3. Ensure dataset stability
4. Enable searching and retrieval of datasets
5. Collect information about repository statistics

DCC is currently adapting the guidelines so they can be used by institutions to offer advice on data repositories to researchers, e.g. through academic librarians. More generally their future is open. What role might these guidelines play in future? Unfortunately that didn’t get discussed much as lunch intervened. One answer may be that they will become redundant once the Trusted Repository standards have had a little more time to take off. Alternatively, and this is my own view, there is a continuing place for guidelines like these to pick out those factors that best fit particular use cases.

The full-blown accreditation standards will always be complex, designed so specialist assessors can do the heavy work and then let the kitemarks speak for themselves. But trust is a two-way thing, and accreditation schemes can’t possibly cover every aspect. Perhaps these guidelines are to data publication what glossy magazine “can you really trust your partner” articles are to romance: to be taken lightly, no substitute for a pre-nuptial agreement if you really want to go that far, and probably more likely to be read. Perhaps they will fall between these extremes and be taken more seriously as the basis for identifying ‘service levels’, to govern partnerships between repositories and others. The signs of that happening are good; the guidelines have already been applied in one notable case: Nature’s flagship data journal Scientific Data. More will surely follow.