Because good research needs good data

Bringing data into the open repository

DCC and ICPSR co-organised a workshop at OR2012 in Edinburgh on July 9th, attended by more than 60 delegates. This article gives an overview of the presentations and discussion sessions. It picks out three themes: clarity on repository boundaries, ground rules for collaboration, and handling the handover of data to repositories.

A Whyte | 17 July 2012

Research data, and how repositories can better cope with it, was a hot topic at OR2012 in Edinburgh last week. Research data has been rising up the Open Repositories conference agenda for a few years now. At this, my first OR conference, there were two well-attended presentation sessions on “research data management and infrastructure”, plus a couple of other highly relevant sessions: one on digital preservation and another on “non-traditional content”. We also had a large room packed to the gunnels for the DCC-ICPSR workshop on “Institutional Repositories and Data, Roles and Responsibilities”.

This workshop was facilitated by Kevin Ashley and organised by DCC’s Monica Duke, Graham Pryor and myself. Our co-organisers were Jared Lyle from ICPSR, fellow traveller Ann Green, and Gregg Gordon of SSRN (Social Science Research Network), a US-based repository, alongside three UK universities involved in the JISC Managing Research Data programme: Chris Awre from the University of Hull, Cathy Pink from the University of Bath, and Sally Rumsey from Oxford’s DaMaRo project. JISC’s Simon Hodson didn’t let a recent leg injury dissuade him from coming along and joining the discussion. The slides are all available on the event page.

First up, Graham Pryor reflected on the DCC’s programme of ‘Institutional engagements’, which is helping UK universities face hard questions about justifying investment in institution-wide research data repository capabilities. How far should these go in supporting researchers’ curation needs and the data resulting from their activities? Should the institutional repository be a ‘lender of last resort’, a home for orphaned data that fills the gaps left by national and international disciplinary repositories? Can it be expected to deal with all aspects of managing all research data, or should it focus on showcasing catalogues of metadata about the assets it is responsible for, outsourcing curation and preservation to the emerging range of commercial players?

Jared Lyle and Ann Green took us through early findings from their survey of institutional repositories, conducted in March/April this year in line with their aim of building partnerships between IRs, social science support services, and domain repositories. Responses, mainly from the US, identified challenges and the services repositories wanted to help address them. Format migration, data recovery, media recovery and cost estimation topped the list, with help on appraisal and preservation policies also ranking high. Metadata tools were a low priority for the IR managers, who made up around a third of the sample, but a high one for the other respondents, mainly librarians and academics.

Questions about the affordability of RDM services to smaller institutions, and the need for collective approaches, segued neatly into the breakout groups, about which more later. Cathy Pink’s talk on the Research 360 project at the University of Bath gave a really useful summary of the case for even relatively small institutions to invest in institution-wide data management capabilities. First is the growing demand for access to publicly funded research data, reflected, for example, in the UK government’s “Innovation and Research Strategy for Growth”. The policies of funding bodies such as EPSRC are another major driver, with the university expected to bridge the gaps faced by researchers whose data has no natural home.

Thinking a little further ahead, another driver is the institution’s ability to respond to research assessments by linking valuable datasets to other research outputs, ensuring all are citable, have external visibility, and contribute to research impact. Pink outlined the University of Bath's plans to embed data repository capabilities in the existing infrastructure. She also covered a few of the questions arising: how to capture metadata from external repositories, what other details will be needed to enable reuse, and how to deal with varying funder requirements for long-term accessibility. Managing commercially restricted data raises the question of how much metadata can be published, and archiving consent forms for human-subjects data presents another set of security and digitisation challenges.

Sally Rumsey picked up the discoverability theme in her talk, asking what counts as ‘just enough’ metadata. That depends, for example, on whether we are talking about citing a dataset, discovery, compliance with funder requirements, assessing (re)usefulness, preservation, reporting or business intelligence. The context for the question is Oxford University’s Bodleian Library, which is rolling out Datafinder, a catalogue of research data drawn from various sources including Oxford's Databank repository. Rumsey presented a “minimum core” metadata set for Oxford, one part of a three-part recommendation alongside “contextual” and “optional” metadata. A similar distinction is made at Southampton between generic, disciplinary and project-level metadata. While there isn’t yet agreement on the specifics, the general principle is to minimise the barriers by capturing ‘just enough’ to get started, then keeping up a conversation with researchers.
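To make the tiered approach concrete, here is a minimal sketch of how a repository might separate a “minimum core” check from contextual extras. The field names are illustrative assumptions on my part, not Oxford’s actual Datafinder schema:

```python
# Illustrative sketch only: these field names are assumptions,
# not the actual "minimum core" set presented in the talk.
CORE_FIELDS = {"title", "creator", "publication_year", "identifier"}

def has_just_enough(record: dict) -> bool:
    """Check whether a record carries the minimum core metadata.
    Contextual and optional fields can be filled in later, through
    an ongoing conversation with the researcher."""
    return CORE_FIELDS.issubset(record)

record = {
    "title": "Survey of repository managers",
    "creator": "A. Researcher",
    "publication_year": 2012,
    "identifier": "doi:10.xxxx/example",    # placeholder, not a real DOI
    "methodology": "online questionnaire",  # contextual, beyond the core
}
```

The design choice this illustrates is that deposit is gated only on the core set; everything else is additive, which keeps the initial barrier low.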

Chris Awre picked up this thread, identifying various catalysts for the University of Hull IR to shape its role in supporting data: as a backup to data sharing via a project web portal; as an access-oriented alternative to a subject repository; and as a familiar point of reference for conversations with researchers on data management planning.

Gregg Gordon of SSRN gave the final short talk, and a different perspective. The Social Science Research Network is a for-profit organisation operating one of the largest pre-print archives on a ‘freemium’ business model, and is considering how data fits into that. Depositors can currently submit almost any form of article, which is then freely accessible on an as-is basis; SSRN does not seek or clear licensing rights, nor does it offer any preservation capability. Versioning has been available to depositors for several years, and while the daily rate of submissions has reached 60k, the rate of revisions has surpassed it. SSRN’s premium offering comprises ‘networks’ of editor-selected content on various subject themes, together with freely available metrics-based navigation. How SSRN might accommodate data remains to be seen, but Gordon aims to identify ‘prestigious’ content through usage-based metrics and an expanded role for repositories in peer review.

The presentations were organised in two short sessions, each feeding into breakout groups and a final report-back and discussion. In an online poll before the workshop, registrants selected six themes from a list of twelve proposed by the co-organisers, and two breakout groups discussed each theme. As these discussions inevitably overlapped with each other and with the plenary sessions, I’ve picked out some recurring themes below.

Clarity on repository boundaries: providing information on the scope of repositories, whether institutional or subject-based, and better advice about what to deposit where, was thought likely to encourage deposit. This means tracking a changing picture, as repository collection policies evolve. One case in point is the History Data Service: since the demise of the Arts & Humanities Data Service and its relocation to the UK Data Archive, HDS has narrowed its focus to studies with a socio-economic theme. Institutional collection policies may also change. Funding bodies are encouraging institutions to collaborate; this, and specialisation, may be essential to avoid a ‘digital divide’ growing between research-intensive universities and the smaller colleges or institutes unable to match their levels of provision.

Ground rules for repository collaboration: some discussions tackled the limits of institutional responsibility, and the more specific limitations of IRs. The view that institutions are natural guarantors of long-term stability was repeated here. It was felt that Libraries, IT Services and Research Offices have individual and mutual vested interests in making data work as an institutional asset, and in preparing people for stewardship roles. IRs are not necessarily the appropriate or only ‘system’; research information systems are expanding in functionality. Both have limitations in handling disciplinary metadata and the complex relationships between data files and related contextual information, which may be better handled by subject-based repositories. Even where there is a case for institutions to offer a more limited data catalogue service, coordination between the various repositories is still needed. At one end of the spectrum are datasets from large collaborative projects, deposited in a number of repositories; at the other are unfunded research projects, where the institution may be the natural first home for orphaned datasets but not necessarily the best or last.

Handling the handover: while some researchers have well-established workflows that encompass depositing data in a subject archive, they are a minority. Ann Green reminded us of Chris Rusbridge’s observation that preservation is a ‘relay race’, although that suggests rather more clarity about the handover of research data to institutions than currently exists. Some of this is policy: a clear institutional RDM policy may be an effective first step in the workflow, if there are services to back it up. Technology is shaping the handover point: versioning support (e.g. via SWORD2) may make repositories attractive environments earlier in the research lifecycle, and what would really help here is further tooling to embed deposit in the analytic tools researchers use. Some issues are administrative, such as exactly when DOIs should be assigned if content is embargoed. Others are social: the need to ‘walk the walk’ on matters of trust was emphasised, together with the need to avoid ‘best practice sharing’ becoming the enemy of ‘good practice sharing’.
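On the DOI-and-embargo question, one possible policy is to reserve the identifier at deposit, so it can already appear in citations, but treat it as publicly registered only once any embargo has lifted. A minimal sketch of that state logic, with function and field names that are my own illustrative assumptions rather than any repository’s actual API:

```python
from datetime import date
from typing import Optional

# Hypothetical sketch of one possible embargo policy; not a real
# repository API. The DOI is reserved at deposit and only counted
# as publicly registered after the embargo date passes.
def doi_state(deposited: bool, embargo_until: Optional[date],
              today: date) -> str:
    if not deposited:
        return "none"
    if embargo_until is not None and today < embargo_until:
        return "reserved"   # DOI minted, but the record is restricted
    return "registered"     # DOI resolves to an openly accessible record
```

The point of the sketch is that “when is the DOI assigned?” and “when does it resolve openly?” can be separate events, which is one way to square early citability with embargo restrictions.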

Some points came up that don’t fit neatly under the above, for example dealing with software (which another workshop looked at), not to mention physical artefacts, and there were no doubt other themes I have missed. This was a wide-ranging discussion, almost a conference within a conference. Those involved had diverse experience of managing repositories, but the majority were, unsurprisingly, involved in managing ‘end of cycle’ research outputs and relatively new to supporting data. That said, I think the workshop went far towards its aim to “…map points of consensus and concern around repository roles in supporting data reuse and developing the infrastructure for that.”

Laura Molloy has a great overview of how research data management was addressed in the rest of OR2012 on the JISC Managing Research Data ‘evidence gatherers’ blog.