JISC workshop takes data repository options further

30 October, 2012

Alternative solutions for depositing data in repositories were the topic of one session in the JISC workshop on Components of Institutional Research Data Services, 24-25 October in Nottingham. This was one of the more popular of the eight sessions, so much so that it had to be split in two. The half described here involved three pilots making headway with the ePrints platform: at the University of the West of England, Research Data @Essex, and the KAPTUR project. These were presented by Liz Holliday, Tom Ensom and Carlos Silva respectively. Following that, Stuart Lewis gave an enlightening update on various deposit scenarios in line for support through the SWORD initiative. This neatly summed up a common theme: how to move beyond a 'fire and forget' model of depositing data and offer richer options.

Liz Holliday was first, relating UWE’s experience in turning ePrints into a service for data. The project has emphasised ease of use, clarity and simplicity. In designing the workflows involved, UWE have extensively involved and consulted researchers, seeking to produce a service that is both useful and used by them. The UWE team is focused on complying with the emphasis in RCUK's OA policy on access to research data underlying publications, and doing that in the context of a small institution where research is a relatively small part of academic activity.

Their philosophy has been to develop first, make this collaborative, and deal with policy later. Involving a core group of researchers seems to have paid off in terms of confidence that researchers not only will use their system but also actually like it. It was interesting to hear that key issues have been how to get properly cited and who checks data at the end of projects. It will also be interesting to see how thorny issues around deposit agreements, embargoes and access restrictions are resolved, as a lot of the University’s research has commercial co-funders and, in health areas particularly, a strong need for governance of personal data.

Tom Ensom spoke next on RD@Essex, where he has undertaken development to adapt ePrints for data. The work has now reached the beta testing stage, with a core group of researchers involved throughout development. The project has also released a metadata profile that aims to provide a realistic core around which disciplinary metadata can be linked, drawing on DataCite plus INSPIRE and the social science standard DDI 2.1. Tom identified challenges in adapting a platform designed primarily for articles to represent the more complex relationships found in data. Issues remain in handling complex inter-dependencies, such as those found in geographic datasets, and multiple versions of ‘the same’ object. This is an area where SWORD2 is expected to help by offering standard ways of dealing with versioning.

Carlos Silva from KAPTUR outlined the technical approach being taken in this collaboration between arts institutions. This follows on from a Southampton-led project for arts and humanities researchers to upload data to ePrints. An instance of the repository platform is installed at each of the four partners: Goldsmiths, Glasgow School of Art, University of the Arts London and University for the Creative Arts. KAPTUR has trialled integration of ePrints with two alternatives catering for different deposit scenarios: Datastage and Figshare.

KAPTUR’s use of Datastage follows the now conventional two-stage model, aiming to handle active research data first, and then publish selected data outputs in a more repository-like environment. KAPTUR envisage a Datastage instance per institution, with ePrints taking the role occupied by Databank in the Oxford University Dataflow project. In parallel with this, KAPTUR are piloting integration with Figshare, leveraging its ease of deposit and multimedia file presentation features. This is an intriguing combination that will no doubt be closely watched by other institutions using ePrints. It will be interesting to see whether researchers treat Datastage and Figshare as alternatives or as complementing each other. How the flow of data and metadata can be tracked effectively across the three platforms will be another issue worth following.

Deposit workflows are of course the bread and butter of the ongoing work on SWORD, and Stuart Lewis gave a tour of the capabilities being considered for SWORD to handle a wide variety of transfer processes. Rather than thinking of deposit only as a ‘fire and forget’ mechanism, SWORD is now conceived more like a noticeboard on which a client can post items and then come along later and update, replace or remove them, so a data collection can be deposited using SWORD without having to be in a finished state.
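The noticeboard idea can be sketched in a few lines of Python. This is only a toy in-memory model of the behaviour described above, not the SWORD protocol or any real client library: the `Noticeboard` class, its method names and the `in_progress` flag are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Noticeboard:
    """Toy stand-in for a repository deposit endpoint (hypothetical, in memory)."""
    items: dict = field(default_factory=dict)
    _ids: count = field(default_factory=lambda: count(1))

    def post(self, files, in_progress=True):
        """Pin a new item to the board; it need not be finished yet."""
        item_id = next(self._ids)
        self.items[item_id] = {"files": dict(files), "in_progress": in_progress}
        return item_id

    def update(self, item_id, files):
        """Add files to an existing item, overwriting any with the same name."""
        self.items[item_id]["files"].update(files)

    def replace(self, item_id, files):
        """Swap the whole contents of an existing item."""
        self.items[item_id]["files"] = dict(files)

    def complete(self, item_id):
        """Mark the item as finished."""
        self.items[item_id]["in_progress"] = False

    def remove(self, item_id):
        """Take the item off the board."""
        del self.items[item_id]


board = Noticeboard()
item = board.post({"wave1.csv": b"first draft"})   # deposit unfinished work
board.update(item, {"readme.txt": b"notes"})       # come back later and add a file
board.replace(item, {"all-waves.csv": b"final"})   # swap the contents
board.complete(item)                               # only now is it 'finished'
```

The contrast with 'fire and forget' is that the client keeps a handle on the item (`item` here) and can return to it as often as needed before the collection is declared complete.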

Basically, SWORD takes ‘stuff’ from one place and delivers it somewhere else. But this might involve any of a proliferating range of sources and possible target systems. For example, we should expect to see deposits originating from a CRIS, a publisher’s workflow, or an electronic lab notebook. Target systems include institutional data repositories (which may or may not be the same IR that handles publications), publication-related repositories, and ‘staging’ repositories such as Datastage instances, as well as longer-established types like national data centres.

The idea of a ‘fingerprint’ is to match user requirements for the various combinations of source, target and data type (content, description or collection). Deposit scenarios, detailed on the SWORD blog linked above, can be expressed neatly using this concept, which has helped consideration of how SWORD can accommodate specific needs, such as those for handling large files. Strategies include deposit by reference, which works rather like the postcard that a courier might put through your letterbox to notify you of their intention to deliver. In this case the courier might be something like GridFTP, designed for delivering extra large packages.
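As a rough illustration of how a fingerprint and deposit by reference might fit together, here is a minimal Python sketch. Everything in it is hypothetical: the `Fingerprint` and `plan_deposit` names, the 2 GiB threshold and the example storage location are assumptions for the sake of the example, not part of SWORD.

```python
from dataclasses import dataclass
from enum import Enum


class DataType(Enum):
    # The three data types named in the fingerprint description above.
    CONTENT = "content"
    DESCRIPTION = "description"
    COLLECTION = "collection"


@dataclass(frozen=True)
class Fingerprint:
    source: str        # e.g. a CRIS or an electronic lab notebook
    target: str        # e.g. an institutional data repository
    data_type: DataType


def plan_deposit(fp, size_bytes, location, threshold=2 * 1024**3):
    """Choose how to deposit: send the bytes, or send a 'postcard'.

    The threshold is an arbitrary illustrative value (2 GiB).
    """
    if fp.data_type is DataType.CONTENT and size_bytes > threshold:
        # Deposit by reference: tell the target where to collect the data,
        # leaving the heavy lifting to a bulk-transfer 'courier'.
        return {"fingerprint": fp, "mode": "by_reference", "fetch_from": location}
    return {"fingerprint": fp, "mode": "by_value"}


fp = Fingerprint("electronic lab notebook",
                 "institutional data repository",
                 DataType.CONTENT)
# Hypothetical location of a 50 GiB file held on separate storage.
plan = plan_deposit(fp, 50 * 1024**3, "gsiftp://storage.example.org/run42.tar")
```

Here the large file yields a plan in `by_reference` mode, so only the location travels with the deposit; a small file with the same fingerprint would simply be sent by value.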

SWORD will no doubt be key to making sure that research data judged worthy of broad attention is available across institutions and publishers best able to keep it discoverable. It will be very interesting to see how it gets taken further in data publication initiatives such as the new JISC project PRIME.  

[Text corrected 31/10/12. Thanks to Stuart Lewis for pointing out that SWORD is not a distributed system that includes the ability to advertise needs]