Jisc research data spring takes its next leap forward

The 15 projects funded through the first round of Jisc’s Research Data Spring programme competed for further funding in the second sandpit workshop. Here Angus Whyte reviews some of the work presented.

A Whyte | 21 July 2015

The fate of 15 projects funded through the first round of Jisc’s Research Data Spring programme was sealed over two days last week (July 13 and 14) at Imperial College, London. It was a great opportunity to sample what has already been done, and I was really impressed by how much these projects have accomplished in a mere 3 months. Below I’ve summarised some key points I took from the presentations, including links so you can judge for yourself.

This was workshop 2 in the Research Data Spring process, and as Kevin Ashley outlined in a previous blog, the projects that successfully bid for three months of initial development funding were making the case for up to £40k of additional investment. I’ve no connection to the judging, and assume some projects below won’t make it further. But those successful this time around will go through the same hoops again, most likely in December. Third-round winners get another £60k to build what are judged at that point to be the services most likely to succeed.

As in the first ‘sandpit’ workshop, project teams made four-minute pitches to a panel of judges on day two, who then grilled them for another four minutes. There was also the opportunity on the first day to share slightly longer overviews of project achievements, and their approaches to collaboration, and to catch up during the scheduled breaks. I picked out four themes based on those framing the original project ideas, omitting ‘shared services’ given that’s a common aspiration. So the themes I’m using are:

  1. Improving the researchers’ experience
  2. Deposit and sharing tools
  3. Seamless integration
  4. Analytics for research data activity

To finish I enter into the judgemental swing of things by picking some co-design favourites.

1. Improving the researchers’ experience

AMASED is one of two RDS projects with eyes on the enormous reuse potential of research data that cannot be publicly shared for privacy reasons. It exploits DataSHIELD, a toolkit originating in an EU-funded project. Basically the idea is to ‘take the analysis to the data’ without the user actually seeing the raw data at any point.  DataSHIELD is enabling bioscience researchers to analyse patient-identifying data across studies. Conventionally they would do these meta-analyses by pooling datasets from studies, e.g. those that share similar variables, to enable more powerful analysis of larger or richer samples.

The DataSHIELD software has great potential to facilitate data reuse, and save money in this context. That’s because data pooling conventionally requires some level of disclosure, which has stringent ethical constraints that also limit the opportunities for analysis.  Working around those ought to allow more people to do more analyses, and AMASED is testing how that can extend to other domains and applications.
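To make the ‘take the analysis to the data’ idea a little more concrete, here is a minimal Python sketch of the general principle. It is not DataSHIELD’s own interface; the function names, the toy study values and the minimum-count disclosure check are all assumptions made for illustration. Each data holder releases only aggregate summaries, and the analyst combines those to get the pooled mean and standard deviation without ever seeing a row of raw data.

```python
from dataclasses import dataclass
from math import sqrt

# Assumed disclosure threshold for this sketch; real services set their own rules.
MIN_COUNT = 5


@dataclass(frozen=True)
class Summary:
    """Aggregates a data holder is willing to release (no individual rows)."""
    n: int
    total: float
    total_sq: float


def site_summary(values):
    """Runs on the data holder's side and returns aggregates only."""
    if len(values) < MIN_COUNT:
        raise ValueError("too few records to release a summary")
    return Summary(len(values), sum(values), sum(v * v for v in values))


def pooled_mean_sd(summaries):
    """Runs on the analyst's side, combining summaries from several studies."""
    n = sum(s.n for s in summaries)
    total = sum(s.total for s in summaries)
    total_sq = sum(s.total_sq for s in summaries)
    mean = total / n
    # Sample variance rebuilt from the pooled sums, guarded against rounding.
    variance = max((total_sq - n * mean * mean) / (n - 1), 0.0)
    return mean, sqrt(variance)


if __name__ == "__main__":
    # Hypothetical measurements held by two separate studies.
    study_a = [1.0, 2.0, 3.0, 4.0, 5.0]
    study_b = [6.0, 7.0, 8.0, 9.0, 10.0]
    mean, sd = pooled_mean_sd([site_summary(study_a), site_summary(study_b)])
    print(f"pooled mean={mean:.2f}, pooled sd={sd:.2f}")
```

The result matches what pooling the raw data would give, which is the point: the combined analysis is possible although only counts and sums cross the boundary. A real deployment would need far more careful disclosure controls than a single minimum count.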

So far the project has been attempting to flesh out two further use cases, and seems to have made some progress on one. That is to enable readers to analyse confidential data referenced in data articles from publisher F1000 Research’s data journal. The other test use case is to allow analysis of a digitised books dataset held by the British Library. I missed their presentation of work done so far, but I guess an important challenge will be assessing how far automated data cleaning can compensate for researchers’ inability to ‘eyeball’ the data.

Jisc has funded many projects digitising cultural heritage collections over the years, but the potential to analyse these using high performance computing (HPC) has been largely untapped. HPC has not typically meshed with the kinds of questions arts and humanities researchers want to ask, or that new computational techniques can answer. The Enabling complex analysis of Large Scale Digital Collections project has made some impressive inroads here, working with digital humanities researchers to ingest a range of data types and sources, and to understand and facilitate the kinds of advanced query they want to perform.

The aim is to make humanities researchers into independent users of HPC facilities, by offering the service and ‘recipe knowledge’ to cater for the 5 or 6 searches that 90% of them want to do. The project team’s hackathon with researchers at UCL to use the university’s HPC facilities has already produced some intriguing analyses on the correlation of epidemiological data on disease outbreaks in the UK with mentions of these in historic texts. The project looks full of potential for HPC facilities to win points for impact, as much as for arts and humanities researchers to take advantage of real computational power. So if you're up for that, bluclobber is the word!

The Sound Matters project grabbed my attention for the way it has developed its framework for archiving and reusing sound recordings. Its aim is to facilitate “interrogation and relational playback of sound in its own terms”, as described on the project blog. The focus is on humanities researchers who use fieldwork to gather sounds and speech.

The project started with a straightforward and elegant model for understanding these researchers’ interactions with archived recordings. What really impressed me was the effort that has gone into working with this interdisciplinary community to take the framework further. This has ranged from interviews and a co-design event to an online community event, which was publicised through social media and used a ‘virtual board’ to capture comments on the framework. This then fed into a co-design event. There’s a lovely video here with a foretaste of prototypes to come if, as I hope, they get further support. Maybe there’s something about the medium conveying the message in a more interesting way, or maybe it’s the embedding of a remix and reuse culture in this discipline. Whatever the key ingredients, this project seems to me to effortlessly make the case for putting RDM support in the hands of researchers who really understand what it takes to make their work reusable.

Clipper addresses a gap in the tools available for working with digital audio and video clips. The emphasis is on ‘non-destructive editing’ of a/v materials. That means making it easy to work collaboratively with clips, annotate, organise and share them, while leaving the a/v data itself untouched. So far the project has been using a proof-of-concept demo to test demand, and get feedback, and has generated a fair amount of interest particularly in the cultural heritage sector.

I gather this tool shares some characteristics with tools used for marking up and analysing a/v data in linguistics and qualitative social research (CAQDAS tools). These cater for discipline- or methodology-specific analysis, but tend to limit the ability to share and reuse the results independently of the tool. Clipper is taking a relatively simple and generic approach to annotation. Potentially it seems a great solution for researchers who use digital a/v from archived collections. It steps aside from archiving the a/v data itself, or providing access to it, so researchers who create a/v research data may need alternative tools. They will also still need to find a repository suitable for hosting video collections. But for making clips and annotation shareable, Clipper may have hit on a niche. It would be great if the tool could go a step further and support citation, perhaps going some way towards meeting the recommendations for citing dynamic data coming out of the Research Data Alliance's Working Group on Data Citation.

2. Deposit and sharing tools

The idea of ‘sheer curation’ is to capture metadata on the fly from research activity, rather than trying to trawl it up afterwards from reluctant researchers. If an appropriate way to describe the challenge is 'inventing an elixir of youth for the research data archive' you might be sceptical that anyone could claim to have done so in the space of 3 months. But of course the two RDS project teams working towards that end – Artivity and CREAM – have been doing so for a lot longer than that. The projects are taking contrasting approaches; if I understand correctly Artivity is starting from the particular and working up to the general, while CREAM takes a broader ‘from first principles’ approach.

Artivity is about capturing contextual metadata from digital artists as they use the tools on their desktop. It builds on much previous work on frameworks to make the semantic desktop idea concrete, and implements one such framework for two graphic illustration tools, recording the user’s actions. The aim is to appeal to artists who want to self-archive a provenance record of their work, and to art and design historians who want to curate the techniques of digital artists.

The impressive demo showed metadata being captured automatically, with some nice splicing and dicing of stats about user activity. I hope the project can address the questions around take-up, which seem quite sizeable ones. On the ‘desktop’ side, can the project deal with the fact that artists and designers mostly use Adobe tools on Mac OS X, not open source Linux tools? On the semantics side, is the ‘context’ that’s being recorded consistently meaningful to art and design historians as a record of ‘technique’?

CREAM has some things in common with Artivity (as well as Thanasis Velios, who is on both teams). This project also explores the use of metadata to record research as it happens, on ‘live objects in real time’, rather than assigning it to an archived object, and it also conceives a general framework for managing and understanding that metadata. That framework is being shaped using Nine By Nine's Annalist, a prototype out-of-box tool for semi-structured data collection and organisation, that 'supports evolution, collaboration and reuse' of linked open data. The project partners are using this with their lab notebooks to work up four case studies in different disciplinary contexts: chemistry at Southampton and STFC, arts at University of the Arts London, and earth systems in Edinburgh.

These are mostly ‘big science’ contexts, where the processing of research data is strongly determined by the planning of its collection, ‘known unknowns’ about the data, and ‘known knowns’ about the equipment generating it. The project partners are also big names in the development of linked data tools and standards, the other element of the project’s framework. Automated validation and quality checking are the practical benefits they believe to be in reach. The project also seems to be filling a niche in provenance standards, essentially by demonstrating how it may be planned in advance (with no Tardis or other time machine involved).

DataVault could just as easily fit under other headings, as by making it easier to package data for archiving it is aiming to improve researchers’ experience, and doing that will involve ‘seamless integration’. DataVault takes a remorselessly pragmatic line. RDM has been focused on making more data more openly accessible, and on principle quite rightly. But institutions and many researchers face the reality that much of the data they need to archive cannot be shared publicly, at least until its shareability has been determined.

DataVault has set its sights on a basic level of archiving for the broadest range of data. The concept is basically to take research data that is no longer being actively used, describe and package it, and move it to cheaper storage such as Amazon Glacier, local tape backup infrastructure, or Arkivum.  This involves two main functions, the first being the interface to handle the storage and retrieval of data packages from an underlying storage management layer, and the other a policy engine to manage the packaging and description, and apply information security, retention and any other controls.

The emphasis in phase 1 of the project has been to build a working prototype, using BagIt to package files with DataCite metadata and some basic project information. I can't find the prototype online but you can follow the project progress here. In phase 2 the plan is to offer more storage options (some initial discussion with Arkivum was mentioned), tackle authentication and authorisation, demonstrate CRIS integration, and build the capability to handle retention and review policies. The policy engine sounds a fairly major piece of work, and probably critical to making the concept work. I suspect the non-technical issues around governing the data access terms and conditions will be the biggest challenge there.
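For readers who haven’t met BagIt, the packaging step is simpler than it may sound. The Python sketch below is not the DataVault code, which I haven’t seen; it is a minimal illustration using only the standard library, with hypothetical function names and example values. It writes the payload under data/, a bagit.txt declaration, a SHA-256 manifest, and a bag-info.txt holding a few descriptive fields. It does not attempt a DataCite record or a tag manifest.

```python
import hashlib
from pathlib import Path


def make_bag(bag_dir, files, info):
    """Write a minimal BagIt-style bag (hypothetical helper for illustration).

    files: mapping of relative payload path -> bytes content
    info:  mapping of bag-info.txt labels -> values
    """
    bag = Path(bag_dir)
    (bag / "data").mkdir(parents=True, exist_ok=True)
    manifest_lines = []
    for rel_path, content in sorted(files.items()):
        target = bag / "data" / rel_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(content)
        digest = hashlib.sha256(content).hexdigest()
        manifest_lines.append(f"{digest}  data/{rel_path}")
    (bag / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    (bag / "manifest-sha256.txt").write_text(
        "".join(line + "\n" for line in manifest_lines), encoding="utf-8"
    )
    (bag / "bag-info.txt").write_text(
        "".join(f"{label}: {value}\n" for label, value in info.items()),
        encoding="utf-8",
    )
    return bag


def verify_bag(bag_dir):
    """Recompute each payload checksum and compare it with the manifest."""
    bag = Path(bag_dir)
    manifest = (bag / "manifest-sha256.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        digest, rel_path = line.split(maxsplit=1)
        actual = hashlib.sha256((bag / rel_path).read_bytes()).hexdigest()
        if actual != digest:
            return False
    return True


if __name__ == "__main__":
    # Example values only; a real service would take these from its own records.
    bag = make_bag(
        "example_bag",
        {"results/run1.csv": b"a,b\n1,2\n"},
        {"Source-Organization": "Example University",
         "External-Description": "Data from a finished project"},
    )
    print("bag is valid:", verify_bag(bag))
```

The manifest is what makes cheap, slow storage tolerable: when a package comes back from tape or a cloud archive, the checksums show whether it is the same data that went in.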

Giving Researchers Credit

In a year of data policy mandates that are (probably wrongly) seen as big sticks to prod researchers with, this project promises carrots for all. The idea targets institutional repositories, and leverages publisher efforts to promote data papers. On the face of it it’s a simple concept – provide a one-click ‘submit a data paper’ button. This would (it is hoped) incentivise deposit by making it easier to transfer all relevant details from an IR to a relevant publisher platform for publication as a data paper. There, the data, metadata and methods can be worked up as needed, then peer reviewed, and indexed to increase visibility. If this takes off, everyone gets better metadata. IRs win by engaging more effectively with depositors, publishers win by getting more submissions, and the depositor wins by getting the effort to create and describe their data recognised by the community as a formal research output.

In its first three months the project surveyed publishers and repositories to gauge their interest, and reportedly found it in 16 publishers and 34 repositories. In practice, repository and publisher workflows can be tricky to integrate. The project is drawing on the Research Data Alliance’s working group on Data Publishing Workflows. This has recently been investigating the rich variety of repository deposit workflows (interest declaration – I’ve had minor involvement in that work). Instead of each publisher-repository pair attempting to negotiate their own solution, the project intends to develop a helper application, probably deploying the SWORD protocol. The end result should offer a ‘straw man’ reference implementation, hopefully beginning in the next phase by linking up Oxford University’s Bodleian Library with publisher F1000 Research.

Software reuse, repurpose and reproducibility

Despite wide awareness of their importance to research reproducibility, approaches to software citation and preservation are less mature than those for research data. This project (RRR) aims to help close that gap, initially by drafting guidance on the use of DataCite DOIs for software citation, and recommending changes to DataCite metadata to support software. These recommendations were presented at a recent DataCite meeting and are being considered further by that organisation’s metadata working group, as well as Force 11’s software citation group, which is impressively impactful for three months’ work.

The possibly more intriguing element of RRR is the software it intends to deliver and has started developing. This is intended to apply the principle that software citation entails keeping software in a runnable state, making good on the slogan “discoverability is runnability”. The project involves, and I gather follows its approach of offering tools to make available virtual machines in which experiments have been validated to run. The premise is (I believe) that preserving research software in a virtual machine makes the reproducibility more straightforward to address. If, as planned, the tools gain the acceptance of the project’s heavyweight partners we can expect to hear a lot more about the approach. I would hope that also addresses licensing issues for software held in the virtualised machine.

Unlocking thesis data

There seems an undeniable case for institutional repositories to make the data underpinning PhD theses persistently identifiable and available online. For one thing, the EPSRC requires it for studentships it funds. So this project might have been forgiven for pressing ahead and developing some form of plug-in without taking the time to get extensive input from stakeholders and users. However, taking that time is exactly what they have done. Although some of the dialogue they kicked off took place in workshops at IDCC and elsewhere, before their successful RDS bid, the team has gone on to survey the sector, and follow that up with mini case studies. Their report identifies the steps needed to apply PIDs to theses and have them harvested by EThOS, the British Library’s national thesis publication service.

There are of course many more differences between theses and journal articles than their length, as the project's work shows. There are more complex relationships to data and ‘supplementary materials’, different cultures and workflows around examination and deposit, and consequently different sets of user and depositor relationships to deal with. The project found no institution minting DOIs for theses. It also found that so far EThOS has not successfully harvested a single ORCID identifier from institutional systems. Their report considers various reasons for that, and how support could be improved in EThOS and institutional repositories, building on that available in EPrints and through CRIS systems such as PURE.

3. Seamless integration

Streamlining deposit

Most researchers’ experience with data sharing is through uploading the supplementary material for a journal article to its publisher’s platform. For the increasing number of publishers wanting to recommend a data repository, rather than deal with supp mat themselves, the issue is how to get the necessary handshaking set up so it’s straightforward for depositors. The Open Journal System (OJS) is one of the more widely used platforms with 25k installations worldwide, a potentially large market for a data repository integration solution. The Streamlining Deposit project is pitching at this with what seems a tempting offer: plug-ins for Figshare initially, and for Dryad, other repositories, and the Jisc Publications Router if they get further funding. This should save depositors’ time, and enable more journals to make their supplementary materials citable using the repository-supplied DOIs.

It was impressive to see that authors and editors have been interviewed to establish their priorities, and that a working plug-in has already been developed. One of the proposed next steps is integration with the Dryad data repository, which is well known for its model of linking data to articles. Dryad approaches the problem from the other direction, as it were, but the project should be pushing at an open door (no pun intended), building on the work already done to integrate the Journal of Open Health Data with this repository.

Filling the digital preservation gap

This project fills an elephantine gap in the RDM room: how institutions can effectively support preservation beyond the ‘bit level’ guarantee of file integrity, to deal with issues around format obsolescence (for example). Funders tell institutions ‘go preserve your data’, leaving it to them to work out what level of preservation will be needed to keep data accessible. The EPSRC of course also points people to DCC guidance, which includes many pointers to good practice. Perhaps wrongly, we have not been particularly forceful in identifying what specific preservation steps institutions should aspire to offer, knowing that even giving researchers a common steer on where to put data is a challenge. The slow development of services in this area I think reflects the complexity of the gap between what is desirable and feasible, rather than it being one of those issues nobody is prepared to talk about.

At any rate, this project is taking a pragmatic step beyond talking about it. Leading open source tool Archivematica offers comprehensive support for preserving digital content. The project is investigating how to use it effectively for research data, particularly where data repositories or catalogues are already being established. The main issues here are around the range of formats to be dealt with, and the workflows to deal with them. I won’t say any more other than to highly recommend the project report, and wonder whether collaboration with DataVault wouldn’t make sense.

Integrated RDM for Small and Specialist Institutions

The largest RDS project also has the longest name, which in full is: “A consortial approach to building an integrated RDM system - small and specialist". Part of this consortium is CREST (Consortium for Research Excellence, Support and Training), a sub-group of the GuildHE body, which represents the “small and specialist” institutions of the project title. The 22 institutions in CREST are a test bed for shared services, whose potential is being explored through three case studies, one focusing on ‘improving’ EPrints for Art and Design institutions, another on ‘streamlining’ PURE, and a third examining workflows for RDM.

The workflows study struck me as a very useful piece of work. The report, published by Arkivum, provides examples of different institutions’ approaches to RDM service delivery. Each identifies key points and outlines a scenario and high-level workflow.  I’d love to see DCC contributing more of these examples. I also look forward to understanding what the first two case studies accomplish, and will be very interested to learn further details of the level of integration envisaged in CREST.

4. Analytics for research data activity



The DMA in the title is ‘data management admin’ and the premise of this project is quite straightforward: RDM managers will sleep better at night if they have real-time data to hand on key indicators of compliance. These key indicators are identified in the five use cases addressed in Lancaster University’s prototype. Several deal with monitoring data management plans, others storage utilisation, and others compliance with RCUK requirements on data access statements.

The live demo showed stats drawn from real data in Lancaster’s systems, an achievement that has had to overcome the organisational barriers of working with different stakeholder groups as well as system barriers. Additional funding will bring further integration with DMPonline using the forthcoming API. The team also hopes to exploit synergy with the Filling the Digital Preservation Gap project, and include stats on preserved datasets in the dashboard.

The next phase would also bring further user studies, and it will be very interesting to see how the institutions waiting to test the dashboard find it adaptable to their needs. In principle the idea sounds a winner, and I hope the effort to contextualise the dashboard and its contents proves to be doable as well as sociologically fascinating!

Extending the Organisational Profile Document

The Organisational Profile Document or OPD cut its teeth in the project and is the brainchild of the Southampton team behind and, fresh off the wiki chopping board, Essentially it’s a template for research organisations to publish as linked open data some basic organisational characteristics. These were originally intended to ease the flow of information needed to locate expensive EPSRC-funded equipment, by making it more machine readable.

The 'extension' proposed in this project (by DCC’s own Joy Davidson) is to include details of RDM services in the profile, those details spanning a range of ‘hard and soft’ infrastructure not unlike the service model put forward in our How to Develop RDM Services guide. By pointing to some online evidence that these services exist, the idea is to make the OPD a vehicle for finding them more easily.

So the rationale is to make RDM services more discoverable by building a machine-readable collection of RDM service profiles. That's something I’d love to see DCC developing even if the proposal doesn’t make it through the next phase. We have a growing list of institutions that have self-penned portraits of their services in our ‘where are they now’ series of blog posts, so we know that institutions are looking to set out their stall, and want to make the results go further. Fingers crossed for that one.

And the winner is?

Similarities between the judging process in RDS workshops and BBC TV’s ‘Dragons’ Den’ panel of investors are difficult to avoid and probably not entirely coincidental. The process certainly forces projects to present their work in the best possible light, and in less time than most elevator rides on the London Underground. I’m personally not convinced that three months into a project is the best time to do that. Anyway, I’m not intending to pass judgement on the RDS process. The criteria the judges themselves used are available here and I'm sure they have more information to base their decision on than the pitches. Jisc's criteria could be summarised as:

  • Is the work done in scope for RDS?
  • Are the use cases clear?
  • Is there evidence of community engagement?
  • Is there a realistic sustainability plan?
  • Can the team deliver?
  • Are existing outputs clear, and are planned outputs feasible?

The judges and Jisc will have made their decisions on that by the time this article is online, so any favouritism on my part won't affect the outcome (I should be so lucky). Personally I was interested in how much genuine co-design has been carried out, and how much is planned. To what extent are the teams using any process to “empower, encourage, and guide users to develop solutions for themselves” (1)? The overviews on day one gave a little more indication of that than the pitches. So I didn’t find it particularly easy to tell, but if I was dishing out prizes (and excluding projects with any DCC involvement) four projects stood out as having engaged most with their prospective users and stakeholders:

  • Sound Matters
  • Enabling Complex Analysis
  • Unlocking Thesis Data
  • Integrated RDM for Small and Specialist Institutions

Some of the other projects were funded to get on and develop software, and may well have deeper engagement with users in phase 2, so no criticism is intended. I’ll be surprised if many of the projects go away empty handed, and don't envy those deciding who should be in that category.

(1) 'Co-design' Wikipedia entry. Retrieved from