
Birds of a Feather session at IDCC15: Building infrastructure for scientific data

A report on the Birds of a Feather session on Building Infrastructure held on Day 1 of the International Digital Curation Conference, 2015

Monica Duke | 10 February 2015

Kevin Dyke (Terra Populus) and Dharma Akmon (SEAD), conveners of the Birds of a Feather session at IDCC15


The Birds of a Feather sessions were a new feature of the IDCC conferences, introduced in 2015. I attended the session on 'Building Infrastructure for scientific data: Contrasting curation approaches across the lifecycle', in which practitioners compared notes on the tools they are providing for researchers to curate their data. The conveners explained the aim of the session as "to bring people together to talk about successes in building infrastructure and effectively engaging scientists in data curation and overcoming challenges". The session threw up a number of examples of tools being developed with, and employed in, the community to help data collectors and curators document and share data. Successful features and approaches were shared and compared, and some remaining challenges were discussed.

The first two examples of tools were provided by the session chairs. Terrapop (Terra Populus) gathers pre-existing data and processes them to a shareable standard, adding metadata. The service identifies data known to be useful to researchers to add to the collection; these can include demographic datasets and environmental datasets, tied together through geographic data. When historical datasets have little metadata, an attempt is made to retrospectively determine the processes used to generate the data, and associated articles are mined for metadata that could be added to the data. The example of curation here is one of adding value through metadata and documentation, representing later life cycle curation.

In contrast, SEAD focuses on active and social data curation, with a tool aimed at the earlier stages of the data life cycle. The social tools engage teams in good data management by offering a space for cross-team working and sharing. Self-interest (rather than selfless motivation) drives the use of the tool, as the teams are motivated by a need to share within or across teams. Team-controlled access to deposited data is a feature, along with automated metadata generation. Metadata fields can be modified to allow flexibility. The recording of data and metadata happens as close to collection or creation as possible, but the metadata can also be built up gradually, preparing the data for later transition to a repository.

Other examples of tools being developed were mentioned by the participants:

  • An RDBMS with features for editing and annotating active data, at Oxford University.
  • The University of St Andrews in the UK is engaging with groups of researchers (e.g. chemists) to build metadata capture systems.
  • In Australia, the University of New South Wales has used a policy of requiring a DMP before storage is allocated to motivate engagement with data curation.
  • One system introduced a 'knowledge management' layer to capture contextual information, such as email exchanges about the data.

Finally, the work of the RDA group on long tail data, which is building a catalogue of tools and efforts engaged in this type of work, was recommended.

In terms of challenges, one area of discussion was around data requiring spatial and temporal tagging. Although referencing is best done at collection, some success has been achieved by mining text at a later date. Time tagging can be ambiguous, for example when a reference to 'yesterday' is discovered. When repurposing repository tools, the changing ways in which time has been recorded can present a challenge. Administrative boundaries for geographic areas change over time, so a snapshot of the boundaries at a given point in time is required. It was remarked that vocabularies for geography tend to fall into four types (place names, administrative boundaries, postal codes and map-related). Are there similar categories for how time is recorded? The argument was made that, overall, spatial and temporal tagging give context and help make resources findable, and are worth adding.
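The two tagging problems raised here, relative time references and boundaries that change over time, can be illustrated with a minimal Python sketch. It is not taken from any of the tools discussed in the session: the function names, the small vocabulary of relative expressions and the boundary versions are all hypothetical, and a real text-mining pipeline would need far more than this.

```python
from datetime import date, timedelta
from typing import Optional

# Hypothetical offsets (in days) for a few relative expressions.
RELATIVE_OFFSETS = {"today": 0, "yesterday": -1, "tomorrow": 1}


def resolve_relative_date(expression: str, anchor: Optional[date]) -> Optional[date]:
    """Resolve a relative expression against the date the text was written.

    Returns None when the expression is unknown or no anchor date is
    available, so the ambiguity stays visible instead of being guessed away.
    """
    offset = RELATIVE_OFFSETS.get(expression.strip().lower())
    if offset is None or anchor is None:
        return None
    return anchor + timedelta(days=offset)


# Hypothetical boundary snapshots: (valid_from, valid_to, version id).
# valid_to is exclusive; None means the snapshot is still current.
BOUNDARY_SNAPSHOTS = [
    (date(1990, 1, 1), date(2000, 1, 1), "district-v1"),
    (date(2000, 1, 1), None, "district-v2"),
]


def boundary_version_on(day: date) -> Optional[str]:
    """Return the boundary version that was in force on the given day."""
    for valid_from, valid_to, version in BOUNDARY_SNAPSHOTS:
        if valid_from <= day and (valid_to is None or day < valid_to):
            return version
    return None
```

In this sketch, 'yesterday' can only be tagged when the date of the surrounding text is known, and a record is matched to whichever boundary snapshot was valid on its date rather than to the current boundaries.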

Suggestions for overcoming challenges included:

  • Use of recommender systems. This feature can be used to sell the idea of a new system. SEAD employs some recommender systems to help connect people and groups based on time and space.
  • Direct engagement with groups creating data is essential.  The data creators or others in the group with a responsibility for getting data ready should be engaged.  Ongoing conversations about getting data and making it ready for deposit provide opportunities for engagement.
  • Offering help with packaging and deposit through the intervention of a curator is often welcome. Having the curator in the loop interacting with scientists earlier helps them determine what needs to be captured, so they become better informed and prepared.
  • One approach is to allow self-deposit, with a follow-up to submission by curators, who review the data sets and begin an exchange with researchers to put together the documentation.
  • Although access to a collaboration system can be a driver, the need to learn a new system, create accounts etc. can be a barrier that must be overcome to motivate a switch from other ways of sharing (e.g. Dropbox, email).
  • Repositories that bring data together sometimes need to be generic.  However, project spaces which are specific and support disciplinary standardisation need to be provided alongside.
  • When building global collections, much work goes into building the trust needed to release data. Reluctance from countries was encountered even when there were no clear security or commercial concerns about the data.
  • Linking to help support discovery (alongside search) can be a driver for entering data.

The slides introducing this session are available.