Life cycle approach

Chris Rusbridge | 29 May 2007

Digital curation is maintaining and adding value to a trusted body of digital information over the life cycle of scholarly and scientific materials, for current and future use. It is our belief in the DCC that the curation of digital data requires this whole-of-life approach. Critical decisions on the curation of data are taken before the data are even created, often at the time the associated project is conceived or funding is sought. This is not least because curation requires resources that must be allowed for within the work plan. It is increasingly clear that for any project involving data of value, you should provide a data management plan within the project proposal (NSF, 2007).

Digital curation includes good management of data for current purposes, and also in many cases the preservation of those data for the long term. Long term preservation is not an essential part of curation in every case, although it is usually desirable (subject to appraisal and selection decisions). So we can think of curation as having two important components: “data publication”, the process of making current data available for use by contemporaries, and “data preservation”, the process of making those data available for future users.

Data publication recognises that more and more “reference” works are migrating into the digital domain as curated databases, and that increasingly these are data (or sometimes combinations of data and text) rather than pure text. There are interesting questions on when and how a dataset is “published”, such as those raised by Bryan Lawrence in the linked post, but I’ll skip those for now! Such reference datasets can change quite frequently, including the correction or deletion of information as well as the addition of new information. The tension between the requirements for integrity and stability and the need for change to promote accuracy brings special problems.
During development of the resource, if you are interested in the long term, you will have to ensure that the contextual and other information needed for preservation is gathered. This may have to happen even before you make firm decisions about preservation!

As your dataset stabilises, and in particular as it comes out of current use, it may be eligible for long term preservation. This is not an automatic choice, as resources are currently spread too thinly to preserve everything. It will be important in any data management plan to identify candidate archives for preservation that serve the appropriate “designated community” (perhaps your scientific discipline).

The locus of preservation remains an issue. Where one exists and is stable, a discipline-oriented repository should ensure that selected data are curated for the long term by domain experts (unfortunately, the recent decision by the AHRC to end funding of the AHDS sets a worrying precedent). The choice of repository will be important first in deciding what associated information is required, and later in treating the data at critical stages in the life cycle, such as changes in the technological infrastructure or the designated community.

Should there be no discipline-oriented repository, an institutional or other local repository may be appropriate… but may be quite unprepared to deal with data (on a recent search using OpenDOAR, only 5 institutional repositories in the UK claimed to include data, and some of those claims were false)! Keep at them…

In both cases, ensuring that users in your designated community can easily understand the meaning or information content of the data is critical.
The Open Archival Information System (CCSDS, 2002) has useful recommendations for supporting the long term understandability of information, and is applicable to digital curation for that reason, although at the moment we are short of good examples of some of its concepts, and it does require some interpretation!

CCSDS (2002) Reference Model for an Open Archival Information System (OAIS). In CCSDS (Ed.), NASA.

NSF (2007) Cyberinfrastructure Vision for 21st Century Discovery. Arlington, Virginia, National Science Foundation.