Bryan Lawrence on metadata as limit on sustainability

5 December, 2008
Opening the Sustainability session at the Digital Curation Conference, Bryan Lawrence of the Centre for Environmental Data Archival and the British Atmospheric Data Centre (BADC) spoke trenchantly (as always) on sustainability, with specific reference to the metadata needed for preservation and curation, and for facilitation both now and in the future. Preservation is not enough; active curation is needed. BADC has ~150 real datasets but thousands of virtual datasets and tens of millions of files.

Metadata, in his environment, represents the limiting factor. A critical part of Bryan’s argument on costs relates to the limits on human ability to do tasks, particularly certain types of repetitive tasks. We will never look at all our data, so we must automate; in particular, we must process automatically on ingest. Metadata really matters to support this.

Bryan dashed past an interesting classification of metadata, which from his slides is as follows:
  • A – Archival (and I don’t think he means PREMIS here: “normally generated from internal metadata”)
  • B – Browse: context, generic, semantic (NERC developing a schema here called MOLES: Metadata Objects for Linking Environmental Sciences)
  • C – Character and Citation: post-fact annotation and citations, both internal and external
  • D – Discovery metadata, suitable for harvesting into catalogues: DC, NASA-DIF, ISO19115/19139 etc
  • E – Extra: Discipline-specific metadata
  • O – Ontology (RDF)
  • Q – Query: defined and supported text, semantic and spatio-temporal queries
  • S – Security metadata
The critical path relates to metadata, not content. It is important to minimise the need for human intervention, which means minimising both the number of ingestion systems (specific processes for different data types and data streams) and the types of data transformations required (the problem is validating the transformations). This in turn means that advice from data scientists TO the scientists is critical before creation; hence the data scientist needs domain knowledge to support curation.

An archive can choose NOT TO TAKE THE DATA (but the act of not taking the data is itself resource intensive). Bryan showed and developed a cost model based on 6 components; it’s worth looking at his slides for this. But the really interesting stuff was in his conclusions on limits, with 25 FTE:
“• We can support o(10) new TYPES of data stream per year.
• We can support doing something manually if it requires at most:
– o(100) activities of a few hours, or
– o(1000) activities of a few minutes, but even then only when supported by a modicum of automation.
• If we have to [do] o(10K) things, it has to be completely automated and require no human intervention (although it will need quality control).
• Automation takes time and money.
• If we haven’t documented it on ingestion, we might as well not have it …
• If we have documented it on ingestion, it is effectively free to keep it …
• … in most cases it costs more to evaluate for disposal than keeping the data.
• (but … it might be worth re-ingesting some of our data)
• (but … when we have to migrate our information systems, appraisal for disposal becomes more relevant)”
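As a rough illustration of how those order-of-magnitude limits might be turned into a triage rule, here is a minimal Python sketch. It is not from Bryan's slides: the names `Handling` and `triage` are hypothetical, and the exact cut-offs (4 hours for "a few hours", 5 minutes for "a few minutes") are assumptions, since the slides give only orders of magnitude. Reading the "modicum of automation" caveat as applying only to the o(1000) case is also one interpretation.

```python
from enum import Enum


class Handling(Enum):
    MANUAL = "manual"
    ASSISTED = "manual, supported by a modicum of automation"
    AUTOMATED = "fully automated, with quality control"


def triage(task_count: int, minutes_per_task: float) -> Handling:
    """Suggest a handling mode for a batch of curation tasks.

    The thresholds are assumed stand-ins for the quoted limits
    (about 100 tasks of a few hours, about 1000 of a few minutes,
    about 10K needing full automation).
    """
    # Around 10K things or more: no human intervention is feasible.
    if task_count >= 10_000:
        return Handling.AUTOMATED
    # Up to about 100 activities of a few hours each (assumed <= 4 h).
    if task_count <= 100 and minutes_per_task <= 240:
        return Handling.MANUAL
    # Up to about 1000 activities of a few minutes each (assumed <= 5 min).
    if task_count <= 1000 and minutes_per_task <= 5:
        return Handling.ASSISTED
    # Anything else exceeds what the quoted limits allow by hand.
    return Handling.AUTOMATED


if __name__ == "__main__":
    for count, minutes in [(50, 120), (800, 3), (800, 120), (20_000, 1)]:
        print(count, minutes, triage(count, minutes).value)
```

The point of writing it this way is that anything outside the two narrow manual cases falls through to automation, which mirrors the thrust of the list: manual work is the exception that has to be justified.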
Interestingly, they charge for current storage costs for 3 years (only) at ingest time; by then, storage will be “small change” provided new data, with new storage requirements, keep arriving. Often the money arrives first and the data very much later, so they may have to be a “banker” for a long time. They have a core budget that covers administration, infrastructure, user support, and access service development and deployment. Everything changes next year, however, with their new role supporting the Intergovernmental Panel on Climate Change, which will need petabytes or more.