
Scientific Metadata

Clive Davenhall, National e-Science Centre

Scientific data are generated by experiments or observations. In order to be interpreted, or even accessed, they must be accompanied by auxiliary information, ranging perhaps from the experimenter and the time and place at which the experiment was conducted to arcane calibration details.

This auxiliary information constitutes the metadata for the dataset. It is similar to the metadata needed for non-scientific data, but with some distinctive features; notably it is likely to be more extensive and less standardised.

In order to be properly interpreted by either humans or software, metadata items need to be precisely defined. Similar quantities are often subtly different. Numeric values are meaningless unless their units are known. Scientific data often have a small and specialised initial user-community. If the data are to be re-used outside this community, additional explanation or exegesis may be required.
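As a rough illustration of why a value needs both a unit and a definition, the following Python sketch attaches them to each metadata item. The class `MetadataItem`, the function `to_seconds`, the conversion table and the example quantity are all hypothetical names chosen for this sketch, not part of any particular standard.

```python
from dataclasses import dataclass

# Hypothetical conversion factors to seconds; illustrative only.
SECONDS_PER_UNIT = {"s": 1.0, "ms": 1e-3, "min": 60.0}


@dataclass(frozen=True)
class MetadataItem:
    name: str         # short identifier for the quantity
    value: float      # the numeric value as recorded
    unit: str         # unit of the value; without it the number is meaningless
    definition: str   # precise statement of what was measured


def to_seconds(item: MetadataItem) -> float:
    """Convert a duration item to seconds, refusing unknown units."""
    if item.unit not in SECONDS_PER_UNIT:
        raise ValueError(f"unknown unit {item.unit!r} for {item.name}")
    return item.value * SECONDS_PER_UNIT[item.unit]


# Two records of the same quantity that look different until units are applied.
exposure_a = MetadataItem("exposure", 2.5, "min", "shutter-open time per frame")
exposure_b = MetadataItem("exposure", 150.0, "s", "shutter-open time per frame")
```

Here the bare values 2.5 and 150.0 only become comparable once `to_seconds` has applied the recorded units, and a value with an unrecognised unit is rejected rather than silently misread.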

Even for their initial user-community, scientific metadata are often notoriously incomplete. Additional quantities and assumptions necessary to interpret the data may initially be recorded only on scraps of paper, be hard-coded into analysis software, or exist only in the experimenter's head. Considerable effort must be made to capture all this information if the data are to be retained for posterity or made available to a wider community of users.
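One simple way to catch such gaps before a dataset is archived is to check each record against a list of required fields. The sketch below assumes a hypothetical list `REQUIRED_FIELDS` and a hypothetical record; a real community would take the list from its own metadata standard.

```python
# Hypothetical set of fields a dataset record must carry; illustrative only.
REQUIRED_FIELDS = ("experimenter", "date", "location", "instrument", "calibration")


def missing_metadata(record: dict) -> list:
    """Return the required fields that are absent or empty in record."""
    return [f for f in REQUIRED_FIELDS if not str(record.get(f) or "").strip()]


# An invented record: one field left blank, one never recorded at all.
record = {
    "experimenter": "A. Researcher",
    "date": "2011-03-14",
    "location": "",  # known to the experimenter, never written down
    "instrument": "spectrometer-1",
}

gaps = missing_metadata(record)
```

A check of this kind only finds fields that someone has already thought to require; assumptions that exist solely in the experimenter's head still have to be elicited by hand.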

Finally, standards are important to promote interoperability. Because the user-base is small and specialised, standards can be informal and specialised, and can change rapidly. It is therefore necessary to monitor and track them.


Key Points

  • Scientific data are generated by experiments and observations as part of the scientific process. That is, they are inherently experimental rather than routine. This point matters less for large, long-lived, collaborative experiments, but it remains inherent to the scientific process.
  • Scientific metadata are likely to be more extensive and less standardised than non-scientific metadata.
  • Scientific datasets are often generated with incomplete metadata. Considerable effort may be required to ensure that all the metadata necessary to make the data re-usable are gathered and ingested.
  • Scientific user-communities are often small and specialised. If data are to be used outside their original communities, or preserved for an extended period of time, additional exegesis may be required.
  • Standards, both syntactic and semantic, are needed to facilitate interoperability and re-usability. Physical quantities need precise, documented definitions, and numeric values must have known units.
  • Standards may be specific to the specialised community that generated the data. Because the communities are small, standards (and other practices) can evolve rapidly and so must be tracked.