Interview with Neil Thomson, Natural History Museum
Neil Thomson has worked at the Natural History Museum for longer than he cares to remember, most recently as Head of Data and Digital Systems with a particular interest in the development and use of information standards. The Natural History Museum promotes the discovery, understanding, enjoyment, and responsible use of the natural world through its exhibitions, collections and research.
- What does Digital Curation mean for you?
- How do you do it for your data?
- Have you considered the OAIS model, and, if so, has it been useful for you?
- What tools are in use now to manage NOF-digitised data and Science Museum data that is useful for research and/or learning and teaching? - to import data? - to store data? - to locate data? - to retrieve data? - to package data to send out?
- What tools are needed for digital curation in addition to current tools? Or instead of current tools? Which are most important, most needed?
- How long before current hardware/software will be replaced?
- What standards are in use or needed?
- Are there any other standards you are considering? For the video material, have you looked into standards developed for and by broadcasting organisations? Any other guidelines needed? What about metadata schemas? Of all these, which are most important, most needed?
- You said you are in the process of collecting more information from these data creators through a survey. It would be useful if you could list the file formats in use, but most important is to bring out key issues relating to file formats. In particular: - Which are most important, and why? - Are any proprietary formats? - What changes do you foresee?
- How is your scientific research data reused? In new studies? In creating resources for learning and teaching? For this reuse, is it read by programs or by people?
- What will be needed in 50 years to re-use your research data?
- How might DCC support digital curation work? For example, could it help by providing: - Services, e.g. advice or others? - Registries? - Professional development opportunities?
Q1. What does Digital Curation mean for you?
We at the Natural History Museum (NHM) are at the early stages of recognition of the looming problems associated with digital data.
We have formed a Digital Sustainability Group. This was originally called a Digital Preservation Group, but we decided that that name gave the impression that it was only concerned with data for which the main purpose had been fulfilled, like a paper archive. However, we recognised that there needed to be active management of current data in order to fulfil the aim of ensuring that the data will be continuously available into the indefinite future.
The Group consists of our Assistant Archivist (main Archivist post is currently vacant), the Head of IT and myself, Head of Data and Digital Systems. This seems to cover the main domains that need to collaborate to do something meaningful here. We have met about three times and have agreed on the importance of the topic.
Q2. How do you do it for your data?
Incidentally, we are fortunate in that, being publicly funded, we have access to the National Digital Archive of Datasets (NDAD) as a repository of at least some of the data that we generate. The NHM, as you may know, is a major research institute in addition to the exhibitions for which it is better known. We have around 350 science staff all using computers in their work. This is in addition to a large administration and website, all generating digital data.
The Group has produced some "Next steps" for its own guidance (at this stage) and I attach these for background information since they give a sense of where we are right now. Our main discussion topics are:
- awareness raising throughout the organisation
- getting digital sustainability embedded in the culture, especially by making it an integral part of any new project. Useful here that we are about to roll out a new collections management system that will be used for all the science collections. The Head of IT is much involved and keen to ensure that preservation activities, such as contextual metadata, are a standard part of this.
- produce a strategy paper for senior management (initial headings in the First Steps doc)
- dataset survey: We agreed that it would be a good idea to prepare an inventory of what exists in the Museum, to get a handle on the scale of the problem and to start recording useful data on formats, and so on. We are approaching this in the order:
- what purposes do we want this to fulfil
- what data do we need to achieve that
- what questions do we need to ask in a simple questionnaire to get the data
- tie-ins: how various systems relate to each other, such as collections management systems (science and the library); digital asset management system (DAMS) (commercial and academic flavours possible); external queries made under DP Act, FoI and EIR; the CMS for the website.
Our survey will be widely promoted both top-down to get senior understanding and support (as a similar exercise some years ago failed because senior management did not want their staff wasting time listing databases when they should be doing research) and bottom-up to actually get the data.
Our view is that digital data will be subject to the same review process as paper data, with some (datasets in particular) requiring indefinite accessibility. How this will be funded is uncertain thus far. We need to be clearer about what will be involved.
The main purpose for the (meta)database will be so that we can run reports on the physical and data formats of NHM material found through the survey and match these against the entries in PRONOM. We hope that this will act as an early warning of material that is about to become technologically obsolescent. We can then work out a data migration strategy. A registry such as PRONOM is seen as vital.
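As a minimal sketch of the kind of report described above, the following Python assumes a hypothetical inventory of survey results and a hypothetical local extract of registry entries keyed by format name. The field names, the example records and the "move-by" dates are all invented for illustration; they are not taken from the NHM survey or from PRONOM itself, and this is not PRONOM's interface.

```python
from datetime import date

# Hypothetical registry extract: format name -> suggested "move-by" date.
# Illustrative values only; these are not real PRONOM entries.
REGISTRY = {
    "Format A": date(2008, 1, 1),
    "Format B": date(2030, 1, 1),
}

# Hypothetical survey inventory: one record per dataset found.
INVENTORY = [
    {"dataset": "beetle-measurements", "format": "Format A"},
    {"dataset": "herbarium-images", "format": "Format B"},
    {"dataset": "field-notes", "format": "Format C"},
]


def at_risk_report(inventory, registry, today, warn_days=365):
    """Return (dataset, format, reason) tuples for datasets needing attention.

    A dataset is flagged if its format is missing from the registry, or if
    the registry's move-by date falls within `warn_days` of `today`.
    """
    report = []
    for record in inventory:
        fmt = record["format"]
        move_by = registry.get(fmt)
        if move_by is None:
            report.append((record["dataset"], fmt, "not in registry"))
        elif (move_by - today).days <= warn_days:
            report.append((record["dataset"], fmt, f"move by {move_by.isoformat()}"))
    return report


if __name__ == "__main__":
    for row in at_risk_report(INVENTORY, REGISTRY, today=date(2007, 6, 1)):
        print(row)
```

Formats missing from the registry are reported alongside those nearing their move-by date, since an unknown format is arguably as much of an early warning as an obsolescent one.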
Advice on open standards would similarly be much appreciated. We have, for example, many long hours of video tapes of microscopic beasties which the owner wishes to transfer to a digital format. I would like to recommend an open standard for preservation purposes, but there doesn't seem to be one that also satisfies the research needs and since those have to come first, they are being transferred into a proprietary scheme. At least they will be digital and I have great faith that a migration strategy will become possible if one does not already exist. Much effort is being put into format interchange and even emulation, which may become important for some material.
Q3. Have you considered the OAIS model, and, if so, has it been useful for you?
OAIS: Yes I am very aware of this, but seem to be the only one. Therefore I have volunteered to give a presentation on what OAIS is about to the rest of the DSGroup and hope that we will be able to make a test implementation. This will probably not happen until the end of this year or the beginning of the next, now.
Q4. What tools are in use now to manage NOF-digitised data and Science Museum data that is useful for research and/or learning and teaching? - to import data? - to store data? - to locate data? - to retrieve data? - to package data to send out?
We are currently moving from the traditional museum model of having literally several hundred databases, all of which are different and most of which are in software from different vendors. We have an interim position but are actively moving to our preferred position of having a single system for all science data.
We aim to have one system for science (specimen) data, which is KE EMu; one system for library material management, which is Sirsi Unicorn; one system for archives information, which is DS CALM; and then a single DAMS, which is currently being specified.
The interim situation uses an in-house system called MILS (Museum Information Locator System). MILS serves 2 main functions. Firstly it is a unified front end to all of our disparate databases that are Web-enabled - currently around 45. So a single search returns results from all (or a selection of) the databases. This was because I thought that the move to a single, or low number of, database(s) for NHM data was unlikely. I am happy to be wrong on that.
The second purpose is as a data extraction and transformation tool and MILS will continue in this role. This is primarily valuable in providing subsets of our data in a format that the requesting audience requires (preferably in an XML application), but it will also be valuable in migrating our data from existing systems to the new KE setup or for system upgrades, and so on.
MILS therefore fulfils all the roles mentioned in your question above, regardless of the system. It also provides an OAI repository through harvested searching in addition to the distributed searching outlined above.
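To illustrate the extraction-and-transformation idea in general terms, here is a minimal Python sketch that exports a chosen subset of fields from a set of records as XML. It is not a description of how MILS works: the function name, the element names and the sample record are all hypothetical, and only the standard-library `xml.etree.ElementTree` module is used.

```python
import xml.etree.ElementTree as ET


def records_to_xml(records, fields):
    """Serialise the requested fields of each record as a simple XML string.

    `records` is a list of dicts; `fields` lists the keys to export, so a
    requester receives only the subset of data they asked for.
    """
    root = ET.Element("records")  # hypothetical wrapper element
    for record in records:
        item = ET.SubElement(root, "record")
        for field in fields:
            if field in record:
                child = ET.SubElement(item, field)
                child.text = str(record[field])
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Invented sample data, for illustration only.
    sample = [
        {"id": 1, "taxon": "Apis mellifera", "locality": "Kent", "internal_note": "x"},
    ]
    print(records_to_xml(sample, fields=["id", "taxon", "locality"]))
```

Passing an explicit list of fields keeps internal-only columns out of the export, which is one simple way of tailoring output to the requesting audience.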
Our Web Team is currently implementing a CMS (Content Management System) for which I hope some consideration has been given to linking with the forthcoming digital asset management system (DAMS), amongst other things.
Q5. What tools are needed for digital curation in addition to current tools? Or instead of current tools? Which are most important, most needed?
We do need:
- an inventory of datasets with format and contextual metadata
- access to a registry of data formats, such as PRONOM which contains information on which programs can read and write each format and suggested migration paths and "move-by" dates
- awareness-raising materials: one set for senior management, another for data owners and a third for IT staff. This last would be about OAIS.
- a checklist of features that should be present in an OAIS compliant system that could be incorporated into an Operational Requirement
Q6. How long before current hardware/software will be replaced?
Our server hardware is expected to last up to 5 years, or up to the point where support is withdrawn by the manufacturer. The KE EMu rollout has started, so no new databases are allowed to be started. This will take an estimated 3 years to complete. As there are now a number of major natural history organisations that use EMu a specialist user group has been started. We hope that through a partnership with KE the software will evolve to service our requirements more accurately and thus avoid the pain of future migrations to a certain extent.
Q7. What standards are in use or needed?
There is a whole raft of data and information standards in development for natural history organisations under the auspices of TDWG (Taxonomic Databases Working Group), in which I am pleased to have an involvement. All these standards are based on XML and are being (or will be) adopted by GBIF (Global Biodiversity Information Facility).
Q8. Are there any other standards you are considering? For the video material, have you looked into standards developed for and by broadcasting organisations? Any other guidelines needed? What about metadata schemas? Of all these, which are most important, most needed?
Well, I was quite taken by the MPEG 7 standard, but the scientist in question was not. Since the needs of research come first, he will use his favoured standard and we need to register that and keep a watch on its continuing viability for now. I am keen to promote the use of XML whenever possible. Not necessarily holding material in raw XML, but ensuring that any systems we have will be able to import and export XML either directly or via MILS. I also want to catch up on the interesting work being done in Australia on the use of XML as a preservation format — even for binaries such as images.
For our images we already use a mixture of VRA Core and RLG Preservation Metadata fields. A next step is to match that against the NDAD requirements and adjust if necessary. A fully agreed ISO (or similar, perhaps DCC?) standard for preservation metadata would be valuable in providing a reputable and stable set of fields that could be recommended to database developers, such as KE, so that that data becomes a core and integral part of databasing activities, rather than a bolt-on or completely divorced activity.
At a recent planning meeting, it was recognised that digital sustainability is a huge and important looming problem. Our Head of IT has agreed to make a presentation on this to the top level Information Services Group so that meaningful decisions can be made.
Q9. You said you are in the process of collecting more information from these data creators through a survey. It would be useful if you could list the file formats in use, but most important is to bring out key issues relating to file formats. In particular: - Which are most important, and why? - Are any proprietary formats? - What changes do you foresee?
Yes, the survey will take place towards the end of this year, it is hoped. I'm sure that it will reveal a staggering number of different file formats. I'm afraid that I could not list them all right now, but the most common will be MS Office formats, which are proprietary. Since our users are scientists, many programs will be very specialised, probably with a limited market and no support. These will need especial attention, I think.
We even produce our own software (see under "Databases & stats packages") which (I think) holds data in its own internal format but does at least have a wide variety of data import and export possibilities.
One type of material that is at high risk is email. We currently have no way of capturing project-oriented email in the same way that written letters were previously kept as part of the official record and unless these are printed out — which many are — they are at great risk of being lost.
We are moving towards firstly a single collections management system for our specimens — which I have already mentioned — and expect that this move will reduce the number of different database formats that we will need to deal with. It is interesting to note that one of the drivers for choosing KE EMu was that several other major natural history research institutions also have that system and there is a certain belief in safety in numbers. If, for example, the company went down then between us we should be more able to muster the resources to create a viable future for the data than if we were the only such user.
Secondly we are also aiming to have a DAM system within a year as a joint venture with the V&A. This will be valuable for a number of reasons, but as a later development (it is initially for images/video/sound files) could at least hold e-mail.
As to volumes, the DAMS is expected to hold 12,000 NHM images initially. The survey will reveal how much data is held elsewhere, but I do know that our current SAN of 6 terabytes is pretty well full. I also believe, although I can't yet back this up (so to speak), that there is more data on local C:\ drives than on the shared SAN.
Our digitisation programme and digital library ambitions look like they will generate at least 12 terabytes over the next few years. Much of that will be archival storage of TIFF master bitmaps, which are likely to be stored offline or nearline.
We are interested in supporting the Biodiversity Commons movement (open access publishing) and will be doing so in collaboration with our American colleagues. This also has the potential of generating huge quantities of digital material as we mass digitize back runs of our own published journals.
Data formats (e.g. PDF) and standards (e.g. METS) will be important, obviously. Since the early published material in natural history is as important as the new (unusual in a science) through the principle of "precedence of publication" in naming organisms, assured access into the indefinite future is vital, so consideration of digital sustainability issues will be required right from the start.
Many of the challenges will, as ever, be organisational rather than technical. Persuading folk to record metadata that is of no perceived direct benefit to them will be one. Finding the money to manage data migration effectively and for reliable systems will be another. Reducing dependency on local storage; legislative drivers; assurance that what comes out of the system is exactly what went in; managing access permissions and securing the onward distribution of material also spring to mind.
Although these are not directly data format concerns, with the possible exception of using the security features that PDFs offer, they are likely to be the major challenges of the future and will require a change in the culture.
I am much less worried about dynamic data than I was — at least for the most popular formats such as Flash. Having seen the explosion in emulations, started by the gaming fraternity and continuing with the computer historians, it seems that it is (will be?) possible to emulate just about anything on just about anything, whether sanctioned by the IPR holders or not. The legality of that will need exploring, but the concept of Abandonware is already real.
Q10. How is your scientific research data reused? In new studies? In creating resources for learning and teaching? For this reuse, is it read by programs or by people?
Re-use for learning is currently being examined by our new Head of Learning, who is developing a complete learning strategy for the Museum. My own interest is in being able to provide resource discovery metadata or to have links to our systems from educational portals. An example of the latter is the recent interest shown by the National Grid for Learning (NGfL) in our Nature Navigator.
The major re-use of our data is in its provision to National, European and Global aggregation services. An example of the first is the National Biodiversity Network; the second is BioCASE, the Biological Collections Access Service for Europe; and the last is the Global Biodiversity Information Facility.
Our involvement in each is slightly different, but the aim is to provide our data in a standard (XML) format so that it can be integrated with that from other biodiversity research organisations. The principle is that the enriched data sets provide a more accurate view on what lives where so that decisions on land management (for example) can be better informed.
As such, the data is read by both humans and machines, with machine systems providing access to data through distributed (wrapped) searches or through overnight data harvesting.
Other research data is published by traditional means, although the open access business model is being actively investigated.
Q11. What will be needed in 50 years to re-use your research data?
Who knows! Best guesses include something that can:
- read ascii data and interpret XML tags
- render images
- render distribution maps from spatial co-ordinates
- emulate anything else
- read the contextual metadata that explains what the data is about and where it is located
Q12. How might DCC support digital curation work? For example, could it help by providing: - Services, e.g. advice or others? - Registries? - Professional development opportunities?
Your bringing together of case studies will be useful, I think. Even more so if they can be supplemented periodically by updates.
Certainly the bringing together of awareness-raising materials that could be re-used by organisations would be very valuable. This should be aimed at at least 3 audiences (I may have said this before?). One set for senior management, one set for data owners and one set for IT and Archive staff.
Regarding advice, something like a timeline of actions could be developed out of the case studies that show just how long is the average path from initial awareness to fully functional Open Archival Information System (OAIS) setup and laying out a sequence of events that would aid success. Many organisations, like us, are just in the very early stages and may well value something that shows the scale of the problem realistically. On the other hand, it should not scare folk too much. I know that there are not supposed to be problems any more, only opportunities — but this could easily be viewed by some as an insurmountable opportunity.
You would not necessarily need to develop data format registries yourselves, but should monitor, encourage and recommend those that are being developed, such as by TNA and DLF. A sort of registry of registries? Perhaps offering to mirror them for ease of local use and security?
