Interview with Sheila Anderson and Hamish James, AHDS
Sheila Anderson is Director of the Arts and Humanities Data Service (AHDS). The AHDS acquires, curates, preserves and provides access to complex digital resources created by or supporting research and teaching in Higher and Further Education and life-long learning. The AHDS has recently established an OAIS compliant preservation repository along with a range of policy documents and preservation handbooks that describe the AHDS approach to preserving electronic texts, databases, still images, moving image, audio, GIS data, geophysics data, virtual reality materials and other digital formats.
Hamish James was, until November 2004 when he returned to his native New Zealand, the Collections Manager for the AHDS. He was responsible for specifying and developing the AHDS repository infrastructure, working closely with the AHDS technical team.
Please note that because digital curation practice changes so rapidly, this interview reflects AHDS views at the time.
Interview date: September–October 2004
- What does Digital Curation mean for you?
- How do you do it for your data? Have you considered the OAIS model, and, if so, has it been useful for you?
- How long is "long-term" preservation for your research data?
- How will your digital curation be funded?
- What tools are in use now to manage NOF-digitise data and Science Museum data that is useful for research and/or learning and teaching? - to import data? - to store data? - to locate data? - to retrieve data? - to package data to send out?
- It would also be interesting to know if you recommend tools for data creation; if so, which tools?
- What tools are needed for digital curation in addition to current tools or instead of current tools? Which are most important, most needed?
- How long before current hard/soft ware will be replaced?
- What standards are in use or needed? Any protocols such as Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH)? Any models such as the Reference Model for an Open Archival Information System (OAIS)? Any metadata schemas? Any guidelines? Which are most important, most needed?
- What file formats does AHDS curate? It would be useful if you can list them, and outline any key issues. - Which file formats are most important, and why? - Are any proprietary formats? - What changes do you foresee? - What file formats present challenges now? - What are the challenges? - How is AHDS addressing these challenges?
- Roughly what quantities of data do you manage, in terms of the total volume of data, total number of files, rate of growth of archive, and growth trends? What challenges do you foresee in future?
- How is your data reused? In new studies? In creating resources for learning and teaching?
- Is data held by AHDS read by programs or by people or by both programs and people?
- What will be needed in 50 years to re-use your data?
- What do you need that you have not got to do digital curation for your data, and that you think the DCC might reasonably provide? How might DCC support your work? For example, any of the following: - Services, e.g. advice or others? - Registries? - Professional development opportunities? - Research in any particular areas?
Q1. What does Digital Curation mean for you?
We would identify digital curation as the professional creation, management, preservation and access of digital content (data and metadata) over the short, medium and longer term. This would include following best practice (where that exists) and using standards (where they exist) to create resources, undertaking necessary actions to ensure the integrity and provenance of the digital content and its continued viability and usability over time.
Q2. How do you do it for your data? Have you considered the OAIS model, and, if so, has it been useful for you?
The AHDS takes a managed approach to digital curation from start to finish. The repository infrastructure that we have established maps onto the OAIS model and we have found it useful to formally identify tasks, inputs and outputs. However, because the OAIS is an abstract model, there is a considerable amount of work involved in establishing a practical implementation of the model. In doing this, the efforts of groups investigating a wide range of topics, such as preservation metadata, repository software and file format identification, are of more immediate use than the OAIS model itself. At the same time, the OAIS model serves as a valuable shared framework for defining and coordinating these activities. In general terms, the approach we take is outlined below:
1. Data Creation: we provide advice and guidance on best practice and standards for data creation, including Guides to Good Practice, Information Papers, and we run workshops and take part in training courses.
2. Ingest: we have established procedures and policies that we follow, including Ingest Manuals that detail the procedures for accepting and managing digital collections, and Preservation Handbooks that detail how we deal with different formats and data types. Tools that will help us automate our ingest procedures further are of particular interest to us. We receive a very diverse range of collections, which makes it especially difficult to automate their ingest.
3. Preservation and Archival Storage: a preservation policy that details our approach to preservation; a manual that details submission processes into our Archival store; a technical policy that details the technical approach.
4. Access and Delivery: we have an access and delivery policy that details our approach and a technical policy that details the technical standards that we use.
Q3. How long is "long-term" preservation for your research data?
How long is a piece of string?! Our approach has always been one of bringing the digital collections that the AHDS is responsible for into a 'managed environment' that would ensure their continued viability and access and that would be possible to pass over to another body or organisation should the AHDS cease to function. This requires us to follow standards and best practice and properly document our policies and procedures. We would argue that preservation is a series of managed actions that take place sooner rather than later — if you like, it could be classed as a series of short-term actions (say within a 5 year window) in order to ensure that difficult and detailed actions do not have to take place in the longer term (say 10 years and beyond).
Q4. How will your digital curation be funded?
The AHDS is funded by the Arts and Humanities Research Board and the JISC to undertake the acquisition, curation and preservation of arts and humanities data arising from or supporting research in our subject areas. At present we are funded until July 2007, with a comprehensive review of our activities and our position in the wider landscape due to take place in 2006. Following the review, our joint funders will decide on the most effective way to continue AHDS preservation activities.
Q5. What tools are in use now to manage NOF-digitise data and Science Museum data that is useful for research and/or learning and teaching? - to import data? - to store data? - to locate data? - to retrieve data? - to package data to send out?
We share the OCLC/RLG, Open Archival Information Systems (OAIS) view of a digital archive as an organisation, rather than the view sometimes expressed in the e-prints, e-learning and institutional repository fields of a digital archive as, mainly, a software system. So, the AHDS is a digital archive, and we use staff to perform some of the archive's functions, while others are automated.
In this context, the category of 'tools for importing data' doesn't directly apply. We accept digital resources from our depositors on physical media or via FTP, e-mail and so on, as appropriate. We do not currently have a formalised online deposit process, like that provided by DSpace for example, because we deal with a much more diverse range of material, which varies greatly in nature (databases, video, audio, CAD, GIS, text), size (100KB to 1000GB) and rights issues. Resources are deposited with our distributed centres, who are responsible for ingest. Archival storage (OAIS functional entity) is provided by the AHDS Executive in London, and once resources have been prepared for archiving they must be sent internally to the Executive to be loaded into our digital repository. Centralised archival storage is a new development for us, and we are just working out what methods we will support for transferring data from the Centres to the Executive. Again, both physical media and network methods such as secure FTP are likely. Internally, we are currently harvesting metadata created at the Centres using OAI-PMH (with XSLT for manipulation) into a new shared catalogue system (using eXist, Lucene, Cocoon and SDX). In the future we will use an online form based application to allow Centres to edit metadata directly in the shared catalogue. We are planning to use JHOVE to create some technical metadata.
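To make the harvesting step above more concrete, here is a minimal sketch of an OAI-PMH `ListRecords` exchange in Python. It is not the AHDS implementation, which as noted used XSLT and a catalogue built on eXist, Lucene, Cocoon and SDX. The endpoint URL, the sample response and the function names are assumptions made purely for illustration; only the `verb`, `metadataPrefix` and `resumptionToken` arguments and the XML namespaces come from the OAI-PMH 2.0 and Dublin Core specifications.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespaces defined by the OAI-PMH 2.0 and Dublin Core specifications.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request URL (base_url is a hypothetical endpoint)."""
    if resumption_token:
        # resumptionToken is an exclusive argument: it replaces the others.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_records(xml_text):
    """Return ([(identifier, title), ...], resumption_token_or_None)."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iterfind(".//oai:record", NS):
        identifier = record.findtext("oai:header/oai:identifier", namespaces=NS)
        title = record.findtext(".//dc:title", namespaces=NS)
        records.append((identifier, title))
    token = root.findtext(".//oai:resumptionToken", namespaces=NS)
    return records, (token or None)


# Made-up response from an imaginary Centre, used instead of a network call.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Sample collection</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

if __name__ == "__main__":
    print(list_records_url("http://example.org/oai"))
    print(parse_records(SAMPLE))
```

A real harvester would fetch each URL, call `parse_records` on the response, and repeat with the returned resumption token until none is supplied.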
The archival store consists of a main disk and tape library store at the AHDS Executive and a secondary off-site store, based on a tape library and disk storage, which are also accessible via Storage Resource Broker (SRB) software. The metadata model supported by SRB is rather simple (key, value pairs), so we don't plan to use SRB for managing more than the internal metadata it needs to track the location of resources.
We make a clear distinction between the ingest/archival side of the repository and services for delivering data to users. In the past, each Centre has developed its own tools, ranging from simple file download services to online analysis tools, but we are now in the process of integrating the existing tools and developing new ones that can be used across the entire AHDS. Our general approach is to adapt existing open source projects and create utilities where they are useful (e.g. we have created tools for mapping between identifier schemes and for extracting row/column subsets from delimited text files). Our central services are based on Java servlets and developed using open source tools (Ant, CVS).
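As an illustration of the kind of small utility mentioned above, the following Python sketch extracts a row/column subset from a delimited text file. The AHDS tools themselves were Java-based and are not shown here; the function name, the sample data and the column names are all hypothetical.

```python
import csv
import io


def extract_subset(source, columns, row_filter=None, delimiter=","):
    """Yield each row as a dict restricted to `columns`.

    `source` is any open text file (or file-like object) whose first line
    holds the column names; `row_filter` is an optional predicate that
    receives the full row and returns True for rows to keep.
    """
    reader = csv.DictReader(source, delimiter=delimiter)
    for row in reader:
        if row_filter is None or row_filter(row):
            yield {name: row[name] for name in columns}


if __name__ == "__main__":
    # Made-up tab-delimited data standing in for a deposited table.
    data = "id\tplace\tyear\n1\tA\t1851\n2\tB\t1861\n"
    subset = extract_subset(
        io.StringIO(data),
        columns=["place", "year"],
        row_filter=lambda row: row["year"] == "1861",
        delimiter="\t",
    )
    for row in subset:
        print(row)
```

Because the function is a generator, it reads one row at a time and so can be applied to files much larger than available memory.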
Q6. It would also be interesting to know if you recommend tools for data creation; if so, which tools?
We tend not to recommend specific tools. Instead we encourage data Creators to use tools that are available and supported locally and that allow for their work to be as software independent as possible.
Q7. What tools are needed for digital curation in addition to current tools or instead of current tools? Which are most important, most needed?
There are a lot of tools already available that can perform one or two tasks useful for digital curation. Utilities for generating or comparing checksums are one example. Tools for managing the overall processes of curation are missing. So, for example, we are considering developing a new database application internally to help track work done during ingest, and a tool for coordinating the creation and editing of preservation metadata (e.g. allowing manual entry and capturing output from products like JHOVE and exporting them all to an appropriate metadata format). A ready to use tool for tracking the contents of a repository and checking it against file format data in PRONOM or another database would be useful, although PRONOM itself may develop in this direction. The Metadata Encoding and Transmission Standard (METS) is gaining ground as a metadata standard for digital curation, and tools for creating and editing METS documents conforming to various profiles would be very useful. Tools that would help validate the significant properties of a resource after it has been migrated from one file format to another would be extremely useful.
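The checksum utilities mentioned above can be sketched in a few lines of Python. This is only an illustration of the general technique (recording a checksum per file and later comparing it against the current state of the store), not a description of any tool the AHDS used. The function names and the choice of SHA-256 are assumptions.

```python
import hashlib
from pathlib import Path


def file_checksum(path, algorithm="sha256", chunk_size=1 << 20):
    """Return the hex digest of a file, reading it in chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root, algorithm="sha256"):
    """Map each file's path (relative to root) to its checksum."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): file_checksum(path, algorithm)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def compare_manifests(stored, current):
    """Return (missing, changed, added) lists of relative paths."""
    missing = sorted(set(stored) - set(current))
    added = sorted(set(current) - set(stored))
    changed = sorted(
        name for name in stored.keys() & current.keys()
        if stored[name] != current[name]
    )
    return missing, changed, added
```

In use, a manifest would be built when a collection enters the archival store and saved alongside it. Rebuilding the manifest later and passing both to `compare_manifests` reports files that have disappeared, been altered, or appeared unexpectedly.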
Q8. How long before current hard/soft ware will be replaced?
Servers will probably be replaced in three to four years; storage devices will be replaced depending on the growth of our collections versus the declining cost of storage.
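Q9. What standards are in use or needed? Any protocols such as Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH)? Any models such as the Reference Model for an Open Archival Information System (OAIS)? Any metadata schemas? Any guidelines? Which are most important, most needed?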
We recommend standards appropriate to the data type and subject area of Data creators. We recommend certain file formats, based on the level of software support, stability, and openness of the format. We recommend the use of proper ASCII or UNICODE, but are not specific about which method of encoding is used (UTF-8 and so on). We recommend certain approaches to structuring data that can be employed in any number of file formats, especially the relational data model and XML. We recommend specific methods of organizing and describing data, like the TEI. For descriptive metadata, our Centres tend to recommend the recognised standards in their subject areas — standards like TEI (TEI header), DDI, VRA 3.0 core, and so on.
Q10. What file formats does AHDS curate? It would be useful if you can list them, and outline any key issues. - Which file formats are most important, and why? - Are any proprietary formats? - What changes do you foresee? - What file formats present challenges now? - What are the challenges? - How is AHDS addressing these challenges?
At the AHDS, we identify three logical versions of each collection deposited with us: the original version, comprised of the actual files and other material deposited; the preservation version, comprised of files in formats suitable for long-term preservation; and the dissemination version, in formats which are easy to use at the present time. These three logical versions of the collection may or may not involve the same actual files. The original version could include files of any file format, and we curate this to the extent that we will provide bit preservation. The preservation version is comprised of files in formats that we think will be stable and accessible for a long period of time. Key factors in assessing this are the openness of the file format specification, the level of software support, the level of use of the format, and the stability and backward compatibility of the file format and supporting software. Because we deal with a whole range of material, there is a whole range of file formats that are important for preservation. If files supplied by the depositor are not appropriate for long-term preservation, we will create the preservation version during ingest (migration at ingest). We've identified 16 categories of digital resource we deal with, and are currently finalising a set of preservation handbooks that give specific guidance on how to deal with each of them:
- Plain text
- Markup
- Binary Text/Word processor documents
- Relational Databases
- Statistical data file
- Spreadsheet
- GIS
- CAD
- Bitmap (raster) image
- Vector graphics
- Virtual reality
- Audio
- Moving image
- Geophysics data file
- Scripts
For example, the statistical data file handbook recommends using the SPSS format for preservation, or delimited text files with appropriate documentation. The handbook for binary text/word processor documents recommends conversion to XML markup or plain text if possible, but if not, recommends Microsoft RTF. As these examples show, we do use some proprietary formats. In areas like GIS and CAD there is a lack of suitable open alternatives. You can get an idea of the more important formats that we deal with by looking at our preferred and acceptable deposit formats list.
File format problems are really problems with the software needed to render the data. Problems occur when it is difficult to get software that can accurately render the data in a file format, and this usually occurs because the software is no longer available, or is too expensive. Most file format specifications are not readily available, and this makes it difficult to find alternative ways of reading formats for which software is not readily available. Realistically though, the AHDS lacks the resources to create software to read complex obsolete formats. The AHDS attempts to minimise these problems by encouraging our depositors to use sensible formats (open, widely supported, and so on) when creating their files, and by migrating data from formats that are not appropriate for long-term preservation at ingest, when, hopefully, software to render and export the format is still obtainable.
A major change for the better is the increasing use of XML as a standard way of defining file formats. So, for example, even if you do not have Microsoft Excel, you can still do something with an Excel workbook saved as an XML document. Increasing use of UNICODE is also helping with the multilingual resources we receive. We would expect to see more XML based formats being deposited across most of the digital resource categories listed above. A new problem we may have to deal with is an increasing number of poorly documented custom XML DTDs and schemas.
Real Audio is compressed (heavily) and so would be unethical as an archive format (unless nothing better were available). We archive the uncompressed signal as Broadcast Wave Format (BWF) files, and make Real Audio files as an 'access copy' for intranet access.
Q11. Roughly what quantities of data do you manage, in terms of the total volume of data, total number of files, rate of growth of archive, and growth trends? What challenges do you foresee in future?
We are in the process of moving all our data into a new central repository. At the moment, the total collection size is about 1.5 terabytes — we will have an accurate figure in a few weeks once all the material has been transferred to the new repository. Number of files is probably around one million.
Our new repository has an initial capacity (disk and tape library) of around 15 terabytes. This large margin for growth is needed to allow for the possibility of rapid and unpredictable growth in our collection. Particular types of digital content, especially images, audio and audio/video, generate very large volumes of data. The main component of any significant increase in the volume of data (as opposed to numbers of digital resources) deposited with the AHDS in the next few years will probably be digitised images. At the moment, digitised images form about half of the total data volume held by the AHDS, although they constitute only 20 or so of our 3,000 collections. The largest confirmed future deposit with the AHDS is a collection of digitised images that will total between two and three terabytes of data: a single collection that will be at least twice the size of our total holdings at present. In short, a small number of deposits may dramatically alter the total size of our holdings.
Q12. How is your data reused? In new studies? In creating resources for learning and teaching?
To encourage reuse, and because access to most of the collections we hold is not restricted, we do not require users to go through a complex registration process. While this is convenient for our users, it does make it difficult for us to find out exactly what they do with collections they download.
The AHDS holds digital surrogates of analogue texts, images and recordings, and more complex collections of information selected, summarised or modified from other analogue or digital sources — things such as historical census databases and the archives of archaeological field work. In short, we hold research resources, and these can be reused in a variety of ways. Most obviously, they can be used in further research. They might also be used in learning and teaching, but few of our collections have been designed specifically for this purpose, either in terms of content (instructional material, assessments, and so on) or structure (they are not held as SCORM objects or similar).
Q13. Is data held by AHDS read by programs or by people or by both programs and people?
The collections we hold represent complex outputs from research projects, and they are often quite varied internally (XML indexes linking to images and transcriptions from original sources, with associated working papers written by members of the research team, for example). Collections held by the AHDS, with the exception of our image collections, are typically not comprised of simple, discrete items that can be easily used out of context.
Therefore, apart from metadata interoperability, our collections are not readily suited to machine-to-machine processing. They are designed to be used by people, although the first thing someone may do is load the material into a program to perform some type of analysis.
Q14. What will be needed in 50 years to re-use your data?
Ignoring the numerous imponderable factors, the requirements for reuse after 50 years will be set by the approach taken to digital preservation. The AHDS follows a primarily migration-based approach to digital preservation. For each collection we store an original version and a preservation version (as well as dissemination versions, which are not relevant here):
- The original version consists of all the files given to the AHDS by a depositor, along with scanned copies of the AHDS licence form, catalogue form and data and documentation transfer form. This version will potentially allow for emulation and related approaches to be applied in the future. Some of the data types we hold should be usable in 50 years provided simply that the files are stored securely and do not become corrupted. I'm confident that a single table dataset stored as a delimited text file will be easily read in 50 years and, as today, it will be a simple matter to import it into a database or spreadsheet style of application for analysis. In this case, all that will be needed to use the data in 50 years' time is the understanding that the data file represents a table containing rows and columns. Fully informed use of the data will then depend on understanding what the values in each cell of the table represent, so documentation will be needed. We encourage our depositors to provide detailed documentation on the provenance, structure and semantics of their data when it is deposited.
- The preservation version consists of the information content (i.e. the images, sounds, text, and so on; not the bytes) encoded in the files of the original version, but periodically migrated so that there is always a version of the content held in a stable and readable format. Still image, audio, moving image, GIS, and 3D modelling (CAD/virtual reality) formats are all likely to need migration to keep pace with changes in technology. Therefore, to be usable in 50 years' time the AHDS or another organisation will need to migrate these files, and validate the migration, one or more times in the intervening half-century. Because we store the original files as well as a migrated preservation version, two alternative approaches are also possible. Firstly, someone could potentially acquire and store the necessary file format specifications for fifty years and then write new software to read the originally deposited files. Secondly, someone could store the original software used to read the files for fifty years, and an emulator might then be used to run it. The AHDS is not actively pursuing these approaches at the present time.
IPR restrictions may still affect data after 50 years, so framing the least restrictive IPR agreements, and maintaining a clear record of relevant IPR will be important to allowing reuse after 50 years.
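Q15. What do you need that you have not got to do digital curation for your data, and that you think the DCC might reasonably provide? How might DCC support your work? For example, any of the following: - Services, e.g. advice or others? - Registries? - Professional development opportunities? - Research in any particular areas?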
Again a question that can be answered in many different ways. Trained staff and good tools are needed to 'do' digital curation. Both these are difficult to find, and we are looking to the DCC to help address these issues. From our perspective, advanced, rather than introductory, training and advice services are needed to help equip staff to establish repository policies and procedures and actually perform activities such as migration. Training courses need to be a week long, not a day long. Advice needs to be provided through direct contact with experts, not through FAQs and pamphlets.
Useful support for tools could range from timely and informed reporting on available tools, focusing particularly on how tools not specifically designed for curation might be modified/applied in a digital archiving situation, through to actually developing tools.
The DCC should take on a technology watch role, although this might mean more of a coordination and dissemination role rather than a role actually tracking and analysing developments.
Staff at the AHDS have found the ERPANET workshops and seminars very useful events, and we would encourage the DCC to consider running similar events in the future.
