Interview with Laura Mitchell, National Archives of Scotland

  1. What does digital curation mean for you, in general terms?
  2. In very general terms, how do you do it for your data?
  3. Have you considered the Open Archival Information Systems (OAIS) model, and if so, has it been useful for you?
  4. How long is "long-term" preservation for the National Archives of Scotland?
  5. How will digital curation be funded? How long is your funding horizon?
  6. What tools are in use to import data?
  7. What tools are in use to store data?
  8. What tools are in use to locate data?
  9. What tools are in use to retrieve data?
  10. What tools are in use to package data to send out?
  11. What tools are needed for digital curation in addition to current tools? Or instead of current tools?
  12. Which are most important, most needed?
  13. How long before current hardware/software will be replaced?
  14. What standards are in use or needed?
  15. What file formats are you using? Could you list them and highlight any issues such as: which file formats are the most important and why, are there any proprietary formats, and what changes do you foresee?
  16. Am I right in assuming that the primary use of the NAS digital archives is "evidential" — if that is the correct term?
  17. Is the data expected to be reused, considering reuse as being by people other than those who created it and/or for purposes other than those for which it was created, e.g. historic or social research?
  18. Will the data be read by programs or by people or both?
  19. What will be needed in 50 years to re-use your data?
  20. How might DCC support digital curation work? For example, could it help by providing: - Services, e.g. advice or others? - Registries? - Professional development opportunities

Q1. What does digital curation mean for you, in general terms?

It means all the things which "curation" means for us in the paper archive world — the selection, preservation and making available of digital objects which are of national significance to Scotland, including records from government, the courts, businesses and private individuals. These will increasingly be "born digital" and will also be increasingly complex (websites, CAD systems, digital film and sound, etc). They will also include material which was not born digital but where only digital objects are available for transfer to the National Archives of Scotland (NAS) (e.g. where an organisation has scanned paper and disposed of the originals).

Increasingly, we will also have our own digital material to look after, created as the result of digitisation projects (e.g. the digitisation of Scotland's wills and testaments). These are slightly different to the digital archive material described above, as the original paper records are still available as ultimate backup, but many of the issues of long-term preservation and access will be common to both categories of material.

Q2. In very general terms, how do you do it for your data?

We are working on this! We have worked out what needs to be done in order to accession, check and store archival digital objects and have just begun a Digital Data Archive (DDA) Project to put in place the technical and procedural infrastructure to go ahead with this. It is based on various principles, e.g. that only archive material be stored in the system (as opposed to images created as the result of NAS digitisation projects), that as much work as possible be automated (e.g. metadata extraction, checks for format degradation, and so on), that the original format of the digital object is retained even if migrated formats have to be created for preservation/access purposes, that enough backups exist to reinstate any data which might become corrupted, and that there is no direct connection between where the objects are stored and where access is given to them. In the meantime, the digital material we have received so far has been checked using an MD5 algorithm and is stored on CD, with backups in a separate building. The CDs are checked regularly to ensure that no data corruption has taken place.
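As a rough illustration of the kind of MD5 check mentioned above, the following Python sketch records a checksum for every file at accession and re-computes it later to detect corruption. The function names and the manifest layout are assumptions made for illustration only; they do not describe the actual NAS procedure.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file path (relative to root) to its MD5 digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): md5_of_file(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def find_changes(root, manifest):
    """List files that are missing or whose digest no longer matches the manifest."""
    root = Path(root)
    changed = []
    for rel, expected in manifest.items():
        target = root / rel
        if not target.is_file() or md5_of_file(target) != expected:
            changed.append(rel)
    return changed
```

A manifest built at accession with build_manifest could be stored alongside the accession record, and find_changes run against each copy at every periodic check; an empty list means no corruption was detected.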

Work on how public access will be given to the digital objects is likely to be the subject of a separate project, but this has yet to be set up. In the meantime, staff will need access for accessioning and cataloguing purposes, so this aspect is included in the current DDA project.

Q3. Have you considered the Open Archival Information Systems (OAIS) model, and if so, has it been useful for you?

Work began on designing our digital data archive several years ago and we started from first principles, working out what happened in the paper world and how this needed to be translated into the digital world. We subsequently found that the stages we identified correspond extremely closely to the OAIS model, and we therefore intend to use this model to guide the DDA project. We also hope that it will be useful to be able to quote this recognised standard when bidding for funds and so on, rather than just saying "this is what we think we need to do".

Q4. How long is "long-term" preservation for the National Archives of Scotland?

For ever, basically!

Q5. How will digital curation be funded? How long is your funding horizon?

Good question. There is no extra money for it at the moment, which is why our current work is quite low key. We are having to fit it into existing resources, which basically means part of an IT person and part of an archivist, and this in turn is dictating how long it is likely to take us (we currently have a target date of 1 April 2007 for a pilot system to be up and running). NAS is governed by the general Scottish Executive Spending Review, which is updated annually but looks ahead a full 3 years. We are also hindered by the fact that nobody seems to have much idea how much the various aspects of digital curation are likely to cost. Most of the projects already up and running (e.g. in Australia, America, The National Archives in the UK, and so on) have had funding in the millions, which it would be totally unrealistic for NAS to request. We are finding it very difficult to determine what is the minimum amount of money we will need to put together a simple, effective system, or how much it is likely to cost us to maintain it over time. This is an area where the Digital Curation Centre might profitably do some research; we know what needs to be done, but costing anything other than the hardware and software required is really difficult, so a few examples of "typical" costs which could be scaled up or down depending on quantities of material being curated and amount of staff available would be really helpful.

Q6. What tools are in use to import data?

None — data are sent to us on CD by the transferring authority.

Q7. What tools are in use to store data?

The data are currently kept on CD in their original format. We make a backup copy which is stored in a separate building.

Q8. What tools are in use to locate data?

Data are given an accession number when they come to the National Archives of Scotland (NAS), just like any other sort of information we receive. This is how they are currently located. We are currently working on cataloguing standards for digital material, and I would assume that once our storage becomes more automated and server based, a unique identifier of some sort will also be needed.

Q9. What tools are in use to retrieve data?

At the moment, it is just a question of ordering the CDs out using the accession number. In future, the system will be much more sophisticated.

Q10. What tools are in use to package data to send out?

We don't yet do this.

Q11. What tools are needed for digital curation in addition to current tools? Or instead of current tools?

- Mirroring software to allow data stored on one server to be mirrored on a second server for backup purposes.

- Automated system for checking the integrity of data held on the servers (in our case using MD5) and alerting staff to any changes before those changes have been replicated across both servers and their backup tapes.

- An open source system for storing metadata would be useful for the digital preservation community generally. (In the absence of such a tool, NAS is currently planning to use SQL Server, which is not open source but should allow us to convert metadata to XML for interoperability purposes.)

- Tools for monitoring access to digital storage areas, so that it is possible to know who has accessed them, when and why, and what, if anything, they did to the data (to ensure that authenticity and integrity of information can be guaranteed over time).

- Strictly controlled access tools for different types of user (e.g. IT administrators, accessioners, cataloguers). In most cases, direct access to the digital archive should not be allowed, which means having some sort of "air-lock" system which allows them to deposit material into the archive and get copies of that material out again without having direct access to the digital archive itself.

- Automated tools for validating and checking material at ingest.

- Tools for flagging up format obsolescence and indicating what that format should be migrated to (i.e. some sort of automated version of PRONOM).

- Tools for migrating data at ingest, where necessary (although we intend to keep original formats, we may also need to migrate the data into a more universal format and preserve that too, if the original format is particularly obscure, or we know it is about to become obsolete).

- Tools to support accessions procedures.

- Tools for automatically extracting metadata from digital objects (for accessioning, preservation and cataloguing purposes).
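
To give a flavour of the last item in the list above, here is a minimal Python sketch of automatic metadata extraction. It gathers only basic technical properties from the file system; the function name and the field names are hypothetical, and a real tool would need format-specific extraction well beyond this.

```python
import mimetypes
import os
from datetime import datetime, timezone


def extract_basic_metadata(path):
    """Collect a few technical properties of a file without opening it in its native application."""
    info = os.stat(path)
    # The MIME type is guessed from the file name only, not from the file contents.
    mime_type, _encoding = mimetypes.guess_type(path)
    return {
        "file_name": os.path.basename(path),
        "size_bytes": info.st_size,
        "modified_utc": datetime.fromtimestamp(info.st_mtime, tz=timezone.utc).isoformat(),
        "guessed_mime_type": mime_type or "unknown",
    }
```

Running this over every file at ingest would give accessioners a starting record to check and extend, rather than one to type in by hand.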

Q12. Which are most important, most needed?

All are equally important really, as one cannot have a properly streamlined system if any one of the components is missing. The key is that as much as possible should be automated, as the quantities of digital material we are likely to receive will mean that minimum human intervention will be the only way to get through the work.

Q13. How long before current hardware/software will be replaced?

Hardware/software tends to be replaced every 3 to 5 years.

Q14. What standards are in use or needed?

OAIS

PD0008 (some sort of development of PD0008 would be useful. At the moment, it is the only thing around to follow if one needs to be able to demonstrate the authenticity of the digital material one holds, but it is not really aimed at "born digital" material, more at the scanning process. Also, because it is not an actual British Standard, there is some discussion within NAS about whether it is worth trying to follow it to the letter, whether we should just pick and choose various elements, or whether it's really much use at all. We will be using it to some extent. Personally, I think it's better than nothing, but a more relevant standard could possibly be produced.)

XML: the government's interoperability requirements (e-GIF and e-GMS) are based on XML, which means that we must at least be able to convert metadata into it for sharing across government. I believe that some organisations are producing at least some of their data in XML format so that they will not have to migrate it later for preservation purposes. This is all very well for an organisation creating and preserving its own material, but somewhere like NAS, which takes in material from a wide variety of organisations, cannot dictate what formats those organisations create and keep their data in, and therefore needs to be able to cope with a wide variety (unless any future new archive legislation gives us the power to require organisations to migrate the data into particular preferred formats before sending it to us!).
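
Converting a metadata record to XML can be sketched in a few lines of Python using the standard library. The element names below are invented for illustration and do not follow e-GMS or any other schema; a real export would map fields to whatever element set the receiving system requires.

```python
import xml.etree.ElementTree as ET


def metadata_to_xml(record, root_tag="record"):
    """Serialise a flat dict of metadata fields to an XML string.

    Assumes every key is a valid XML element name.
    """
    root = ET.Element(root_tag)
    for field, value in record.items():
        child = ET.SubElement(root, field)
        child.text = str(value)  # special characters such as & are escaped on output
    return ET.tostring(root, encoding="unicode")


# Hypothetical record; the accession number is a made-up placeholder.
example = {"accession_number": "EXAMPLE-001", "format": "TIFF", "size_bytes": 18874368}
print(metadata_to_xml(example))
```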

Standards for all the above listed tools might be very helpful, as one would then know exactly what one needed and exactly which products fitted the bill.

Q15. What file formats are you using? Could you list them and highlight any issues such as: which file formats are the most important and why, are there any proprietary formats, and what changes do you foresee?

We have two sets of standards, one for the creation of digital surrogate material within NAS, and one for the creation of digital surrogate material as part of the Scottish Archive Network (SCAN) project. The NAS standards have been developed over the last couple of years, whereas the SCAN standards were established when this project was set up several years ago. NAS hosted and part funded the SCAN project, the main part of which has now finished. The residual part of the project has been taken back into NAS and continues to use its original standards. See the SCAN project website for more information.

NAS standards:

Digital surrogate material is being created as 300ppi (pixels per inch) TIFF files to internal standards, which capture images at a 1:1 ratio and at high resolution (300ppi, 24-bit colour depth, RGB colour space).

SCAN standards:

The SCAN project standards are based on their camera capacity which is:

Camera type            CCD Size       Total Available Pixels
Kodak Megaplus 6.3i    3072 x 2048    6291456
Atmel Camelia          3500 x 2300    8050000

That is repeated for each of the 3 colours giving a maximum file size of 18–24MB.
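
The file sizes quoted above follow from simple arithmetic, which the short Python sketch below reproduces. It assumes one byte (8 bits) per colour per pixel and no compression; neither assumption is stated explicitly for the SCAN cameras.

```python
# CCD dimensions in pixels, taken from the camera figures above.
CAMERAS = {
    "Kodak Megaplus 6.3i": (3072, 2048),
    "Atmel Camelia": (3500, 2300),
}
CHANNELS = 3          # one sample for each of the 3 colours
BYTES_PER_SAMPLE = 1  # assumption: 8 bits per colour


def uncompressed_size_bytes(width, height):
    """Size of an uncompressed image: pixels x colours x bytes per sample."""
    return width * height * CHANNELS * BYTES_PER_SAMPLE


for name, (width, height) in CAMERAS.items():
    print(name, width * height, "pixels,", uncompressed_size_bytes(width, height), "bytes")
```

This gives 18,874,368 bytes for the Kodak camera and 24,150,000 bytes for the Atmel camera, consistent with the 18 to 24MB range.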

Q16. Am I right in assuming that the primary use of the NAS digital archives is "evidential" — if that is the correct term?

I would say that this is correct to a certain extent. The digital archives are selected for preservation for a combination of their 'evidential' and 'informational' value just as we would for paper archives. It is certainly true that the majority of the archives we have in the National Archives of Scotland (NAS) — whether paper or digital — can be used for evidential purposes. This can be to provide evidence of rights over property, to give the decisions in legal cases and to give details of individual obligations.

Q17. Is the data expected to be reused, considering reuse as being by people other than those who created it and/or for purposes other than those for which it was created, e.g. historic or social research?

Yes, certainly. The overwhelming bulk of cases of data reuse in the future will be by people other than those who created it and for purposes other than those for which it was created.

Q18. Will the data be read by programs or by people or both?

I would expect the data to be read by both programs and people. How far one will predominate over the other very much depends on what exactly is meant by 'read'.

Q19. What will be needed in 50 years to re-use your data?

That really is the $64,000 question! There are so many ifs and buts I could put in here. The main thing that strikes me is that it is crucial to know what the data refer to, how they are structured and what (if anything) has happened to them in terms of manipulation or migration in the intervening 50 years. So I suppose what I'm saying is that good metadata and audit trails will be required in 50 years to re-use the data in the digital archives we hold. No matter how many different technological tools come along to help us manipulate, emulate or migrate the data we'll still need to know what the data mean.

Q20. How might DCC support digital curation work? For example, could it help by providing: - Services, e.g. advice or others? - Registries? - Professional development opportunities

I think it would be most helpful if DCC could provide an advice service to digital repositories. Issues which could be covered include automated transformation of digital objects from their original format to a 'preservation' format, and information on open source software that will allow us to read digital objects held in software formats which are not widely available. It could also be of interest to digital repositories if you were in a position to offer downloadable tools to enable people to carry out functions such as transformation from one format to another or, possibly in the medium to long term, emulation of obsolete software and platforms.

It would certainly be useful if DCC were able to provide seminars or conferences on recent thinking in digital preservation and the emergence and application of new standards. We would also be interested in exploring the possibility of staff secondments either to DCC or other digital repositories.
