Interview with Ann Borda, Science Museum, London

Ann Borda has held strategic and operational roles in academic and cultural organizations, with over 10 years' experience. She recently held the position of Head of Collections Multimedia at the Science Museum in London, where she managed several major initiatives, such as a large New Opportunities Fund (NOF)-Digitise grant to digitise resources across three National Museums, which led to the development of the award-winning Ingenious site. Ann was also involved in Fathom.com, an innovative e-learning collaboration led by Columbia University and the London School of Economics.

  1. What does Digital Curation mean for you?
  2. How do you do it for your data?
  3. Have you considered the Open Archival Information Systems (OAIS) model, and, if so, has it been useful for you?
  4. How long is "long-term" preservation for your research data?
  5. How will your digital curation be funded?
  6. How long is your funding horizon?
  7. What tools are in use now to manage NOF-digitise data and Science Museum data that is useful for research and/or learning and teaching? - to import data? - to store data? - to locate data? - to retrieve data? - to package data to send out?
  8. How long before current hardware/software will be replaced?
  9. What tools are needed for digital curation in addition to current tools or instead of current tools? Which are most important, most needed?
  10. What standards are in use or needed? Any protocols such as OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting)? Any models such as the Reference Model for an Open Archival Information System (OAIS)? Any metadata schemas? Any guidelines? Which are most important, most needed?
  11. Could you add something about the quantity of data involved, in terms of the total number of files and total volume of data?
  12. Do you see any further issues related to file formats and data volumes curated by museums? And do you foresee any changes in file formats and data volumes to be managed? You mentioned preservation issues relating to software. Do you foresee any additional challenges in future? If so, what are they likely to be?
  13. How do people get access to the data for commercial use?
  14. How might rights issues relate to the use of material in schools? Could NOF or other Science Museum digital material be repurposed by teachers, for example, as part of new learning objects?
  15. How else might DCC support your work?

Q1. What does Digital Curation mean for you?

I am familiar with the DCC definition, and concur with its various layers! Digital curation has always meant more to me than archiving and preservation: it involves an understanding of the life cycle of both 'born digital' resources and digital surrogates, providing a full framework of best practice in standards implementation from the start, through creation, management and delivery. Repurposable assets are a key outcome.

Q2. How do you do it for your data?

When I started on the NOF-Digitise project in which I was lead initiator and project manager, my level of understanding rose in terms of the parameters of digital curation. This was aided by workshops offered by NOF. Hence, I had a fuller appreciation of the digital life-cycle, the importance of standards in digital creation, naming, describing, managing (including DRM), delivery, and archiving.

I refer to the NOF-Digitise Technical Standards and have disseminated this information to all teams involved in the NOF project.

The Technical Advisory Service for Images (TASI), along with its very useful FAQs, has also been consulted throughout the process.

Q3. Have you considered the Open Archival Information Systems (OAIS) model, and, if so, has it been useful for you?

Yes, I did consider the OAIS model, and actually did a paper for the Open Archives Initiative (OAI) forum in 2003. We were mandated to use the Research Support Libraries Programme (RSLP) collections level description for NOF, but we referred to several models during the course of the project. This included looking at various interoperability solutions for a harvesting mechanism that we put in place to export/import digital records from a set of distributed databases.

Q4. How long is "long-term" preservation for your research data?

As part of the NOF-Digitise programme, I invested in a Storage Area Network (SAN) to enable preservation and access to the NOF-digitised assets for the benefit of our family of museums which are on a WAN: Science Museum (London), National Railway Museum (York) and National Museum of Photography, Film & Television (Bradford). The premise was to implement a solution that would last for about 5 years. However, 'long term' is also dependent on the resource type, the storage media itself and the ICT strategy of the organization.

For NOF, we stored on DLT tapes, and on the SAN we held the most 'popular/marketable/requested' images where possible, although just before I left there were discussions about migrating to a total DVD solution with lots more disk arrays and several more terabytes. The SAN was scoped at a time when it offered top storage space: 3 terabytes in 2001.

So we now have a possible issue of storage space (the SAN is filled to capacity) and cost of tapes in secure locations in a distributed environment.

The files to be stored and preserved are high-resolution digital images averaging 20MB to 50MB each (depending on image type), approximately 45,000 in total.
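As a rough check on those figures, the sketch below multiplies the approximate file count by the low and high average sizes quoted above. It assumes decimal units (1 TB = 1,000,000 MB), and the function name is purely illustrative.

```python
# Rough storage estimate for the master image files.
# Assumption: decimal units, i.e. 1 TB = 1,000,000 MB.
MB_PER_TB = 1_000_000


def total_terabytes(file_count: int, average_mb: float) -> float:
    """Return the total volume in TB for file_count files of average_mb each."""
    return file_count * average_mb / MB_PER_TB


# Figures quoted in the interview: about 45,000 files of 20MB to 50MB each.
low = total_terabytes(45_000, 20)
high = total_terabytes(45_000, 50)
print(f"Estimated master-file volume: {low:.2f} to {high:.2f} TB")
```

On these assumptions the master files alone come to between 0.9 and 2.25 TB, which helps explain why a SAN scoped at 3 terabytes fills up once derivative files and other assets are added.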

Other multimedia assets generated for exhibitions, etc. are being identified for preservation and access and are to be part of a wider strategy, although I did regular archiving of resources created under my management as Head of Collections Multimedia (on portable hard drive and on CD-ROM). The organisational websites are regularly backed up on tape or DVD, but are similarly not yet integrated as part of a formal preservation strategy.

Q5. How will your digital curation be funded?

NOF requires a 3 year monitoring period which shows continued digitisation and addition of new content to the NOF-funded website. Small in-house teams are continuing the work as part of a revised job remit. Other funding may be sought for specific content digitisation, e.g. Royal Society of Photography at the National Museum of Photography, Film & Television (NMPFT).

Unable to comment on other digital assets.

Q6. How long is your funding horizon?

Not sure. NOF funding ended in April. The Science Museum is still undergoing a restructuring process and particulars have not yet been addressed.

Q7. What tools are in use now to manage NOF-digitise data and Science Museum data that is useful for research and/or learning and teaching? - to import data? - to store data? - to locate data? - to retrieve data? - to package data to send out?

This is a complex question, because in the example of the NOF project we wanted to utilise both existing digital assets and newly digitised ones. The existing digital assets (text records and digital images) are held in different databases across three sites. Hence, we needed to look at developing a harvesting infrastructure to enable accessing, standardizing, and repurposing digital resources (i.e. delivery to the web) from across distributed environments.

The work this approach entailed was undertaken in two key phases:

(i) implementing a harvesting model in order to overcome the challenge of retrieving and managing information from distributed source systems

and

(ii) integrating and delivering this content to the web and to other channels.

Phase 1 involved the development of a batch content hub ('interim database') which was built using Dublin Core (DC) fields as the primary data structure and which served as the central container for export files originating from five separate source systems. The interim database also functioned to normalize data, generate automatic fields and to process data through specific tools.

It should be noted here that a single export format was not possible across the 5 source databases currently in use at the different institutions. Only three of the legacy systems could export using XML; the others exported delimited CSV files, with the UNICORN (library) records containing MARC tagged fields. Consequently, the conversion program of the interim DB required quite a bit of tweaking to strip out MARC fields and to deal with the variations among the CSV exports (e.g. rogue characters, diacritics, formatting, and so on).

In Phase 2 of the process, the data records held in the interim database were normalised and extracted as XML wrapped DC fields to the web Content Management System (CMS). The extracted metadata was subsequently integrated and managed for display, building searches and relational linking.
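The normalise-and-extract step described above can be sketched as follows. This is a minimal illustration, not the project's actual conversion program: the column names in `FIELD_MAP`, the sample row, and the `records`/`record` wrapper elements are all hypothetical. Only the Dublin Core element namespace URI is the standard one.

```python
import csv
import io
import unicodedata
import xml.etree.ElementTree as ET

# Standard Dublin Core elements namespace.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

# Hypothetical mapping from one source system's CSV columns to DC elements.
FIELD_MAP = {
    "Object Name": "title",
    "Maker": "creator",
    "Date Made": "date",
    "Ref": "identifier",
}


def clean(value: str) -> str:
    """Compose diacritics (NFC), drop rogue control characters, collapse whitespace."""
    value = unicodedata.normalize("NFC", value)
    kept = (
        ch for ch in value
        if ch.isspace() or not unicodedata.category(ch).startswith("C")
    )
    return " ".join("".join(kept).split())


def row_to_dc(row: dict) -> ET.Element:
    """Build one record element holding the mapped, cleaned DC fields."""
    record = ET.Element("record")
    for source_field, dc_name in FIELD_MAP.items():
        text = clean(row.get(source_field) or "")
        if text:  # skip empty fields rather than emit empty elements
            ET.SubElement(record, f"{{{DC_NS}}}{dc_name}").text = text
    return record


def csv_to_dc_xml(csv_text: str) -> str:
    """Convert a CSV export into XML-wrapped DC records."""
    root = ET.Element("records")
    for row in csv.DictReader(io.StringIO(csv_text)):
        root.append(row_to_dc(row))
    return ET.tostring(root, encoding="unicode")


# Made-up sample export with stray spaces, a decomposed accent and a control character.
SAMPLE_CSV = (
    "Object Name,Maker,Date Made,Ref\n"
    "  Cafe\u0301   engine\x07 ,A. Maker,1900,X-001\n"
)
xml_output = csv_to_dc_xml(SAMPLE_CSV)
print(xml_output)
```

Keeping the per-source mapping in a small table like `FIELD_MAP` is one way to handle several export formats with a single conversion routine; MARC-tagged or XML exports would need their own readers feeding the same `row_to_dc` step.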

As alluded to in my last interview, we further based our final outcome on the OAI model and utilised the RSLP CLD schema, the latter modified to our requirements.

We were also familiar with the CIDOC Reference Model and had contributed some datasets (National Railway Museum) towards its development, but the model was not sufficiently advanced or practical for us to use.

Cost of full implementation, various types of connectivity and upgrade issues across the sites, standardisation and cataloguing time, and human resources in general contributed to our overall choice of model and what we could deliver within deadlines.

I should add that only one of the contractors working with us on digitisation and web infrastructure development was familiar with any metadata standards such as Dublin Core or other schemes. This unfamiliarity added greatly to the cost of development and to the time spent in discussions putting forward our requirements. We ended up writing most of the specifications ourselves and checking the programming to drive the process!

In regard to creating and storing files:
At the most fundamental level of the digitisation strategy for this project, which continues to date, an archival file is created for each resource that is digitally captured, in particular for image files. Digital capture for all images results in a master file produced to an archival (high) resolution depending on the resource type (e.g. monochrome print, or colour transparency).

Master file (archival copy)
  File format:     TIFF
  *File size:      50MB (colour); 18MB (b/w)
  Bit depth:       24-bit (colour); 8-bit (greyscale)
  Colour profile:  RGB (colour); greyscale (b/w)

The master file is stored in TIFF format, an open non-proprietary format which is lossless and will allow for the maximum reuse. A file size of 50MB for colour files and 18MB for greyscale files has been selected, although this is an approximate size only and will ultimately depend on the resource type being digitised and the appropriate archival resolution for each (e.g. colour transparencies >=1200dpi, monochrome print or document >=600dpi).
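To see how the quoted file sizes relate to bit depth, the sketch below works backwards from file size to an approximate pixel count. It assumes uncompressed image data, negligible TIFF header overhead, and decimal megabytes (1 MB = 1,000,000 bytes); the function name is illustrative.

```python
# Approximate pixel count implied by an uncompressed master file.
# Assumptions: no compression, header overhead ignored, 1 MB = 1,000,000 bytes.

def pixels_for_size(size_mb: float, bit_depth: int) -> int:
    """Return roughly how many pixels fit in size_mb at the given bit depth."""
    bytes_per_pixel = bit_depth / 8
    return int(size_mb * 1_000_000 / bytes_per_pixel)


colour_pixels = pixels_for_size(50, 24)     # 24-bit RGB master, about 50MB
greyscale_pixels = pixels_for_size(18, 8)   # 8-bit greyscale master, about 18MB
print(f"Colour:    about {colour_pixels / 1e6:.1f} million pixels")
print(f"Greyscale: about {greyscale_pixels / 1e6:.1f} million pixels")
```

Under these assumptions both targets correspond to images of roughly 17 to 18 million pixels; the colour file is larger mainly because each pixel takes three bytes rather than one.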

While it is possible to create a much smaller master file, it was decided that as standards for digital image files increase, selecting a larger file size from the start would allow for maximum reuse of the image over a longer period of time and prevent the need for rescanning at a higher file size at a later date.

Ideally the central storage of images was to be assisted with transfer through a JANET set-up. However, the National Railway Museum and the National Museum of Photography, Film & Television have at present a 2MB link to the Science Museum.

This link is not robust enough for the transfer of many large files, although a WAN set-up would be highly desirable as part of the solution for the transfer of data files.
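A back-of-envelope calculation shows why such a link struggles with large master files. The sketch assumes the "2MB link" means 2 megabits per second and ignores protocol overhead and contention. The batch of 1,000 files is a made-up figure for illustration, and the function name is hypothetical.

```python
# Idealised transfer time for master files over a fixed-rate link.
# Assumption: "2MB link" is read as 2 megabits per second; overhead is ignored.

def transfer_seconds(size_mb: float, link_mbit_per_s: float) -> float:
    """Seconds needed to send size_mb megabytes over a link of link_mbit_per_s."""
    return size_mb * 8 / link_mbit_per_s


one_file = transfer_seconds(50, 2)        # one 50MB colour master
batch_hours = 1_000 * one_file / 3600     # hypothetical batch of 1,000 masters
print(f"One 50MB file: {one_file:.0f} seconds")
print(f"1,000 such files: about {batch_hours:.0f} hours")
```

On this reading a single 50MB file takes 200 seconds and a batch of 1,000 takes more than two days of uninterrupted transfer. If the link were instead 2 megabytes per second, the times would be eight times shorter.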

Until a higher bandwidth solution is in place, however, these sites ship the digital image assets for uploading to the SAN on DLT tape, or on a portable hard drive (for temporary transit with final storage on DLT).

To reduce the amount of additional effort required for saving images to tape for transfer, image assets captured at the Science Museum are saved directly to the SAN pending network load and security reviews.

The use of DLT tape was recommended by the Technical Advisory Service for Images (TASI) and NOF, and it was deemed to have the optimum lifetime if stored in correct relative humidity conditions. We also referred to the PADI website.

Q8. How long before current hardware/software will be replaced?

Difficult to say as it is a changing landscape — probably 3–5 years for hardware. Software upgrades are more frequent and will depend on the programme.

Have not touched the issue of preserving software programmes! I was familiar with CEDARS (CURL Exemplars in Digital Archives) when it started some time back. Another challenging area.

Q9. What tools are needed for digital curation in addition to current tools or instead of current tools? Which are most important, most needed?

I am not sure what is meant by tools here; it is rather a broad sweep! I think any choice or use of tools needs to be relevant to, and compliant with, internationally recognised and implemented standards. Preferably open source, but not to exclude proprietary formats that are industry recognised, authoritative and reputable.

As mentioned above, I experienced a number of contractors who 'implied' they were knowledgeable of standards or tools, for instance, but could not come up with the goods when it came to implementation.

This is a real problem for any organisation that makes a major investment in a CMS or digitisation infrastructure, for example, but does not have an in-house team that can pick up on such a failing. There needs to be some sort of quality check criteria for tools and for suppliers and companies working in the area.

Q10. What standards are in use or needed? Any protocols such as OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting)? Any models such as the Reference Model for an Open Archival Information System (OAIS)? Any metadata schemas? Any guidelines? Which are most important, most needed?

Standards needed will depend on what is intended, whether the organisation is part of a wider partnership or not (or recognises the benefits of a partnership!), and will also need to be linked to the business objectives of an organisation. I think the latter is often misunderstood or overlooked as the practitioners do not always have the ear of the administration and vice versa so any potential benefit of standards is not given due importance.

Q11. Could you add something about the quantity of data involved, in terms of the total number of files and total volume of data?

I managed both the collections management system and the NOF-Digitise project across the Science Museum and sister sites, and I only have an indication of total volumes just before I left.

In regard to the data volumes relating to the collections management system (MultiMIMSY), this approximated 2GB, representing just over half a million text records.

As for the NOF project, I have the following figures:

  • Text: 3,500 object records catalogued; 26,000 image records
  • Images: 38,000, comprising SCM=20,000; NRM=10,000; NMPFT=8,000
  • Audio: N/A
  • Video: N/A
  • Other: bibliographic records, ca. 26,000

These figures represent totals for the materials digitised by type. However, they do not reflect that the images, for instance, were digitised as a Master TIFF, and then processed in 3 sizes of JPEG (full screen, catalogue record size and thumbnail), and that certain sets of images were also saved as PNGs from the TIFFs.

As of March 2004, the total volume of the image and text files exceeded 3.5 terabytes on the SAN RAID, and this may not represent all the images stored on DLT at the other sites, which held images generated for projects outside of NOF and for the in-house system, iBase.

As touched upon previously, storage was a key issue and it was necessary to decide on an appropriate policy of selection. For instance, our commercial picture library required access to the most 'requested' images and other saleable pictures. The high-end TIFFs and PNGs needed to be readily accessible for the library as well as for clients. The Licensing department had a similar request. Images not used in this sense could be kept stored on DLT. An issue here was appropriate labelling and directory structure so that the right images could be retrieved from the tape. Before I left, there was still indecision in terms of next steps with storage. DVD was also seen as a possibility, but there was a query about the longevity of the medium. And network access through the intranet to the stored images was yet another consideration.

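Q12. Do you see any further issues related to file formats and data volumes curated by museums? And do you foresee any changes in file formats and data volumes to be managed? You mentioned preservation issues relating to software. Do you foresee any additional challenges in future? If so, what are they likely to be?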

Museums may not be in a position to afford the heavy costs of digitisation and storage and provide preservation and access, as well as subsume the costs of hardware/software replacements or upgrades.

While some may argue that hardware is ever lower in price and more storage space can be purchased than ever before, an organisation must have guidance on anticipating total volumes and have an indication of costs of maintenance, resourcing, and any customisations or integration with existing systems.

I believe that more outsourcing options should be made available that would allow museums to be able to store and access their digital assets from data services. Perhaps a museum could have the option of holding lower resolution images that may be used for learning resource creation, e.g. web, and for collections records, and then rely on outsourced services to manage the master files and to provide access to them as needed through a client log-in or other method.

This may compensate for a number of issues relating to hardware/software, management and storage, resourcing costs, and so on.

In terms of file formats, even in the digitisation of images, we experienced mixed formats depending on the purpose and use, and whether an image file should be compressed or not. I took part in the BSI/JPEG2000 study and this seemed to provide a possible solution but few practitioners are aware of its use or application yet.

An ideal direction seems also to be in the area of 'wrappers' that would permit any file format to be read or processed, etc. and then if a file format becomes obsolete, the wrapper would carry a subset of the file information and still could be preserved and accessed.

As for challenges, I think software preservation is a key one, as well as the continuing issue of open source vs proprietary formats, and changing digital platforms: convergence versus diversification. And simply the economic and social ramifications of digital access in general.

The metadata arena has provided further challenges in terms of structuring and relating data across different domains and sectors, and across vocabularies and disciplinary contexts — this is equally interdependent with the digital curation issues.

Q13. How do people get access to the data for commercial use?

The Science Museum has a commercial picture library (AKA Science & Society Picture Library) that deals with commercial clients and researchers, and so on, and manages most of the digital surrogates for commercial sale, and also catalogues the images specifically for its target users. The picture library website has a link to each of the images available on the NOF-funded site (and duly credited as a source), as well as having its own website, and a link on the main Museum website.

Beyond this, the marketing of the picture library is done by the picture library manager and not by the organization so I am not sure of what other syndicated links or affiliations the library may have. It sits in the Trading Company division of the Museum which is largely self-financing, although profits do feed back into the main Museum.

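Q14. How might rights issues relate to the use of material in schools? Could NOF or other Science Museum digital material be repurposed by teachers, for example, as part of new learning objects?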

As part of the NOF agreement, the funded projects MUST clear rights as far as possible, and make the digitised assets appearing on the funded sites free at the point of access. Therefore, schools would be able to make use of the material for educational purposes. This was probably one of the larger stumbling blocks within the Museum because of the time needed to clear rights, especially with the resources available, while still retaining commercial viability (e.g. to reuse assets for commercial purposes without their value being diluted by their appearance on the web). An agreed solution by the internal stakeholders was that each of the web images was to be watermarked so that if higher-end use is to be made, a clean image would need to be requested from the picture library.

The Science Museum, as well as the NOF project, has had a tradition of making available digital material, and authored learning resources, for teachers and students. In particular, the NOF-funded website has a CREATE section (still being tweaked/developed) which allows a user to search for images, gather these into a 'gallery' and make them into e-cards, or their own personalised slideshow with their own authored content alongside.

See the New Opportunities Fund project website and others which I was involved with at the Museum that encompassed digital creation, management and delivery as described over the course of the interviews:

Q15. How else might DCC support your work?

I would definitely like to see DCC provide an advisory service, complementary to the Technical Advisory Service for Images (TASI), but with a more national remit. The advisory service would cover the full life-cycle of digital assets AND their curation. Digital assets here means not just images, but other digital file types, digital ephemera, web archiving, etc.

There might be a benefit (albeit idealistic) if DCC considered a role as a repository much like the National Archives, or perhaps, more realistically, supported a network of repositories of digital assets of 'national significance'. This would involve setting up registries as a foundation, but could be a longer-term goal.

Of course, keeping pace with standards and best practice is key to the current work, as is an advisory role to government and funders of digitisation initiatives. Further supportive work and dissemination strategies, such as workshops and conferences and FAQs/reports, would also be of immediate benefit to users.

And building on this, a role in professional development and course implementation. Perhaps a certification or higher qualification level programme. I know of the short course programme held recently at King's College and this could be a model for DCC, or something to expand on.

I would conclude that the possibilities are quite varied and broad and it would be easy to diffuse the remit and effectiveness of DCC, not least because digital curation itself has wide parameters and contexts! The most benefit might naturally be derived from core and targeted activities based on a needs analysis (users, creators, funders and future gazing!), and from partnerships across the country and internationally that can support and collaborate on larger initiatives and balance the short-term and long-term potential goals of the DCC.

The DCC is funded by

Joint Information Systems Committee