Interview with Richard Wright, Information and Archives, BBC

Richard worked in acoustics, speech and signal processing for US and UK Governments, University College London, RNID and Cirrus Research. Technology manager, BBC Information & Archives (I&A) since 1994. BBC I&A are partners in the JISC-NSF digital libraries project Spoken Word, and have participated in DELOS. They work closely with the two JISC audiovisual digitisation projects (BUFVC-ITN and British Library Sound Archive). The BBC is running its own major in-house audiovisual digitisation project (covering several hundred thousand hours of material).

  1. What does digital curation mean for you, and, in general terms, how do you do it for your data?
  2. Have you considered the Open Archival Information Systems (OAIS) model, and if so, has it been useful for you?
  3. How long is "long-term" preservation for the BBC Archives?
  4. How will digital curation be funded? How long is your funding horizon?
  5. What tools are in use now to manage NOF-digitise data and Science Museum data that is useful for research and/or learning and teaching? - to import data? - to store data? - to locate data? - to retrieve data? - to package data to send out?
  6. What tools are needed for digital curation in addition to current tools or instead of current tools? Which are most important, most needed?
  7. How long before current hardware/software will be replaced?
  8. What standards are in use or needed? Any protocols such as Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH)? Any models such as the Reference Model for an Open Archival Information System (OAIS)? Any metadata schemas? Any guidelines? Which are most important, most needed?
  9. It would be interesting to know if you have any recommendations to those who create the material that will end up in your archives about tools or standards they could use to make your work easier ; if so, which tools or standards are recommended?
  10. To summarise, the standard file formats for your media data are BWF (broadcast wave format, the European Broadcasting Union variant on WAV) for audio and MXF (which can wrap BWF) for video. And you also hold 14.5k hours of Real Audio.
  11. Would you say that this standardisation of archival formats is stable both at the BBC and across the EBU, or do you foresee changes?
  12. I got the impression that the metadata is held as data within Informix, but also perhaps as an electronic document of some kind — or does BRS/Search simply process the Informix data? And I suppose that the electronic documents you manage are in various formats?
  13. Could you provide an estimate of the quantities of data you are managing, in terms of total volume of data, total number of files?
  14. You raised the issues of security of media files online, of ensuring the integrity of files on carriers and the carriers themselves, of version control and provenance tracking. Are there any challenges relating particularly to file formats and data volumes that you are having to address now, and if so, how are you addressing them? And what challenges do you foresee in the future?
  15. Data reuse: How is the BBC's archived data reused? In new programmes? In creating resources for learning and teaching?
  16. For this reuse, is it read by programs or by people?
  17. How might DCC support digital curation work? For example, could it help by providing: - Services, e.g. advice or others? - Registries? - Professional development opportunities?

Q1. What does digital curation mean for you, and, in general terms, how do you do it for your data?

For our audiovisual signals recorded as digital data on various media, we apply essentially the same process we've applied from time to time over nearly 50 years: migration of the data to new carriers. Mainly in the past this has been analogue to analogue migration (e.g. LP to 1/4" audio tape; nitrate to acetate film). Currently we are in the middle of a 10 to 20 year analogue to digital migration (transfer). We have begun limited digital to digital migration, e.g. from DAT audio tape to "audio CD" (and files on DVD-ROM).

We eventually will migrate from discrete (hold-it-in-your-hand, put-it-on-a-shelf) media to mass storage, though if the mass storage is data tape, the 'overspill' from even a large data tape robot may consist of data tape cassettes that will or at least could sit on shelves, exactly as for our present discrete digital media (CDs, digital video tape).

Our general principle, for our audiovisual material, is not across-the-board preservation for perpetuity. As with records management, we have review dates and a retention policy (agreed collectively across the BBC). Not all material entering the archive is kept forever. It may be deleted after 5, 10 or more years for various reasons. Generally there is consultation with the originators of the material before deletion, and material is also offered to the British Film Institute (BFI) and British Library Sound Archive.

For material that is to be retained, and which is on an obsolete or decaying carrier, we apply the migration process in order to "maintain the viability of the holding" and maintain access.

We also apply the migration process just for obsolescence of the format, purely to maintain access — in this case we would generally also retain the original (mainly this happens now for film, which needs a 'proxy' on a current videotape format in order to be readily used).

Q2. Have you considered the Open Archival Information Systems (OAIS) model, and if so, has it been useful for you?

We've heard of the OAIS model, through project work with universities. Broadcast archives are really still entering the digital world. Our main holdings remain largely analogue, and almost exclusively discrete media on shelves. Hence conventional archive management models obtain. Although many broadcasters now have some experience of mass storage, the material on mass storage remains a tiny part of total holdings. As our mass storage systems develop, our appreciation of OAIS and other techniques for digital management will also have to develop.

Because of our history of (or resignation to) migration, we aren't as worried as many others are about issues like obsolescence of digital formats — because of our decades of facing obsolescence of analogue formats. We have problems equivalent to "obsolete document file formats" for audiovisual material, but as we're committed to migration we expect to continue to migrate. In fact we look forward to digital files on mass storage, because they are far cheaper to migrate than data on discrete carriers.

Q3. How long is "long-term" preservation for the BBC Archives?

For the BBC, national programmes that have entered the main archive and been fully catalogued have not, in general, been deleted. The deletions within the retention policy mainly apply to 'contribution material' i.e. components (rushes) of a final programme, or untransmitted material. Hence, "long-term" for "national programmes that have entered the main archive and been fully catalogued" means in perpetuity. We have already kept some material for more than 75 years, including multiple format migrations.

Q4. How will digital curation be funded? How long is your funding horizon?

We call it "preservation"; the digital part is a means to an end and isn't highlighted. It is funded by special, specific internal funding, awarded after "making a case" up to the Director General and Board of Governors level. We currently fund about £4M of preservation work per year, scheduled until 2010.

Essentially the longest range funding is for 10 years, though money is awarded in three-year chunks and within a 3 yr period there are annual reviews and funding "adjustments". We are in year 6 of our first ten-year plan, and have just submitted preliminary planning data (because of BBC charter review) for a second 10 years.

Q5. What tools are in use now to manage NOF-digitise data and Science Museum data that is useful for research and/or learning and teaching? - to import data? - to store data? - to locate data? - to retrieve data? - to package data to send out?

We are in the middle of the first stage of a two-stage conversion from analogue-on-shelves to digital-on-mass-storage (of some sort). The first stage is conversion to 'digital on shelves': CDs, DATs, digital video tape on shelves.

We also have data about this media collection, stored in databases (of data about data) — metadata so-called, diverging from the use of the word in IT (where metadata can be a lot of things, but it can't be data).

We also have digital textual data: electronic documents — which are leading the way (for us) to the digital future.

So: import of digital data is a standard library 'acquisition' process for audio and video material. The key issue is having a unique identifier on discrete material (or on files on mass storage), and the BBC is moving to using the Society of Motion Picture and Television Engineers (SMPTE) Unique Material Identifier (UMID).

As programme production moves to server-based systems, we move to acquisition by electronic import — first of metadata (generally moving to use of XML for this) and then of media files. The standard file formats for our media are BWF for audio and MXF for video.

Finally, we're using the docs mgmt application Livelink in pools across the BBC, as we gradually move toward corporate docs mgmt. Livelink has import, store, locate, retrieve and export functions.

Storage: we have 14.5k hours of Real Audio on a server, backed up by copies on DVD. All the rest of our media files are on CD or DVD, in anticipation of a proper BBC-wide mass storage system.

Locate: We use an Informix database for our TV and Radio catalogue, which has conventional search tools. In addition we use a version of Universal Decimal Classification (UDC) for subject indexing — using facets covering location, time, type of shot and genre. These tools are from the analogue days, still applicable however to digital on shelves AND to digital files on mass storage (when we get there).

We DON'T have a direct link between our catalogue and location in mass storage (because our mass storage is shelf based, so the standard 'shelf number' locating process still works). We use direct storage of the URL in the database for access to our 14.5k hours of Real Audio. A proper integration of our catalogue with mass storage will need the important step of maintaining the link between the catalogue and the location in online storage (or hierarchical storage). All digital library products recognise this need — we just don't have one yet.
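The link described above, between a catalogue record and the place where the item actually lives, can be sketched as a small table. This is only an illustration under assumptions: the table and column names, the sample identifiers and the example URL are hypothetical, and SQLite stands in for the real catalogue database.

```python
import sqlite3


def open_catalogue():
    """Create an in-memory catalogue with one row per known copy of an item."""
    db = sqlite3.connect(":memory:")
    db.execute(
        """
        CREATE TABLE location (
            item_id   TEXT NOT NULL,   -- hypothetical catalogue identifier
            kind      TEXT NOT NULL CHECK (kind IN ('shelf', 'online')),
            reference TEXT NOT NULL    -- shelf number, or URL for an online copy
        )
        """
    )
    return db


def add_location(db, item_id, kind, reference):
    """Record where one copy of an item is held."""
    db.execute("INSERT INTO location VALUES (?, ?, ?)", (item_id, kind, reference))


def locate(db, item_id):
    """Return (kind, reference) pairs for an item, online copies first."""
    rows = db.execute(
        "SELECT kind, reference FROM location WHERE item_id = ? "
        "ORDER BY kind = 'online' DESC, reference",
        (item_id,),
    )
    return rows.fetchall()


if __name__ == "__main__":
    db = open_catalogue()
    add_location(db, "ITEM-0001", "shelf", "A12/0345")
    add_location(db, "ITEM-0001", "online", "http://archive.example/item-0001.ra")
    print(locate(db, "ITEM-0001"))
```

Keeping shelf and online locations in the same table means the familiar 'shelf number' lookup and the URL lookup go through one query, which is the kind of maintained link the answer says is still missing.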

We also use free-text search (brand: BRS/Search) on the TV and Radio catalogue, and have implemented a commercial federated search tool (though not yet pointed that at our media catalogue; at present it searches our licensed external databases).

Retrieve/package/send out: several daily vans from the archive to the rest of the BBC in London, and overnight vans to the regions. We have an internal, bespoke 'content delivery network' where we're starting to distribute media files in a small way. We have also done drag-and-drop across the WAN from server to server for limited distribution of media files. This electronic delivery process needs to grow enormously, be automated, and have better BBC-wide IT-network bandwidth to support it.

Q6. What tools are needed for digital curation in addition to current tools or instead of current tools? Which are most important, most needed?

We need mass storage, and then we need as stated a method for automated electronic delivery of media files. This will include multiple levels: retrieve textual description; see the keyframes; view the browse-quality copy; select portions of real interest; retrieve the production-quality material (just the selected portions). These tools exist in the general 'media management/asset management' world, but aren't implemented across the BBC. These are access tools, not curation tools, I guess.

For "curation" — my main concern is security of the media files online. We're asking the storage industry about tools for checking/refreshing files on carriers, and carriers themselves (e.g. data tape). We know several companies that are starting to improve such tools for data-tape robotic systems (FPDI, Hi-Stor, Grau), but we also know various archives with robots that feel existing tools are inadequate. We're trying to learn from them.
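One common way to check files on a carrier is to record a checksum for each file and recompute it later. The sketch below illustrates that general technique only; it is not a description of the vendors' tools mentioned above, and the function names and the choice of SHA-256 are assumptions made for the example.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(root):
    """Map each file under root (by relative path) to its digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify(root, manifest):
    """Return (name, problem) pairs for files that are missing or changed."""
    root = Path(root)
    problems = []
    for name, expected in manifest.items():
        p = root / name
        if not p.is_file():
            problems.append((name, "missing"))
        elif file_digest(p) != expected:
            problems.append((name, "changed"))
    return problems
```

A manifest built when files are first written can be re-verified on a schedule, and again after every migration to a new carrier. Any non-empty result from `verify` is a signal to restore from another copy.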

As we get mass storage with effective maintenance software, we'll also need general 'digital library' or 'docs mgmt' tools that book electronic files in and out, track them, perform version control and provide provenance. Again, these exist now for docs mgmt, but need to migrate to broadcasting, and then be specifically integrated with the BBC systems.

Q7. How long before current hardware/software will be replaced?

I know five European media archives that are now on their 2nd tape robot: INA, RAI, Austrian Mediathek, SVT, SWR. In broadcasting, we've always complained that our formats were so short-lived, compared to film (nearly a century) and paper (many centuries). Video and audio formats were changing in less than 10 years. Now it looks like IT systems will continue to change significantly (to the point of needing replacement) in under 5 years. So the rate of 'churn' is going to be at least double, and media archivists generally are resigned to permanent migration.

Q8. What standards are in use or needed? Any protocols such as Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH)? Any models such as the Reference Model for an Open Archival Information System (OAIS)? Any metadata schemas? Any guidelines? Which are most important, most needed?

Standard file formats for exchange of audiovisual media are important. We in the BBC and across the EBU, at least, use the EBU variant on WAV (= BWF, broadcast wave format) for audio, and are adopting MXF for video (it can also wrap BWF, so MXF is the most general file format).

Metadata: There are many options including schemes from the SMPTE and EBU and proprietary schemes too. The conclusion is that broadcasting is just big enough to ignore the outside world and invent its own standards (as it has to for things like digital broadcasting) — but not big enough for these standards to really matter to the general world, which now equals the web world for real commonality and interchange.

Q9. It would be interesting to know if you have any recommendations to those who create the material that will end up in your archives about tools or standards they could use to make your work easier ; if so, which tools or standards are recommended?

We've generally been at the tail end of the production process: the BBC makes programmes (on something; on some sort of media); we get the something. Broadcasting is a pretty controlled industry — there is an official BBC document on what a programme-maker has to deliver, in terms of both media and metadata. The metadata needs to move from paper to electronic, complicated by the amount of external production. We try to convince the BBC that all electronic systems supporting metadata need 'input control' — use of authority and controlled vocabulary and anything else to prevent freestyle input and all its attendant errors. This is standard 'how to raise the quality of IT systems' stuff, not specifically digital curation.
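The 'input control' idea above can be sketched as a check of each metadata field against a controlled vocabulary before a record is accepted. Everything named here is hypothetical: the `genre` field, the vocabulary terms and the function names are illustrative assumptions, not the BBC's delivery specification.

```python
# Hypothetical controlled vocabulary for a 'genre' field.
GENRES = frozenset({"news", "drama", "documentary", "music", "sport"})


def normalise(term):
    """Lower-case a term and collapse surrounding or repeated whitespace."""
    return " ".join(term.strip().lower().split())


def check_record(record, vocabularies):
    """Return (field, problem) pairs; an empty list means the record passes.

    `vocabularies` maps a field name to the set of terms allowed for it.
    """
    errors = []
    for field, allowed in vocabularies.items():
        raw = record.get(field)
        if raw is None or not raw.strip():
            errors.append((field, "missing"))
        elif normalise(raw) not in allowed:
            errors.append((field, f"not in controlled vocabulary: {raw!r}"))
    return errors


if __name__ == "__main__":
    print(check_record({"genre": " Drama "}, {"genre": GENRES}))
    print(check_record({"genre": "dramma"}, {"genre": GENRES}))
```

Rejecting a misspelt term at the point of entry is cheaper than cleaning it out of the catalogue later, which is the point of preventing freestyle input.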

We hope to convince the BBC that it makes assets, and we (the archive) control those assets, birth to death. This is the document management model, extended to media in general. To the extent that this model becomes reality, we will also control metadata from birth to death, so we will be responsible for our own input. That's our goal — controlling the media that enters our archives from birth to death, so that archiving is central, not at the end.

I really don't know how much OAIS or METS or digital library technology or even standard library technology will play a role, because broadcasting stands outside that whole world, and largely ignores it!

Q10. To summarise, the standard file formats for your media data are BWF (broadcast wave format, the European Broadcasting Union variant on WAV) for audio and MXF (which can wrap BWF) for video. And you also hold 14.5k hours of Real Audio.

Real Audio is compressed (heavily) and so would be unethical as an archive format (unless nothing better were available). We archive the uncompressed signal as BWF files, and make Real Audio files as an 'access copy' for intranet access.

Q11. Would you say that this standardisation of archival formats is stable both at the BBC and across the EBU, or do you foresee changes?

The audio file format BWF is standard and relatively stable, though it is evolving to cope with technology development, principally multichannel audio (the surround sound of the cinema, coming to a home cinema near you — soon).

There are still multiple competing formats for video: proprietary ones like Microsoft AVI and SONY IMX compete with MXF. AVI is a very general format holding lots of options that cover media uses well outside broadcasting (mainly the web and domestic markets), so the take-up of AVI files is huge. IMX is for professional video. I think MXF is beating IMX, and that MXF will dominate as the pro video file. Certainly most broadcasters try to adopt open standards where available, as a matter of policy — we hate being tied to a single source/vendor.

Q12. I got the impression that the metadata is held as data within Informix, but also perhaps as an electronic document of some kind — or does BRS/Search simply process the Informix data? And I suppose that the electronic documents you manage are in various formats?

BRS/Search processes the Informix data. As a separate BBC activity, our department also manages the BBC registries and associated long-term paper document store (in Caversham). Our 'docs mgmt' area is moving into electronic files and mass storage faster and further than for audiovisual media. We are using commercial document management software, mainly Livelink. This has been used on small projects, and is now being trialled more extensively across the BBC.

BBC is mainly a 'Microsoft house' — we have a standardised 'desktop' = PC configuration and software, including M/S Office. So documents should be mainly Word and PDF, and some Excel spreadsheets. Specialist areas use special packages like Autocad and audiovisual editing which are much more problematic and I'm not sure that we've done much about archiving those. We are now managing the paperwork for about 2700 people including HR, legal, parts of rights and other major sources of significant documents. So this is roughly 20 to 25% of total BBC official paperwork, an increase from 0% 3 to 5 years ago, and the target is 75% in 3 years. Storage is on ordinary network servers (as Livelink datasets), backed up to tape following the same process as for any BBC servers. Also the BBC sends itself 1 million e-mails per day, which also get backed up somewhere. E-mail doesn't get handled by our department at all — it's purely handled by the IT dept.

Our documents department are lobbying for a comprehensive system that would include migration / emulation to ensure stored docs can be read again — at present they are stored as Livelink datasets.

We also have a New Media Archivist who is liaising with the National Archive regarding their registry and collection of 'legacy software' — aimed to try to provide a platform for running things like old Autocad files.

Q13. Could you provide an estimate of the quantities of data you are managing, in terms of total volume of data, total number of files?

The archive is managing about 700,000 digital items, but mainly these remain discrete media (digital video tape, CDs, DVDs). We have about 280,000 actual master files — digitised from U-Matic video and 1/4" audio originals, and from (film) magnetic sound tracks. We have another 60,000 viewing-quality video files, but these are held on CD-ROM in anticipation of a mass storage system (we started anticipating in 1995!).

Total digital data:
12 petabytes, mainly on digital videotape

Total data in files:
8000 hours of U-Matic = 80 terabytes; 80,000 hours audio = 50 terabytes; total 130 terabytes

Growth rate: about 10,000 hours/year = 400 terabytes, mainly on dig
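The figures above imply an average data rate for each kind of material, which is a useful sanity check. The short calculation below uses only the numbers quoted in the answer; the assumption added here is decimal units (1 terabyte = 1000 gigabytes = 10^12 bytes).

```python
def gb_per_hour(terabytes, hours):
    """Average storage per hour of material, in gigabytes (decimal units)."""
    return terabytes * 1000 / hours


def mbit_per_second(terabytes, hours):
    """Average bit rate implied by a volume and a duration, in Mbit/s."""
    bits = terabytes * 1e12 * 8
    return bits / (hours * 3600) / 1e6


# (terabytes, hours) exactly as quoted in the interview answer.
HOLDINGS = {
    "U-Matic video files": (80, 8_000),
    "audio files": (50, 80_000),
    "annual growth": (400, 10_000),
}

for label, (tb, hours) in HOLDINGS.items():
    print(
        f"{label}: {gb_per_hour(tb, hours):.3f} GB/hour, "
        f"{mbit_per_second(tb, hours):.1f} Mbit/s"
    )
```

On these figures the video files average 10 GB per hour and the audio files 0.625 GB per hour (about 1.4 Mbit/s, close to uncompressed CD-quality stereo), while the annual growth averages 40 GB per hour.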

Q14. You raised the issues of security of media files online, of ensuring the integrity of files on carriers and the carriers themselves, of version control and provenance tracking. Are there any challenges relating particularly to file formats and data volumes that you are having to address now, and if so, how are you addressing them? And what challenges do you foresee in the future?

One challenge is multiple versions of 'the same programme' — in archive quality, intranet viewing quality and in web quality. There is industry software for checking that an encoded file is actually "coherent" — a file can be correct according to the operating system and file system, and still not be properly encoded. You sometimes see this with ZIP files that move around OK but won't open. All encoded media files are like ZIP files in this regard: they have two sets of rules to follow (operating/file system, and the encoder/decoder). The BBC need to do more about the second sort of checking for its encoded files.
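The ZIP comparison can be made concrete. In the sketch below a ZIP archive is built in memory, one byte of its payload is flipped, and the archive's own integrity test is run. The damaged copy is the same length and would copy between file systems without complaint, but it fails the second, format-level set of rules. The function names and sample payload are made up for the example; only Python's standard `zipfile` module is used, and the entry is stored uncompressed so that the damage shows up as a checksum mismatch.

```python
import io
import zipfile
import zlib


def make_stored_zip(name, payload):
    """Build a one-entry ZIP archive in memory, without compression."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
        zf.writestr(name, payload)
    return buf.getvalue()


def zip_is_coherent(data):
    """True if every entry reads back and matches its stored CRC-32."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            return zf.testzip() is None
    except (zipfile.BadZipFile, zlib.error):
        return False


def flip_first_payload_byte(data, payload):
    """Return a copy of the archive bytes with one payload byte inverted."""
    pos = data.find(payload)
    damaged = bytearray(data)
    damaged[pos] ^= 0xFF
    return bytes(damaged)


if __name__ == "__main__":
    payload = b"programme audio essence " * 10
    good = make_stored_zip("clip.dat", payload)
    bad = flip_first_payload_byte(good, payload)
    print(len(good) == len(bad), zip_is_coherent(good), zip_is_coherent(bad))
```

The same two-level idea applies to encoded media files: a byte-for-byte copy check says nothing about whether a decoder can actually open the result.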

There is experimental software for attempting to ensure that various different files (proxies) of "the same programme" are mutually coherent. The BBC has made no use of such software, as it is still in the lab stage (and may never get out).

The new media file formats like MXF are hugely complex — really they are wrappers, so they introduce a third layer of rules. MXF has been introduced with a toolkit supporting some sorts of file checking — but the very fact of wrapping files within a larger unit, makes it harder to check the integrity of such elements.

Finally, we have made a start at web archiving, and that is now generating a certain volume of data — surprisingly small because we're getting only the text. All the streamed media is separately stored (or not; big remaining unsolved issue, whether streamed material can or should be archived).

Q15. Data reuse: How is the BBC's archived data reused? In new programmes? In creating resources for learning and teaching?

About 25% of the BBC archive is accessed (taken off the shelf) per year. At the moment the BBC Archive issues on average 155,000 items per month. The material can be used for a variety of purposes; preview, research, clip, selection, repeat, and so on.

These figures do not include commercial sales of whole programmes, which are dealt with separately by our commercial subsidiary, BBC Worldwide.

Some of our new programmes are of course "resources for learning and teaching". I don't have figures on how much goes to broadcast, to the web, and 'other'. I do know that BBC Radio 6, Radio 7, and significant parts of BBC TV 4 are archive material, some rebroadcast (Radio 7) but some also 'repurposed' as compilations, historical perspectives, comparisons with today, tracking performing artists through the decades and so on.

Q16. For this reuse, is it read by programs or by people?

The metadata is read by people, though as part of the overall business of broadcasting we have some metadata that is moved between computer systems, and some of that eventually hits the viewer as an EPG (Electronic Programme Guide) where it is once again read by a human.

What will be needed in 50 years to re-use your data?

Our whole approach is based on high-turnover and continuous migration, so in 50 years we will have migrated data into current formats — probably 10 times! The life expectancy of mass storage systems appears to be less than 5 years — even shorter than the life span of digital tape formats. I know many broadcasters (well, 5) that are currently migrating onto their 2nd storage robot.

Q17. How might DCC support digital curation work? For example, could it help by providing: - Services, e.g. advice or others? - Registries? - Professional development opportunities?

The BBC archive is based in a business, not an academic department or a conventional department. A potential gap in industrially-based repositories is contact with leading academic thinking, which leaves us at the mercy of vendors and the commercial market. Academics are developing trusted digital repositories, while the commercial market is selling us "storage solutions", often highly proprietary and often with no distinction between backup and archiving.

So I would hope the DCC would get out into the world enough to recognise the real need we have for enlightenment, leadership, guidance and support — and practical, available, supportable digital storage, access and delivery systems (working systems) — as alternatives to the commercial line from EMC, IBM, HP and the general storage industry.

I suspect DCC may not see itself as having anything to do with the 'storage industry' — the horny-handed suppliers of tape and other media, the hardware they plug into, and the software that runs them. But in my view the software for management of storage media, to manage monitoring, refreshing and migrating data, is weak at present and is an area where DCC could provide guidance and standards, or even kite-mark or reference good practice.

I'm especially concerned about hard drives. People who buy tape robots get some sort of maintenance software, sometimes — but people who are now able to cobble together larger and larger arrays of discs may have no monitoring software, poor backup software, and nothing that looks like archiving or 'repositing' software — in short, they lack the key element of curation — the care-taking element. It is now fairly easy to put together about 10 terabytes of hard drives run from a desktop PC — for something like £10,000 to £20,000. I suspect that projects all over the UK are starting to amass these systems, and most of what's going on them is very much at risk. The latest generation of large domestic hard drives — the ones in the hundreds of gigabytes — appear to have something like 20% failure rates in the first year. If 10,000 local projects each lose 1 terabyte of data from that and similar problems — that's a loss of 10 petabytes of data. We've all lost data, but the new mass technology introduces the prospect of massive losses. I hope the DCC will consider low-level and practical issues like small projects with inadequate security for their 10 terabyte stores, and the general industry issue I first mentioned: the need for guidance, advice and good references in the 'storage sector'.
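The arithmetic behind that warning can be worked through briefly. The 20% first-year failure rate, the 10 terabyte array and the 10,000 projects come from the answer above. The drive size of 250 gigabytes (giving 40 drives) and the assumption that drives fail independently of one another are added here purely for illustration.

```python
def prob_any_failure(n_drives, failure_rate):
    """Chance that at least one of n independent drives fails in the period."""
    return 1.0 - (1.0 - failure_rate) ** n_drives


def expected_failures(n_drives, failure_rate):
    """Average number of drives expected to fail in the period."""
    return n_drives * failure_rate


def total_loss_petabytes(n_projects, tb_lost_each):
    """Aggregate loss across projects, in petabytes (1 PB = 1000 TB)."""
    return n_projects * tb_lost_each / 1000


if __name__ == "__main__":
    drives = 10_000 // 250  # hypothetical: 10 TB built from 250 GB drives
    print(drives, expected_failures(drives, 0.2))
    print(prob_any_failure(drives, 0.2))
    print(total_loss_petabytes(10_000, 1))
```

Under these assumptions a 40-drive array should expect about 8 drive failures in its first year, and the chance of getting through the year with none is far below 1%. Without monitoring and backup, some data loss on such a system is therefore the expected outcome rather than bad luck.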

Practical ways to provide guidance have already been developed by ERPANET, setting a model for developing and communicating expertise, by web and by workshops. I'd like to see the DCC take the ERPANET approach, to giving us all support in digital data and storage technology/philosophy/strategy — without losing touch with practical realities and practical help.

The DCC is funded by

Joint Information Systems Committee