Question on approaches to curating textual material

24 July, 2007
Dave Thompson, Digital Curator at the Wellcome Library, asked a question on the Digital Preservation list (which is not well set up for discussion just now). I've replied, but we agreed I would adapt my reply for the blog for any further discussion that might emerge.
"I'm looking for arguments for and against when, and if, digital material should be normalised. I'm thinking about the long term management of textual material in proprietary formats such as MS Word. I see three basic approaches on which I'm seeking the list's comments and thoughts.

The first approach normalises textual material at the point of ingestion, converting all incoming material to a neutral format such as XML immediately. This would create an open format manifestation with the aim of long term sustainable management.

The second approach would be one of 'wait and see', characterised by recognising that if a particular format isn't immediately 'at risk' of obsolescence why touch it until some form of migration becomes necessary at some future point.

The third approach preserves the bitstream as acquired and delivers it in an unmodified form upon request, ie MS Word in – MS Word out.

The first approach requires tools, resources and investment immediately. The second requires these same resources, and possibly more, in the future. The future requirements for the third approach are perhaps unknown aside from that of adequate technical metadata.

I'm interested in ideas about the sustainability of these approaches, the costs of one approach over the other and the perceived risks of moving material to an open format sooner rather than later. I'd be very interested in examples of projects which have taken either approach."
Dave, the questions you ask have been rumbling on for years. The answers, reasonably enough, keep changing, partly depending on who asks and who answers, but also depending on the time and the context. So that's a lot of help, isn't it?

You might want to look at a posting in David Rosenthal's blog on Format Obsolescence as the Prostate Cancer of Preservation (for younger and/or non-male curators, the reference is that many more men die WITH prostate cancer than die because of it). Lots of food for thought there, and some of the same themes I was addressing in my Ariadne article a year or so ago.

The simplest answer to your question is "it depends". If you've got lots of money, and given the state of flux right now in the word processing market, I would suggest doing both (1) and (3): that is, make sure you preserve your ingested bits unchanged, but also create a "normalised" copy in your favourite open format.
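Doing both (1) and (3) at ingest might look something like the sketch below. It is only an illustration: `IngestRecord`, `ingest` and `original_is_intact` are names made up here, the `normalise` callable stands in for whatever converter you actually trust, and SHA-256 is just one reasonable choice of fixity checksum.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class IngestRecord:
    """Hypothetical record: the untouched bits, their checksum, and an optional normalised copy."""
    original: bytes
    original_sha256: str
    normalised: Optional[bytes]


def ingest(data: bytes,
           normalise: Optional[Callable[[bytes], bytes]] = None) -> IngestRecord:
    """Keep the original bitstream as received and, if a converter is supplied,
    store a normalised copy alongside it (never instead of it)."""
    digest = hashlib.sha256(data).hexdigest()
    normalised = normalise(data) if normalise is not None else None
    return IngestRecord(original=data, original_sha256=digest, normalised=normalised)


def original_is_intact(record: IngestRecord) -> bool:
    """Fixity check: recompute the checksum and compare it with the one taken at ingest."""
    return hashlib.sha256(record.original).hexdigest() == record.original_sha256
```

Calling `ingest(data)` with no converter is option (3) on its own; passing a converter adds the option (1) copy without ever replacing the original.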

What format should that be? Well for Word at the moment it might be sticky. PDF (strictly PDF/A if we're into preservation) might be appropriate. However as far as ever extracting useful science from the document is concerned, the PDF is a hamburger (as Peter Murray-Rust says; he reports Mike Kay as the origin: "Converting PDF to XML is a bit like converting hamburgers into cow"). PDF is useful where you want to treat something exactly as page images; it is probably much less useful for documents like spreadsheets (where the formulae are important).

Open Document Format is an international standard (ISO/IEC 26300:2006) supported by Open Source code with a substantial user and developer base, so its long term sustainability should be pretty strong. I've heard that there can be glitches in the conversions, but I have no experience (the Mac does not seem to be quite so well served). Office Open XML has been ratified by ECMA, and is moving (haltingly?) towards an ISO standard. Presumably its conversion process will be excellent, but I don't know of much open source code base yet. However the user base is enormous, and MS seems to be getting some messages from its users about sustainability. Nah, right now I would guess ODF wins for preservation.

It may not apply in this case, but often there is a trade-off between the extent of the work you do to ensure preservation (and the complexity and cost of that work), and the amount of stuff you can preserve. Your budget is finite, right? You can't spend the money twice. So if you over-engineer your preservation process you will preserve less stuff. The longevity of the stuff in AHDS, it turns out, was affected much more by a policy change than by any of the excellent work they did preserving it. You need to do a risk analysis to work out what to do (which is not quite the same as a crystal ball; few would have seen the AHRC policy change coming!).

It's also probably true that half or more of the stuff you preserve will not be accessed for a very long time, if ever. Trouble is (as the captains of industry are reported to say about the usefulness of their marketing budgets, or librarians about their acquisitions) you don't know in advance which half.

Greg Janee of the NDIIPP NGDA project gave a presentation at the DCC (PPT) a couple of years ago, in which he introduced Greg's equation:
Item is worth preserving for time duration T if:
(intrinsic value) * ProbT(usage) > SumT(preservation costs) + (cost to use)
... ie given a low probability of usage in time T, preservation has to be very cheap!
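To see how quickly a low probability of use squeezes the budget, here is a minimal sketch of the inequality. The function name, the parameter names and the numbers are mine, not Greg's, and it assumes you can put value and costs in the same (arbitrary) units, which is of course the hard part.

```python
from typing import Iterable


def worth_preserving(intrinsic_value: float,
                     usage_probability: float,
                     preservation_costs: Iterable[float],
                     cost_to_use: float) -> bool:
    """Evaluate Greg's inequality for one item over a period T.

    preservation_costs holds the costs incurred during T (for example one
    figure per year); they are summed to give SumT(preservation costs).
    """
    if not 0.0 <= usage_probability <= 1.0:
        raise ValueError("usage_probability must be between 0 and 1")
    expected_benefit = intrinsic_value * usage_probability
    total_cost = sum(preservation_costs) + cost_to_use
    return expected_benefit > total_cost


# An item valued at 1000 units with a 1% chance of use has an expected benefit of 10.
cheap = worth_preserving(1000, 0.01, [1, 1, 1], 5)   # total cost 8
costly = worth_preserving(1000, 0.01, [2, 2, 2], 5)  # total cost 11
```

With these made-up figures `cheap` comes out true and `costly` false: one extra unit a year on the preservation side is enough to tip the balance.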

What I'm arguing for is not putting too much of the cost onto ingest, but leaving as much as reasonable to the eventual end user. After all, YOU pay the ingest cost. Strangely, so, in a way, does the potential end user whose stuff was not preserved if you spent too much on ingest. You do need to do enough to make sure that end use is feasible, and indeed appropriate in relation to comparator archives (you don't want to be the least-used archive in the world). You also must include, in some sense or other, the Representation Information to make end use possible.

But you don't have to constantly migrate your content to current formats to make it point-and-click available; in fact it may be a disservice to your users to do so. Migration on request has always seemed to me a sensible approach (I think it was first demonstrated by Mellor, Wheatley & Sergeant (Mellor 2002 *) from the CAMiLEON project based on earlier work in the CEDARS project, but also demonstrated by LOCKSS). This seems pretty much your second approach; you just have to ensure you retain a tool that will run in a current (future) environment, able to migrate the information. Unless you have control of the tool, this might suddenly get hard (when the tool vendor drops support for older formats).
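As a rough sketch of the idea (not the CAMiLEON or LOCKSS implementation; the class, its methods and the format labels are all invented for illustration), migration on request boils down to keeping the original bits and a registry of migration tools, and only running a tool when somebody actually asks for the item:

```python
from typing import Callable, Dict, Optional, Tuple

Migrator = Callable[[bytes], bytes]


class MigrationOnRequestStore:
    """Hypothetical store: originals are never altered; migration happens at delivery time."""

    def __init__(self) -> None:
        self._originals: Dict[str, Tuple[str, bytes]] = {}
        self._migrators: Dict[Tuple[str, str], Migrator] = {}

    def ingest(self, item_id: str, fmt: str, data: bytes) -> None:
        # Ingest is cheap: record the format label and keep the bitstream as acquired.
        self._originals[item_id] = (fmt, data)

    def register_migrator(self, source_fmt: str, target_fmt: str, migrator: Migrator) -> None:
        # The tool can be added (or replaced) long after ingest.
        self._migrators[(source_fmt, target_fmt)] = migrator

    def deliver(self, item_id: str, target_fmt: Optional[str] = None) -> bytes:
        fmt, data = self._originals[item_id]
        if target_fmt is None or target_fmt == fmt:
            return data  # "Word in, Word out"
        try:
            migrator = self._migrators[(fmt, target_fmt)]
        except KeyError:
            raise LookupError(f"no migration tool from {fmt} to {target_fmt}") from None
        return migrator(data)


# Toy example: a Latin-1 text file stands in for an ageing format.
store = MigrationOnRequestStore()
store.ingest("memo-1", "latin1-text", "café".encode("latin-1"))
store.register_migrator("latin1-text", "utf8-text",
                        lambda raw: raw.decode("latin-1").encode("utf-8"))
migrated = store.deliver("memo-1", "utf8-text")
```

The `LookupError` branch is exactly the risk mentioned above: if nobody maintains a migrator for the old format, delivery in a current format stops working, although the original bits are still there.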

I've often thought, for this sort of file type, that something like the suite might be the right base for a migration tool. After all, someone's already written the output stage and will keep it up to date. And many input filters have also already been written. If you're missing one, then form a community and write it, presto the world has support for another defunct word processor format (yeah I know, it's not quite that easy!).

I was going to argue against your option 3 (although it's what most repositories do just now). But I think I've talked myself round to it being a reasonable possibility. I would add a watching brief, though: you might decide at some point that the stuff was getting too high risk, and that some kind of migration tool should be provided (in which case you're back to option 2, really).

I get annoyed when I hear people say (what I probably also used to say) that institutional repositories are not for preservation. It's like Not-for-Profit companies; they may not be for profit, but they'd better not be for loss (I used to be on the Board of two). Repositories are not for loss. They keep stuff. Cheaply. And to date, as far as I can see, quite as well as expensive preservation services!

* MELLOR, P., WHEATLEY, P. & SERGEANT, D. (2002) Migration on Request, a Practical Technique for Preservation. Research and Advanced Technology for Digital Libraries: 6th European Conference, ECDL 2002, Rome, Italy, September 16-18, 2002. Proceedings.
