Because good research needs good data

Specifications again

Chris Rusbridge | 06 January 2009

The previous post was a summary with relatively little comment from me. I really liked David Rosenthal's related blog post, but I feel I do need to make some comments. I'm not sure this isn't yet another case of furiously agreeing!Near the end of his post, following extensive argument based partly on his own experience of implementing from specifications in a "clean-room" environment, and a set of postulated explanations on why a specification might be useful, focusing on its potential use to write renderers, David writes the statement that makes me most uneasy:

"It seems clear that preserving the specification for a format is unlikely to have any practical impact on the preservation of documents in that format."

The suggested scenarios re missing renderers are:

"1. None was ever written because no-one in the Open Source community thought the format worth writing a renderer for...2. None was ever written because the owner of the format never released adequate specifications, or used DRM techniques to prevent, third-party renderers being written...3. A open source renderer was written, but didn't work well enough because the released specifications weren't adequate, or because DRM techniques could not be sufficiently evaded or broken...4. An open source renderer was written but didn't work well enough because the open source community lacked programmers good enough to do the job given the specifications and access to working renderers...5. An open source renderer was written but in the interim was lost...6. An adequate open source renderer was written, but in the interim stopped working..."

Read David's post for the detail of his arguments. However, I'd just like to suggest a few reasons why preserving specifications might be useful:- First, if the specification is available, it is (comparatively) extraordinarily cheap to keep. If it even makes a tiny difference to those implementing renderers (including open source renderers), it will have been worth while.- Second, David's argument glosses over the highly variable value of information encoded in these formats. A digital object is (roughly) encrypted information; if no renderer exists but the encrypted information is extremely valuable for some particular purpose, the specification might be considered as a key to enable some information to be extracted.- Thirdly, David's argument assumes, I think, quite complex formats. Many science data formats are comparatively simple, but may be currently accessed with proprietary software. Having the specification in those cases may well prove useful (OK, I don't have evidence for this as yet, I'll work on it!).- Fourth, older formats are simpler, and it would be good to have the specifications in some cases, even to help create open source renderers (is that a re-statement of the first? Maybe).So here's an example to illustrate the last point. I have commented elsewhere that the only files on the disk of the Mac I use to write this that are inaccessible to me, are PowerPoint (version 4.0) files created in the 1990s on an earlier Mac. I noted a comment from David:"In my, admittedly limited, experience Open Office often does better than the current Microsoft Office at rendering really old documents."Great, I thought; perhaps Open Office can render my old PowerPoints! And even better, there's now a native implementation of Open Office 3.0 for the Mac. So let's install it (and not talk about how hard it was to persuade it to give back control of my MS Office documents to the original software!). Does it open my errant files? No!So I would like someone to instigate a legacy documents project in Open Office, and implement as many as possible of the important legacy office document file formats. I think that would be a major contribution to long term preservation. Would it be simplified by having specifications available? Surely, surely it must be! In fact David admits as much:

"Effort should be devoted instead to using the specifications, and the access to a working renderer, to create an open source renderer now."

Well, you surely can't use specifications unless they are accessible and have been preserved...However, I must stress that I agree with what I take to be David's significant point, re-stated here as: the best Representation Information supporting preservation of information encoded in document formats is Open Source software. So "national libraries should consider collecting and preserving open source repositories". Yes!