Specifications again
6 January, 2009
The previous post was a summary with relatively little comment from me. I really liked David Rosenthal's related blog post, but I feel I do need to make some comments. I'm not sure this isn't yet another case of furiously agreeing!
Near the end of his post, following extensive argument based partly on his own experience of implementing from specifications in a "clean-room" environment, and a set of postulated explanations on why a specification might be useful, focusing on its potential use to write renderers, David writes the statement that makes me most uneasy:
"It seems clear that preserving the specification for a format is unlikely to have any practical impact on the preservation of documents in that format."
The suggested scenarios re missing renderers are:
"1. None was ever written because no-one in the Open Source community thought the format worth writing a renderer for...
2. None was ever written because the owner of the format never released adequate specifications, or used DRM techniques to prevent, third-party renderers being written...
3. A open source renderer was written, but didn't work well enough because the released specifications weren't adequate, or because DRM techniques could not be sufficiently evaded or broken...
4. An open source renderer was written but didn't work well enough because the open source community lacked programmers good enough to do the job given the specifications and access to working renderers...
5. An open source renderer was written but in the interim was lost...
6. An adequate open source renderer was written, but in the interim stopped working..."
Read David's post for the detail of his arguments. However, I'd just like to suggest a few reasons why preserving specifications might be useful:
- First, if the specification is available, it is (comparatively) extraordinarily cheap to keep. If it even makes a tiny difference to those implementing renderers (including open source renderers), it will have been worth while.
- Second, David's argument glosses over the highly variable value of information encoded in these formats. A digital object is (roughly) encrypted information; if no renderer exists but the encrypted information is extremely valuable for some particular purpose, the specification might be considered as a key to enable some information to be extracted.
- Third, David's argument assumes, I think, quite complex formats. Many science data formats are comparatively simple, but may currently be accessible only with proprietary software. Having the specification in those cases may well prove useful (OK, I don't have evidence for this as yet, I'll work on it!).
- Fourth, older formats are simpler, and it would be good to have the specifications in some cases, even to help create open source renderers (is that a re-statement of the first? Maybe).
So here's an example to illustrate the last point. I have commented elsewhere that the only files on the disk of the Mac I use to write this that are inaccessible to me are PowerPoint (version 4.0) files created in the 1990s on an earlier Mac.
I noted a comment from David:
"In my, admittedly limited, experience Open Office often does better than the current Microsoft Office at rendering really old documents."
Great, I thought; perhaps Open Office can render my old PowerPoints! And even better, there's now a native implementation of Open Office 3.0 for the Mac. So let's install it (and not talk about how hard it was to persuade it to give back control of my MS Office documents to the original software!). Does it open my errant files? No!

So I would like someone to instigate a legacy documents project in Open Office, and implement as many as possible of the important legacy office document file formats. I think that would be a major contribution to long term preservation. Would it be simplified by having specifications available? Surely, surely it must be! In fact David admits as much:
"Effort should be devoted instead to using the specifications, and the access to a working renderer, to create an open source renderer now."
Well, you surely can't use specifications unless they are accessible and have been preserved...
However, I must stress that I agree with what I take to be David's significant point, re-stated here as: the best Representation Information supporting preservation of information encoded in document formats is Open Source software. So "national libraries should consider collecting and preserving open source repositories". Yes!

Comments
I discuss this post in a comment on my original po...
I agree with you on this one, Chris, and would add...
Having a working renderer allows you to do what the renderer will support, but if I want to do something different with the data in the future, I need to know how the format is really put together. I can decipher that in time using an open-source renderer (assuming the renderer supports all features of the format), but having the specification handy makes the job much easier. As you point out, the relative cost of keeping a specification around, weighed against the potential benefits to someone who needs to write new software to work with a format, seems to me to argue for keeping both specifications and open-source renderers around when they're available.
I'm also not as sanguine as David on the longevity of open source support for outdated formats. I think we should not confuse the utility of open source-supported formats for preservation with the support of the open source community for digital preservation as an activity. I find it all too easy to imagine scenarios in which support for older formats drops out of an open-source product over the long term, and even easier to imagine ones in which one open source product which supports an older format is abandoned by the open source community in favor of a different, newer one which lacks that support.
I agree with David on the value of keeping source code for renderers around. Having code for working with a format ready at hand would be invaluable for anyone having to code a renderer in the future. But if I have to bring a renderer for an older format to life at some point in the future, I'd rather have the code and the spec. sitting in front of me when I do it.
And I don't believe in software as a specification...
post of my own :-)
I agree with everyone on this thread so far, assum...
Default emphasis on primary value--i.e., the reason a resource was created in the first place--versus secondary value--i.e., the reason the resource (turns out to be valuable so) gets re-used, seems often to be quite different for those stewarding digital science data vs. those stewarding digital cultural materials. And when manifest as strong unspoken assumptions, this default difference can make these conversations treacherous, so we should attempt to surface it more explicitly. Also complicating matters are assumptions around scope; if a stewardship organization assumes that external forces will effectively preserve primary use, then their emphasis will understandably be on secondary use. However, if we take an overarching systemic view, I suspect most stewards would answer yes to the following question:
Regardless of how our different perspectives may affect our emphasis on secondary use, is it not true that our overall first priority should be to maintain access to the primary value, since it was recently, demonstrably, valuable? If so, then while recognizing the value in both, we should prioritize our investments in maintaining rendering tools over our investments in documenting format specs.
Why? Because (unlike with future secondary value) we can identify the primary value latent in a digital resource: it is realized when the resource is consumed as originally intended--whether by human senses or by some algorithmic process. Or in perhaps more applicable wording, the primary value is rendered in the intended primary experience. Rendering tools for this primary experience specifically and directly support access to this primary value in a way that format specifications do not, as David logically demonstrates.
It may indeed make economic sense to make the investment in obtaining and storing format specs when the required investment is relatively small, when it is the only available course of action, or for other equally logical reasons. However, we should be willing to entertain the probability that--from a systemic, aggregate perspective--rendering tools may prove consistently more valuable than format specs, and adjust our policies accordingly.
Very nice post.But I can't belive the Software spe...