IDCC11 Session A3: Formats

13 December, 2011

Blog preservation was the main concern of Yunhyong Kim, of the University of Glasgow and the BlogForever FP7 project. As everyone knows, timing is everything in the blogosphere. However, the time when an average blog consisted of an occasionally updated 'dear diary' in a single HTML page has long passed. Today’s blogs tend to be rather complex and dynamic objects, whose look and feel is often platform-dependent. Capturing all the necessary elements and their changes over time (provenance) is difficult. It is a challenge to keep things together when, in the web environment, they may not stay together for very long. And, as Yunhyong noted, if you don’t keep them together you can’t render the page.

Yunhyong reported on her work (with Seamus Ross) to evaluate which current preservation formats best fit this kind of content. While these are reasonably well established for structured texts, audio and video, there are few for aggregates of these constituent parts. One source of trouble is the range of dependencies that are normally well hidden from the blog reader: varying permission levels associated with components, file modification dates, user names, lists of deleted files, process logs, trails of programs run, and so on. These are normally retained on system disks, but are of course disaggregated from the published object.

To fulfil the needs of a good format, Yunhyong identified a range of characteristics: completeness, recoverability, in-built validation, scalability, transparency, and flexibility in handling metadata and data. Considering various types of format, she described a comparative evaluation of three: tar, WARC, and AFF (Advanced Forensic Format). The last of these she found the most robust. AFF has the distinct advantage of allowing data-mining without decompressing or disturbing the content. One disadvantage is that it takes an entire disk rather than a file as its input. Yunhyong worked around this by conceptualising the archive as a collection of virtual disk images, each keeping together the dynamic blog or website harvested. This might, I suppose, also work as an approach to self-archiving complex web sites with their constituent files, their attributes, and any server-side components needed to render them as intended.

Scientific data formats, and a different class of problem, were discussed next by Chris Frisz of Indiana University. Migrating data from one format to another calls for close attention to the way that different software applications, and versions of them, render the results. Consider the once-ubiquitous and now long obsolete Lotus 1-2-3 spreadsheet: experiments by Chris and colleagues showed that migration to Excel is almost risk-free. Existing conversion tools can be used safely, except for some formulae where 1-2-3 and Excel perform certain operations in different orders. Identifying such issues allows the archivist to take special precautions for affected files.
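For the curious, here is a toy sketch in Python of the kind of evaluation-order mismatch such a risk analysis looks for. It is emphatically not the actual Lotus 1-2-3 or Excel rules, just an illustration of how the same formula evaluated under two different precedence conventions can silently produce different values.

```python
# Toy illustration (NOT the real Lotus 1-2-3 or Excel semantics) of how a
# difference in evaluation order can silently change a spreadsheet value.
# Convention A: unary minus binds tighter than exponentiation, so -x^2 means (-x)**2.
# Convention B: exponentiation binds tighter, so -x^2 means -(x**2).

def convention_a(x):
    return (-x) ** 2

def convention_b(x):
    return -(x ** 2)

for x in (0, 2, 3):
    a, b = convention_a(x), convention_b(x)
    status = "RISK: results differ" if a != b else "ok"
    print(f"-{x}^2  ->  A: {a}, B: {b}   [{status}]")
```

A check of this kind flags only those formulae whose results actually differ, so the archivist can concentrate on the affected files.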

A similar approach was taken to files using the array data format CDF, and their conversion to NetCDF. The result can be metadata loss. NetCDF didn’t inherit the ‘epoch’ data type commonly used for high-resolution time data. It also introduced descriptive ‘named dimensions’, usable for data access but not present in CDF. Both potentially lead to mismatches. Issues with epoch data can be readily assessed using existing conversion tools, but named dimensions need to be handled with care if converting back to CDF. Chris and colleagues also applied this risk assessment approach to migration from HDF (Hierarchical Data Format), commonly used for organising large amounts of numerical data. The latest version (HDF5) is not backwards-compatible, the good news being that no significant risks were found for archival purposes. The findings support the use of simple and fast tools for migration risk analysis. Predictably, the Indiana University group have found that open formats (e.g. CDF, NetCDF, HDF) are easier to analyse than proprietary ones (i.e. Lotus 1-2-3).
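As a rough sketch of what checking for these mismatches might look like, the snippet below uses the netCDF4 Python library to inspect a converted file (the filename converted.nc is made up) for the two risk areas just described: time variables that arrive as plain numbers with a ‘units’ attribute rather than a dedicated epoch type, and named dimensions that would need special handling on any conversion back to CDF. This is my own illustration, not the Indiana University tooling.

```python
from netCDF4 import Dataset  # netCDF4 Python bindings

# Inspect a NetCDF file (hypothetically the output of a CDF-to-NetCDF
# conversion) for the two risk areas discussed above.
with Dataset("converted.nc", "r") as ds:   # filename is illustrative
    # Named dimensions exist on the NetCDF side only; they would need
    # special handling on any conversion back to CDF.
    for name, dim in ds.dimensions.items():
        print(f"dimension {name}: size {len(dim)}")

    # CDF 'epoch' values typically end up as plain numeric variables
    # with a 'units' attribute rather than a dedicated epoch type.
    for name, var in ds.variables.items():
        units = getattr(var, "units", None)
        print(f"variable {name}: dtype {var.dtype}, units {units}")
```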

Ross Spencer from the National Archives in the UK (TNA) gave the third presentation. Ross set out a case for building a test corpus of digital objects that would support evaluation of file format identification tools and signatures. TNA are responsible for PRONOM, the file format registry, and the leading format identification tool DROID. The work he presented was carried out by Andrew Fetherston and Tim Gollins at TNA, and lays some groundwork for a test corpus. The main case for a corpus is to help test file identification tools, so as to verify their results in the same transparent manner that, for example, the US National Institute of Standards and Technology (NIST) established for text retrieval through its TREC evaluation conferences.

A corpus for this domain needs to include a representative collection of digital objects whose provenance can be well established. One example Ross gave was the Waterloo Repertoire, which comprises image files that can be used for a quantitative comparison of image compression programs. Community involvement is needed to establish selection criteria for the contents, to identify their provenance, and to agree the metadata needed to record it.

To be taken up by the community, a test corpus will need to be persistent, held in a trusted location, and under accurate version control. To be usable, the organisation of the test objects will need to be agreed, e.g. by format, file type, submitter and so on. Other criteria Ross highlighted were unsurprising and within the scope of ‘trusted’ repositories: rights clearance, integrity controls, processes for secure storage and backup, and access controls.

The aim of such collections is of course to establish metrics that can be used to test tools and assess how well they perform on the corpus. Ross outlined work to adapt the precision and recall measures established in the text retrieval domain. So, for example, a tool that successfully identified all the JPEG files in the collection would have a recall value of 1.0, even if it also produced a number of ‘false positives’ by counting TIFF files as JPEGs. On the other hand, a tool that only ever identified genuine JPEG files as JPEGs would have a precision score of 1.0, even if it produced some ‘false negatives’ by classifying some JPEGs as TIFFs.
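A minimal Python sketch of that calculation, with made-up file names and a single target format, shows how the two hypothetical tools would score:

```python
# Precision and recall for a single target format, as described above.
# File names and identification results are invented for illustration.
def precision_recall(predicted, truth, target="image/jpeg"):
    tp = sum(1 for f, p in predicted.items() if p == target and truth[f] == target)
    fp = sum(1 for f, p in predicted.items() if p == target and truth[f] != target)
    fn = sum(1 for f, p in predicted.items() if p != target and truth[f] == target)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

truth = {"a.jpg": "image/jpeg", "b.jpg": "image/jpeg", "c.tif": "image/tiff"}

# Tool 1 finds every JPEG but also calls the TIFF a JPEG: recall 1.0, precision 0.67
print(precision_recall({"a.jpg": "image/jpeg", "b.jpg": "image/jpeg",
                        "c.tif": "image/jpeg"}, truth))

# Tool 2 only says JPEG when it is right, but misses one: precision 1.0, recall 0.5
print(precision_recall({"a.jpg": "image/jpeg", "b.jpg": "image/tiff",
                        "c.tif": "image/tiff"}, truth))
```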

Some interesting discussion followed this talk. A key point that I think applied to much of the session was that we are missing a trick by framing format issues as a matter of ‘preservation’ rather than as something of more widespread concern, such as ‘e-discovery’. Relatedly, a comment on the analogy with TREC was that TREC lost much of its influence when Google and other commercial players came to dominate the text retrieval field by adopting and improving on techniques once openly tested in that forum. Putting these comments together suggests that the topics of this session ought to be seen as rather more glorious than they perhaps are.