Strand B1 research papers at IDCC4

9 December, 2008
In the morning parallel session B at the International Digital Curation Conference, the ever-interesting Jane Hunter from the University of Queensland opened by speaking about her Aus-e-Lit project (linked to AustLit). The project is based on FRBR, and offers federated search interfaces to related distributed databases. It also offers tagging and annotation services, and compound object authoring tools… based on an OAI-ORE based tool called Literature Online Research Environment (LORE). These compound objects, published to the semantic web as RDF named graphs, can express complex relationships between authors and works that represent parts of literary history, tracking the lineage of derivative works and ideas, scholarly editions and research trails; they can also be used as teaching and learning objects. There were of course problems: the need for unique identifiers for non-information resources; the use of local identifiers; the use of non-persistent identifiers; community concern about ontology complexity… but also a desire for more complexity (wasn’t it ever thus)!
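
As a rough illustration of a compound object published as a named graph (not taken from the talk, and not LORE's actual code), the sketch below builds an OAI-ORE style resource map as plain triples and writes them out as N-Quads. The ORE, RDF and Dublin Core Terms URIs are real vocabulary; every example.org URI and both function names are assumptions made up for the example.

```python
# Toy model of an OAI-ORE compound object as a named graph.
# Real vocabulary: ORE, RDF type, DC Terms. Everything under example.org is hypothetical.
ORE = "http://www.openarchives.org/ore/terms/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
DCTERMS = "http://purl.org/dc/terms/"


def build_resource_map(rem_uri, aggregated_uris, creator_uri):
    """Return (subject, predicate, object) triples describing one aggregation."""
    aggregation = rem_uri + "#aggregation"
    triples = [
        (rem_uri, RDF_TYPE, ORE + "ResourceMap"),
        (rem_uri, ORE + "describes", aggregation),
        (rem_uri, DCTERMS + "creator", creator_uri),
        (aggregation, RDF_TYPE, ORE + "Aggregation"),
    ]
    # One ore:aggregates triple per resource in the compound object.
    triples += [(aggregation, ORE + "aggregates", uri) for uri in aggregated_uris]
    return triples


def to_nquads(triples, graph_uri):
    """Serialise URI-only triples as N-Quads, using graph_uri as the graph name."""
    return "\n".join(f"<{s}> <{p}> <{o}> <{graph_uri}> ." for s, p, o in triples)


rem = "http://example.org/rem/1"
work = "http://example.org/works/novel"
edition = "http://example.org/works/novel-scholarly-edition"

triples = build_resource_map(rem, [work, edition], "http://example.org/people/editor")
# Lineage between works: the scholarly edition is a version of the original work.
triples.append((edition, DCTERMS + "isVersionOf", work))

nquads = to_nquads(triples, rem)
print(nquads)
```

Using the resource map URI as the graph name keeps the statements about one compound object together, so they can be attributed, shared or retracted as a unit.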

Hunter was followed by Kai Naumann, from the State Archive of Baden-Württemberg. They have many issues, starting with the very wide range of objects, from paper and microfilm etc. to, more recently, digital objects, of which a large number are now being ingested. They need to find resources with finding aids regardless of media and object type, and they need to maintain and assure authenticity and integrity. They chose to use the PREMIS approach to representations (which reminded me again how annoying the clash of PREMIS terminology with that of OAIS is!). It seemed to be a very sophisticated approach. Naumann suggested that preservation metadata models should balance 3 aims: instant availability, easy ingest and long-term understandability. Where use is low (as in real archives) and the objects are heterogeneous, you need to design relatively simple metadata sets. It’s important to maintain relational integrity between the content and metadata (I’m not sure that has a strict, relational database meaning!). Structural relations between content units can be critical for authenticity.
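
One common way of tying content to its metadata is to record a message digest at ingest and re-check it later; PREMIS has fixity information for this purpose. The sketch below is only an illustration of that general idea, not a description of what the archive in Baden-Württemberg actually does. The record layout, the field names and the identifier are assumptions, and are not the PREMIS schema.

```python
import hashlib


def make_record(identifier, content):
    """Build a deliberately simple metadata record for one content file (hypothetical layout)."""
    return {
        "identifier": identifier,
        "digest_algorithm": "SHA-256",
        "digest": hashlib.sha256(content).hexdigest(),
        "size": len(content),
    }


def verify(record, content):
    """Return True if the content still matches the size and digest held in the record."""
    return (
        len(content) == record["size"]
        and hashlib.sha256(content).hexdigest() == record["digest"]
    )


original = b"Minutes of the council meeting, 1952"
record = make_record("obj-0001", original)

print(verify(record, original))         # content unchanged since ingest
print(verify(record, original + b" "))  # content altered after ingest
```

A record this small is cheap to create at ingest, which fits the stated aim of keeping metadata sets relatively simple.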

Jim Downing of Cambridge came next, speaking for Pete Sefton of USQ (although he admitted he was a pale, short, damp substitute for the real thing!). The topic was embedding metadata & semantics in documents; they work together on the JISC-funded ICE-Theorem project. We need semantically rich documents for science, but most documents start in a (business-oriented) word processor. Semantically rich documents enable automation, reduce information loss, and allow better discovery and presentation. Unfortunately, users (and, I think, authors) don’t really distinguish between metadata, semantics and data. Document and metadata creation are not really separable. Documents often have multiple, widely separated authors, particularly in the sciences. Their approach is to make it work with MS Word & OpenOffice Writer, & use Sefton’s ICE system. They need encodable semantics that are round-trip survivable: if a document is created by a rich tool, then processed with a vanilla tool, then checked again with the rich tool, the semantics should still be present, for real interoperability. Things they thought of but didn’t do: MS Word Smart Tags, MS Word foreign namespace XML, ODF embedded semantics, anything that breaks WYSIWYG (such as wiki markup in the document), new encoding standards. What does work? An approach based on microformats (I thought simply labelling it as microformats was a bit of an over-statement, given the apparent glacial pace of “official” microformat standardisation!). They will overload semantics into existing capabilities, e.g. tables, styles, links, frames, bookmarks, fields (some still fragile).
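
The round-trip requirement can be sketched as a small test. This is a toy model under stated assumptions, not the ICE implementation: a document is a list of paragraphs, the "vanilla tool" keeps only text and style names, and semantics are overloaded into style names using a made-up `sem-` prefix. All class, function and style names here are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Paragraph:
    text: str
    style: str                                      # ordinary style name, kept by any word processor
    extensions: dict = field(default_factory=dict)  # tool-specific extras a vanilla tool may drop


def vanilla_round_trip(paragraphs):
    """Simulate a plain tool: it preserves text and style names and nothing else."""
    return [Paragraph(p.text, p.style) for p in paragraphs]


def extract_semantics(paragraphs, prefix="sem-"):
    """Read (label, text) pairs back out of style names that use the hypothetical prefix."""
    return [
        (p.style[len(prefix):], p.text)
        for p in paragraphs
        if p.style.startswith(prefix)
    ]


doc = [
    Paragraph("Crystal structure of compound 3", "sem-title"),
    Paragraph("Jane Example", "sem-author", {"smart_tag": "person"}),
    Paragraph("Body text.", "Normal"),
]

before = extract_semantics(doc)
after = extract_semantics(vanilla_round_trip(doc))
print(before == after)  # semantics carried in style names survive the round trip
```

The tool-specific `extensions` entry is lost in the round trip while the style-based labels come through intact, which is the argument for overloading existing capabilities rather than relying on tool-specific markup.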

Paul McKeown from EMC Corp gave a paper written by Stephen Todd on XAM, a standard newly published by SNIA. It originated from the SNIA 100 year archive survey report. Its themes are separating logical and physical data formats, and the need over time to migrate content to new archives. It offers location-independent object naming (XUID), with rich metadata, and a pluggable architecture for storage system support. The application tool talks to the XAM layer, which abstracts away the vendor implementations. There are 3 primary objects: XAM library (object factory?), XSet, XStreams, etc. It has a retention model. (Sorry Paul, Stephen, I guess I was getting tired at this point!)
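
To make the layering a little more concrete, here is a toy model of the ideas as noted above: an application talks to a library object, which hands out location-independent identifiers and delegates storage to a pluggable backend, with a simple retention check on delete. This is emphatically not the XAM API. Every class, method and field name is an assumption invented for the sketch, and the identifier is just a random UUID standing in for a XUID.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class XSet:
    """Toy stand-in for an XSet: an identifier, metadata fields and named byte streams."""
    xuid: str
    fields: dict = field(default_factory=dict)
    streams: dict = field(default_factory=dict)
    retain_until: Optional[datetime] = None


class InMemoryBackend:
    """Hypothetical vendor storage system; any object with put/get/delete would do."""

    def __init__(self):
        self._objects = {}

    def put(self, xset):
        self._objects[xset.xuid] = xset

    def get(self, xuid):
        return self._objects[xuid]

    def delete(self, xuid):
        del self._objects[xuid]


class Library:
    """Toy abstraction layer: the application sees identifiers, never storage locations."""

    def __init__(self, backend):
        self.backend = backend

    def create_xset(self, fields, streams, retain_for=None, now=None):
        now = now or datetime.now(timezone.utc)
        retain_until = now + retain_for if retain_for else None
        xset = XSet(uuid.uuid4().hex, dict(fields), dict(streams), retain_until)
        self.backend.put(xset)
        return xset.xuid

    def open_xset(self, xuid):
        return self.backend.get(xuid)

    def delete_xset(self, xuid, now=None):
        now = now or datetime.now(timezone.utc)
        xset = self.backend.get(xuid)
        # Retention: refuse deletion until the retention period has passed.
        if xset.retain_until and now < xset.retain_until:
            raise PermissionError("XSet is still under retention")
        self.backend.delete(xuid)


library = Library(InMemoryBackend())
created = datetime(2008, 12, 9, tzinfo=timezone.utc)
xuid = library.create_xset(
    {"title": "Survey report"},
    {"content": b"report bytes"},
    retain_for=timedelta(days=365),
    now=created,
)
print(library.open_xset(xuid).fields["title"])
```

Because the application only ever holds the identifier, swapping `InMemoryBackend` for another backend (or migrating content between them) would not change the calling code.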