Semantically richer PDF?
6 April, 2009 | in Blogs
By: Chris Rusbridge
PDF is very important for the academic world, being the document format of choice for most journal publishers. Not everyone is happy about that, partly because reading page-oriented PDF documents on screen (especially that expletive-deleted double-column layout) can be a nightmare, but also because PDF documents can be a bit of a semantic desert. Yes, you can include links in modern PDFs, and yes, you can include some document or section metadata. But tagging the human-readable text with machine-readable elements remains difficult.
In XHTML there are various ways to do this, including microformats (see Wikipedia). For example, you can use the hcard microformat to encode identifying contact information about a person (confusingly hcard is based on the vcard standard). However, there are relatively few microformats agreed. For example, last time I checked, development of the hcite microformat for encoding citations appeared to be progressing rather slowly, and still some way from agreement.
The alternative, more general approach seems to be to use RDF; this is potentially much more useful for the wide range of vocabularies needed in scholarly documents. RDFa is a mechanism for including RDF in XHTML documents (W3C, 2008).
RDF has advantages in that it is semantically rich, precise, susceptible to reasoning, but syntax-free (or perhaps, realisable with a range of notations, cf N3 vs RDFa vs RDF/XML). With RDF you can distinguish “He” (the chemical element) from “he” (the pronoun), and associate the former with its standard identifier, chemical properties etc. For the citation example, the CLADDIER project made suggestions and gave examples, for example of encoding a citation in RDF (Matthews, Portwin, Jones, & Lawrence, 2007).
PDF can include XMP metadata, which is XML-encoded and based on RDF (Adobe, 2005). Job done? Unfortunately not yet, as far as I can see. XMP applies to metadata at the document or major component level. I don’t think it can easily apply to fine-grained elements of the text in the way I’ve been suggesting (in fact the specification says “In general, XMP is not designed to be used with very fine-grained subcomponents, such as words or characters”). Nevertheless, it does show that Adobe is sympathetic towards RDF.
Can we add RDF tagging associated with arbitrary strings in a PDF document in any other ways? It looks like the right place would be in PDF annotations; this is where links are encoded, along with other options like text callouts. I wonder if it is possible simply to insert some arbitrary RDF in a text annotation? This could look pretty ugly, but I think annotations can be set as hidden, and there may be an alternate text representation possible. It might be possible to devise an appropriate convention for a RDF annotation, or use the extensions/plugin mechanism that PDF allows. A disadvantage of this is that PDF/A (ISO, 2005) disallows extensions to PDF as defined in PDF Reference (Adobe, 2007), but PDF/A is important for long-term archiving (ie that such extensions are not compatible with long-term archiving). I don’t know whether we could persuade Adobe to add this to a later version of the standard. If something like this became useful and successful, time would be on our side!
What RDF syntax or notation should be used? To be honest, I have no idea; I would assume that something compatible with what’s used in XMP would be appropriate; at least the tools that create the PDF should be capable of handling it. However, this is less help in deciding than one might expect, as the XMP specification says “Any valid RDF shorthand may be used”. Nevertheless, in XMP RDF is embedded in XML, which would make both RDF/XML and RDFa possibilities.
So, we have a potential place to encode RDF, now we need a way to get it into the PDF, and then ways to process it when the PDF is read by tools rather than humans (ie text mining tools). In Chemistry, there are beginning to be options for the encoding. We assume that people do NOT author in PDF; they write using Word or OpenOffice (or perhaps LaTeX, but that’s another story).
Of relevance here is the ICE-TheOREM work between Peter Murray-Rust’s group at Cambridge, and Pete Sefton’s group at USQ; this approach is based on either MS Word or OpenOffice for the authors (of theses, in that particular project), and produces XHTML or PDF, so it looks like a good place to start. Peter MR is also beginning to talk about the Chem4Word project they have had with Microsoft, “an Add-In for Word2007 which provides semantic and ontological authoring for chemistry”. And the ChemSpider folk have ChemMantis, a “document markup system for Chemistry-related documents”. In each of these cases, the authors must have some method of indicating their semantic intentions, but in each case, that is the point of the tools. So there’s at least one field where some base semantic generation tools exist that could be extended.
PDFBox seems to be a common tool for processing PDFs once created; I know too little about it to know if it could easily be extended to handle RDF embedded in this way.
So I have two questions. First, is this bonkers? I’ve had some wrong ideas in this area before (eg I thought for a while that Tagged PDF might be a way to achieve this). My second question is: anyone interested in a rapid innovation project under the current JISC call, to prototype RDF in PDF files via annotations?
References:
Adobe. (2005). XMP Specification. San Jose.
Adobe. (2007). PDF Reference and related Documentation.
ISO. (2005). ISO 19005-1:2005 Document management — Electronic document file format for long-term preservation — Part 1: Use of PDF 1.4 (PDF/A-1).
Matthews, B., Portwin, K., Jones, C., & Lawrence, B. (2007). CLADDIER Project Report III: Recommendations for Data/Publication Linkage: STFC, Rutherford Appleton Laboratory.
W3C. (2008). RDFa Primer: Bridging the Human and Data Webs. Retrieved 6 April, 2009, from http://www.w3.org/TR/xhtml-rdfa-primer/
In XHTML there are various ways to do this, including microformats (see Wikipedia). For example, you can use the hcard microformat to encode identifying contact information about a person (confusingly hcard is based on the vcard standard). However, there are relatively few microformats agreed. For example, last time I checked, development of the hcite microformat for encoding citations appeared to be progressing rather slowly, and still some way from agreement.
The alternative, more general approach seems to be to use RDF; this is potentially much more useful for the wide range of vocabularies needed in scholarly documents. RDFa is a mechanism for including RDF in XHTML documents (W3C, 2008).
RDF has advantages in that it is semantically rich, precise, susceptible to reasoning, but syntax-free (or perhaps, realisable with a range of notations, cf N3 vs RDFa vs RDF/XML). With RDF you can distinguish “He” (the chemical element) from “he” (the pronoun), and associate the former with its standard identifier, chemical properties etc. For the citation example, the CLADDIER project made suggestions and gave examples, for example of encoding a citation in RDF (Matthews, Portwin, Jones, & Lawrence, 2007).
PDF can include XMP metadata, which is XML-encoded and based on RDF (Adobe, 2005). Job done? Unfortunately not yet, as far as I can see. XMP applies to metadata at the document or major component level. I don’t think it can easily apply to fine-grained elements of the text in the way I’ve been suggesting (in fact the specification says “In general, XMP is not designed to be used with very fine-grained subcomponents, such as words or characters”). Nevertheless, it does show that Adobe is sympathetic towards RDF.
Can we add RDF tagging associated with arbitrary strings in a PDF document in any other ways? It looks like the right place would be in PDF annotations; this is where links are encoded, along with other options like text callouts. I wonder if it is possible simply to insert some arbitrary RDF in a text annotation? This could look pretty ugly, but I think annotations can be set as hidden, and there may be an alternate text representation possible. It might be possible to devise an appropriate convention for a RDF annotation, or use the extensions/plugin mechanism that PDF allows. A disadvantage of this is that PDF/A (ISO, 2005) disallows extensions to PDF as defined in PDF Reference (Adobe, 2007), but PDF/A is important for long-term archiving (ie that such extensions are not compatible with long-term archiving). I don’t know whether we could persuade Adobe to add this to a later version of the standard. If something like this became useful and successful, time would be on our side!
What RDF syntax or notation should be used? To be honest, I have no idea; I would assume that something compatible with what’s used in XMP would be appropriate; at least the tools that create the PDF should be capable of handling it. However, this is less help in deciding than one might expect, as the XMP specification says “Any valid RDF shorthand may be used”. Nevertheless, in XMP RDF is embedded in XML, which would make both RDF/XML and RDFa possibilities.
So, we have a potential place to encode RDF, now we need a way to get it into the PDF, and then ways to process it when the PDF is read by tools rather than humans (ie text mining tools). In Chemistry, there are beginning to be options for the encoding. We assume that people do NOT author in PDF; they write using Word or OpenOffice (or perhaps LaTeX, but that’s another story).
Of relevance here is the ICE-TheOREM work between Peter Murray-Rust’s group at Cambridge, and Pete Sefton’s group at USQ; this approach is based on either MS Word or OpenOffice for the authors (of theses, in that particular project), and produces XHTML or PDF, so it looks like a good place to start. Peter MR is also beginning to talk about the Chem4Word project they have had with Microsoft, “an Add-In for Word2007 which provides semantic and ontological authoring for chemistry”. And the ChemSpider folk have ChemMantis, a “document markup system for Chemistry-related documents”. In each of these cases, the authors must have some method of indicating their semantic intentions, but in each case, that is the point of the tools. So there’s at least one field where some base semantic generation tools exist that could be extended.
PDFBox seems to be a common tool for processing PDFs once created; I know too little about it to know if it could easily be extended to handle RDF embedded in this way.
So I have two questions. First, is this bonkers? I’ve had some wrong ideas in this area before (eg I thought for a while that Tagged PDF might be a way to achieve this). My second question is: anyone interested in a rapid innovation project under the current JISC call, to prototype RDF in PDF files via annotations?
References:
Adobe. (2005). XMP Specification. San Jose.
Adobe. (2007). PDF Reference and related Documentation.
ISO. (2005). ISO 19005-1:2005 Document management — Electronic document file format for long-term preservation — Part 1: Use of PDF 1.4 (PDF/A-1).
Matthews, B., Portwin, K., Jones, C., & Lawrence, B. (2007). CLADDIER Project Report III: Recommendations for Data/Publication Linkage: STFC, Rutherford Appleton Laboratory.
W3C. (2008). RDFa Primer: Bridging the Human and Data Webs. Retrieved 6 April, 2009, from http://www.w3.org/TR/xhtml-rdfa-primer/
- Home
- Digital Curation
- About Us
- News
- Events
- Resources
- Briefing Papers
- Introduction to Curation
- Annotation
- Appraisal and Selection
- Curating emails
- Curating e-science data
- Curating geospatial data
- Data accreditation
- Data Citation and Linking
- Data protection
- Database archiving
- Digital repositories
- Freedom of Information
- Genre classification
- Interoperability
- Persistent Identifiers
- Trust through self audit
- Using OAIS for curation
- Web 2.0
- What is digital curation?
- Legal Watch Papers
- Standards Watch Papers
- Technology Watch Papers
- Making the Case for RDM
- Introduction to Curation
- How-to Guides
- Curation Reference Manual
- Peer review
- Editorial board
- Completed chapters
- Appraisal and Selection
- Archival Metadata
- Archiving Web Resources
- Curating Emails
- File Formats
- Investment in an Intangible Asset
- Learning Object Metadata
- Metadata
- Ontologies
- Open Source for Digital Curation
- Preservation Metadata
- Preservation Strategies
- Principles for Enabling Access to Engineering Design Information Through Life
- Chapters in production
- Curation Lifecycle Model
- Policy and legal
- Data Management Plans
- Case studies
- Tools and applications
- Standards
- Publications
- External resources
- Roles
- Curation journals
- Informatics research
- Briefing Papers
- Training
- Projects
- Community
- Contact Us
Curation training
Curation training
Looking to develop your data management and curation skills? Learning is easy when you sign up for any of our introductions to digital curation, which cover all those activities you need to consider when planning and implementing new projects.

Comments
Possibly related is GeoPDF, a way of embedding pro...
Chris, I think that this is an idea worth explorin...
If we can capture semantics like that then yes, they probably could be embedded in PDF later using something like annotations. One simple thing we do with our ICE project is to turn links into endnotes for the PDF - so that might be another human-readable approach to express semantics.
Of course none of this is of any use and will therefore not be adopted unless there are services which make use the semantics, such as repositories that automatically understand the metadata. Journal or thesis submission guidelines might also help as the purpose is then clear - do it our way or no publication.
@Peter_Sefton, I did mention a few chemical semant...
If we all agreed with the chicken and egg worry, it would be hard to make progress. I do think that RDF is important, and semantically rich journal articles are important, and that if we can create semantically rich journal articles based on RDF we will have the prospect of much better science in the future!
Great article!Just a couple of points...First, the...
Just a couple of points...
First, there are ways (in the existing PDF file format) to incorporate arbitrary information (be it XML/RDF/XMP or other formats) into a PDF and associate it with specific content elements. It's called "Tagged PDF", has been part of PDF since 1.4, is fully compatible with PDF/A-1 and is how things such as reflow and accessibility work in PDF today. However, there aren't any standards for what to put there (beyond the structure elements defined in ISO 32000-1, aka ISO PDF) that it can be used outside of a closed workflow.
You'll be happy to know that I've been working on exactly this problem recently - trying to standardize more semantic information in PDF at more fine grained levels. I have a proposal pending for the next revision of ISO 32000 that will help.
Finally, I'd like to correct something you said which is "A disadvantage of this is that PDF/A (ISO, 2005) disallows extensions to PDF as defined in PDF Reference (Adobe, 2007)". This is not actually true. PDF/A-1 prohibits SPECIFIC THINGS - but if something isn't specifically prohibited, then it's permitted. HOWEVER, a "conforming reader" is not required to process any of it.
Leonard Rosenthol
PDF Standards Architect, Adobe Systems
ISO Project Leader, ISO 19005 (PDF/A)
@Leonard, thanks for your comment. Sorry if I misr...
I've been interested in Tagged PDFs for a while, although they appear not to be widely used, and I got the impression, probably wrongly, that Adobe had decided to do accessibility in different ways. From what I remember, it looked like it made sense to Tag at the section/subsection level, but almost certainly not at the level of a string that one wishes to markup as an object such as a chemical compound, chemical reaction, etc.
So I am extremely pleased that Adobe has a proposal for more fine-grained semantic information for the main ISI 32000 standard! Let's hope we can get something like that widely implemented, as it would surely complement and encourage semantic authoring tools that are beginning to emerge, and will be important in the world of research. I even started to wonder about GRDDL-for-PDF!