pdf

Scholarly HTML would be nice, but...

I'm quite interested in the idea of Scholarly HTML, as espoused in Pete Sefton's blog, and I've commented on some of Peter Murray Rust's hamburger PDF comments previously (although I do think a lot of people confuse wild PDF with well-made, should one say Scholarly PDF). I've always been slightly worried by one thing though.

Read more >

Semantically richer PDF?

PDF is very important for the academic world, being the document format of choice for most journal publishers. Not everyone is happy about that, partly because reading page-oriented PDF documents on screen (especially that expletive-deleted double-column layout) can be a nightmare, but also because PDF documents can be a bit of a semantic desert. Yes, you can include links in modern PDFs, and yes, you can include some document or section metadata. But tagging the human-readable text with machine-readable elements remains difficult.

Read more >

Survey on malformed PDFs?

A DCC-Associates member asks"Does anyone know there has been a study to estimate how many PDF documentsdo not comply with the PDF standards?"I've not heard of anything, nor can I find one with my best Google searches, but it's a particularly hard question to ask Google! So, if you know of anything that has happened or is in progress, please leave a comment here. Thanks,

Read more >

Science publishing, workflow, PDF and Text Mining

… or, A Semantic Web for science?It’s clear that the million or so scientific articles published each year contain lots of science. Most of that science is accessible to scientists in the relevant discipline. Some may be accessible to interested amateurs. Some may also be accessible (perhaps in a different sense) to robots that can extract science facts and data from articles, and record them in databases.

Read more >

The DCC is funded by

Joint Information Systems Committee