Survey on malformed PDFs?

19 December, 2008
A DCC-Associates member asks:
"Does anyone know whether there has been a study to estimate how many PDF documents do not comply with the PDF standards?"
I've not heard of anything, nor can I find anything with my best Google searches, but it's a particularly hard question to ask Google! So, if you know of any study that has been done or is in progress, please leave a comment here. Thanks.

hi chris,

i'm working on a paper that could indirectly answer your question. it's a study of the document metadata of popular filetypes 'in the wild', and examines the frequency, entropy, meaning, and format of metadata fields. (this is in contrast to previous work which has mostly just looked at what the spec says, or at one or two documents, but not the statistical behaviour across many thousand documents). i currently have data on xls, ppt, doc, and pdf files, but have only done the formal analysis on the office documents.

the data could easily be used to examine adherence to the PDF spec, and i'd be happy to share it. i also will likely do the analysis myself for the paper. however, this only addresses metadata, and not other ways in which a document can be non-compliant!

feel free to email me at jessy dot cowansharp at gmail dot com.

If it helps, Adobe Acrobat 9 Professional comes with a tool that can be used to check for compliance with the PDF specification. It's part of the Preflight feature.
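
For anyone wanting a rough first pass over a large collection before reaching for a full validator, here is a minimal sketch in Python. It only looks for three structural markers the PDF format uses (the `%PDF-` header, the `startxref` keyword and the `%%EOF` marker). The function names and the 1024-byte window are assumptions made for this illustration; this is not how Preflight works, and passing these checks does not mean a file complies with the specification.

```python
from pathlib import Path

WINDOW = 1024  # bytes inspected at the end of the file; an arbitrary choice for this sketch


def basic_pdf_checks(data: bytes) -> dict:
    """Run a few crude structural checks on raw PDF bytes.

    Failing a check suggests the file is malformed; passing all of them
    says nothing about full compliance with the PDF specification.
    """
    tail = data[-WINDOW:]
    return {
        "header_at_start": data.startswith(b"%PDF-"),
        "has_startxref": b"startxref" in tail,
        "has_eof_marker": b"%%EOF" in tail,
    }


def count_suspect_files(paths) -> tuple:
    """Return (suspect, total) for an iterable of file paths."""
    suspect = 0
    total = 0
    for path in paths:
        total += 1
        checks = basic_pdf_checks(Path(path).read_bytes())
        if not all(checks.values()):
            suspect += 1
    return suspect, total
```

Running `count_suspect_files` over a directory listing (for example `Path("corpus").glob("*.pdf")`, where `corpus` is a hypothetical folder) would give a lower bound on the number of malformed files, since many kinds of non-compliance leave these markers intact.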