A while back, I posted a question on superuser.com about a PDF issue that’s causing problems in JHOVE. So far it hasn’t gotten any answers, so I’m signal-boosting my own question here. Here’s what I asked:
The JHOVE parser for PDF, which I maintain, will sometimes find a non-dictionary object in a PDF’s Annots array. According to section 8.4.1 of the PDF spec, the Annots array holds “an array of annotation dictionaries.” In the case that I’m looking at right now, there’s a keyword of “Annot” instead of a dictionary. Is this an invalid PDF file, or is there a subtlety in the spec which I’ve overlooked?
Answering on superuser.com is best, so other people can see the answer, but if you prefer to answer here, I’ll post or summarize any useful response, with attribution, as an answer over there.
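To make the problem concrete, here’s roughly what the two cases look like in PDF source. This is a hypothetical reconstruction, not the actual file’s contents; the object numbers and annotation details are made up:

```
% What section 8.4.1 calls for: Annots is an array of
% annotation dictionaries (usually indirect references to them).
/Annots [ 12 0 R ]

12 0 obj
<< /Type /Annot
   /Subtype /Text
   /Rect [ 100 100 150 120 ]
   /Contents (A note)
>>
endobj

% What the problem file has instead: the bare keyword Annot
% sitting in the array where a dictionary should be.
/Annots [ Annot ]
```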
Unfortunately I can’t help, so disregard the rest of this comment if you were hoping for an answer. I do, however, find these details interesting, as they highlight fundamental questions about the goals of digital preservation.
Standards will always be a matter of interpretation to some degree. JHOVE, Acrobat Reader, and Google Chrome, for example, all interpret PDF files in order to do something with them, and they may well (and probably do) interpret them slightly differently.
Your question’s lack of answers may be a result of there being no one right answer, as the standard may be ambiguous*. But it does raise the question: what does “valid” really mean? And does it matter (from a preservation perspective) whether someone configured their PDF creation software so it put “Annot” in a PDF’s Annots array? If so, why? Why does it matter if a file is “invalid” in this way?
I’d guess the answer is something like “well, if you are going to normalize or migrate the files, then you need to know how they are structured”**. Again I’d ask: why? And perhaps I’d be told, “So you can make sure your migration tool supports migrating that content, and test to make sure it is preserved post-migration.”
All I can say is that those migration tools are going to have to be pretty comprehensive, and so are the new rendering tools (to be used to render the migrated/normalized content). I suspect that is going to be unrealistic as an approach for all types of files. And if so, then content will be lost. But then again, who really cares about a PDF file’s Annots array anyway?
*It may just be that no expert users have noticed it; that does seem more likely.
**You may just want to know so you can reject invalid files. If so, ignore this comment; it’s irrelevant (though again, those particular files are then being “lost”).
The point is more that future tools for reading PDF files may rely on the spec; files which current PDF readers treat forgivingly may cause future readers to crash. It’s valuable to know if a file is doing something that’s based on Adobe tradition (or whatever) rather than on a strict reading of the spec.
This raises a different aspect of the question “What does ‘valid’ really mean?”, though. Taking only the spec as a criterion is a fundamentalist view: everything must be judged by the sacred text. Fundamentalism does have the advantage of being unambiguous (or at least no more ambiguous than the sacred text). A reformist view takes traditions into account, but then you have to decide when a variant is well-established enough to count as a tradition.

JHOVE takes a mostly fundamentalist approach to formats. It’s not the approach I would have taken if I’d designed it from the beginning, but it has its uses, and I’d create confusion by changing it. By flagging this deviation from the spec, JHOVE makes it evident that there is an issue, and people can decide what to do about it. It would be better if it could take a more flexible approach, controlled by a configuration file, but that’s a bigger project than I want to undertake without someone paying me to do it.
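For the curious, here’s a minimal sketch in Java (JHOVE’s implementation language, though this is not JHOVE’s actual code or API) of the difference between the two approaches: a strict check that rejects any non-dictionary element in an Annots array, and a hypothetical policy-driven check that consults a configuration flag before deciding whether a known deviation is fatal.

```java
import java.util.List;

// A minimal sketch, not JHOVE's actual code or API. "PdfObject"
// stands in for whatever the parser produces for each element
// of an Annots array.
public class AnnotsCheck {

    interface PdfObject {}

    static class PdfDictionary implements PdfObject {}

    static class PdfKeyword implements PdfObject {
        final String token;
        PdfKeyword(String token) { this.token = token; }
    }

    // The "fundamentalist" check: the spec says every element of
    // Annots is an annotation dictionary, so anything else fails.
    static boolean strictlyValid(List<PdfObject> annots) {
        return annots.stream().allMatch(o -> o instanceof PdfDictionary);
    }

    // A hypothetical policy-driven check: a configuration flag
    // decides whether a known deviation (a bare keyword such as
    // "Annot") is merely reported or treated as invalid.
    static boolean validUnderPolicy(List<PdfObject> annots,
                                    boolean tolerateBareAnnotKeyword) {
        for (PdfObject o : annots) {
            if (o instanceof PdfDictionary) {
                continue;
            }
            if (tolerateBareAnnotKeyword
                    && o instanceof PdfKeyword
                    && ((PdfKeyword) o).token.equals("Annot")) {
                System.err.println("Note: keyword in Annots array");
                continue;
            }
            return false;
        }
        return true;
    }

    public static void main(String[] args) {
        List<PdfObject> annots =
                List.of(new PdfDictionary(), new PdfKeyword("Annot"));
        System.out.println(strictlyValid(annots));           // false
        System.out.println(validUnderPolicy(annots, true));  // true
        System.out.println(validUnderPolicy(annots, false)); // false
    }
}
```

The boolean flag here is just a stand-in for the kind of per-deviation policy that a configuration file would carry.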