My venture into the Techno-Liberty blog didn’t work so well. In fact, I’m getting more views on this blog, in spite of not having posted in months, than I got on my best days on the other blog. So … I’m back.
JHOVE is still doing well too, thanks to excellent work by Carl Wilson and others at the Open Preservation Foundation. There will be an online hack day for JHOVE on April 27. The aim is to find ways to improve JHOVE by improving error reporting, collecting example files, and documenting the preservation impact of JHOVE validation issues. (I think that last one means “Why does McGath’s PDF module suck?” :)
The time listed is 8 AM-8 PM. I asked what time zone that is, and was told it means any and all, from New Zealand the long way around to Hawaii.
Last time I said I’d drop in and didn’t really manage to. This time I won’t make promises, but I’ll try to be around in some form. If nothing else, people can ask me questions about JHOVE in the comments.
I’m trying not to make this sound like a comment on the PDF module, as I’m a new user with questions and don’t yet have any useful comments. We (Canadiana.org) have been wanting to adopt JHOVE validation, but PDF files have been a barrier. We use ABBYY to generate single-page PDF files, which JHOVE has no problem with, but when we use pdftk or poppler to join them, we end up with files that don’t validate. I’ve asked questions in the JHOVE and poppler email forums about this, and I’ve also sent an email to the author of pdftk.
Are PDF files simply hard to identify and validate? Should people who want to retain documents in a long-term archive be looking at something other than PDF? I found http://verapdf.org, but all it did was confirm what JHOVE had already told me, that the files weren’t valid. It didn’t get me any closer to knowing which narrow set of tools can be used to manage PDF files and keep them valid according to JHOVE and veraPDF.
There are three big issues.
The first problem is that PDF is a really complicated format, and the spec isn’t always clear on what is allowed. The spec tends to describe recommended practices rather than permitted ones. When I followed the recommended practices for JHOVE, I discovered later on that alternative data types are often used and accepted. In practice, anything that Acrobat accepts is the de facto standard.
The second problem is with PDF/A, which many people would really like JHOVE to identify accurately. It’s implemented as a “profile” of PDF. The code is trying to validate the file as PDF, meanwhile noting whether or not it’s PDF/A compliant as it goes. Among other problems, this means it doesn’t tell you _why_ a file isn’t PDF/A compliant. A separate PDF/A module (sharing most of the PDF codebase) might be a better approach.
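To make that point a bit more concrete, here’s a minimal Python sketch of a profile check that records its reasons instead of just flipping a flag. This is not JHOVE code (JHOVE is written in Java), and everything in it is hypothetical: the `ProfileResult` class, the `check_pdfa_like` function, and the dictionary standing in for a parsed document are invented for illustration. The two rules (no encryption, fonts must be embedded) are simplified stand-ins for real PDF/A requirements.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ProfileResult:
    """Outcome of checking one profile, with the reasons for any failure."""
    profile: str
    reasons: List[str] = field(default_factory=list)

    @property
    def compliant(self) -> bool:
        # A file is compliant only if no rule recorded a reason.
        return not self.reasons


def check_pdfa_like(doc: dict) -> ProfileResult:
    """Toy PDF/A-style check over a hypothetical, already-parsed document.

    `doc` is a made-up dictionary, not a real JHOVE or PDF structure.
    """
    result = ProfileResult(profile="PDF/A (toy rules)")
    if doc.get("encrypted", False):
        result.reasons.append("document is encrypted")
    for font in doc.get("fonts", []):
        if not font.get("embedded", False):
            result.reasons.append(f"font {font['name']!r} is not embedded")
    return result


# Hypothetical parsed document with one font that is not embedded.
sample = {
    "encrypted": False,
    "fonts": [
        {"name": "Helvetica", "embedded": False},
        {"name": "Times-Roman", "embedded": True},
    ],
}
report = check_pdfa_like(sample)
print(report.compliant)
for reason in report.reasons:
    print("-", reason)
```

The design choice is simply that each failed rule appends a message, so “not compliant” always comes with a list explaining why. A separate PDF/A module could be built around a result like this.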
Third, by design JHOVE doesn’t look at data streams (admittedly an ill-defined concept). This was a matter of the resources that were available to produce all the modules at Harvard. JHOVE doesn’t do full validation on files, and will pass files that might be unrenderable.
These problems reflect PDF’s complexity, which raises the question of whether it’s too messy a format for long-term preservation. Other formats, like EPUB, are simpler and more transparent. How serious a problem this is remains a subject of ongoing debate.