I’m working toward a Kickstarter proposal that will cover what I’ll try to make two weeks of work. That will let me set a relatively modest funding goal, which seems wise for a first project. As I look at things more closely, a PDF 1.7 upgrade as such looks like the wrong way to go; I’m not seeing PDF 1.7 features that break JHOVE. What I’m seeing, rather, is an assortment of problems in the PDF module that can break files for reasons not closely tied to their version, and some features that would be very nice to add.
Here’s a first go-around on issues with JHOVE 1.8 which I’m considering addressing. I’m open to other issues as well. Comments on which of these are most important would help me to set up my project proposal.
• PDF files that open with Acrobat and with OS X Preview are being declared not well-formed or not valid because JHOVE is encountering one kind of PDF object when it expects another. Figuring this out may require some digging into the specs and implementation notes. Bug reports which I’ve added myself on SourceForge include Casting exceptions in PDF module, Problem with PDF annotation dictionaries, and PDF module doesn’t recognize all encryption algorithms.
• The PDF module recognizes PDF/X through version 3 but not 4. Adding a profile for PDF/X-4 looks simple.
• The PDF module recognizes PDF/A-1 but not 2 and 3. This could be a significant amount of work, possibly less if I incorporate a third-party library.
• The PDF module doesn’t recognize encryption algorithm 5, although it’s been around since PDF 1.5. The fix for this is easy.
• In general, PDF module error messages are less useful than they should be. More specific information and better logging would help in diagnosing any issues.
• The optional document requirements dictionary is a new feature in 1.7. This gives information on what an application needs to do to process the document correctly, which sounds like a useful preservation feature. Reporting a requirements property would be good.
• “Portable collections” of embedded files are a new feature in PDF 1.7. This sounds like a property worth reporting.
• AIFF files created by iTunes, and perhaps by other software, use the “ID3” chunk for metadata. The AIFF module knows nothing about it. Parsing this chunk and reporting the metadata might be useful. I’ve already written code that does this for another project.
• The UTF-8 module is at Unicode 6.0, and Unicode is at 6.2. Updating this is straightforward grunt work.
• There’s been a request to check if TIFF files share storage between tags, which is not allowed by the spec. This most often happens by design when two values, such as XResolution and YResolution, are known to have the same value, but could also indicate a corrupted file.
• There has been some discussion of checking whether TIFF tile and strip content goes outside file boundaries. There are limits on how rigorously that can be done, since lengths depend on compression methods that are outside JHOVE’s scope, but some idiot-proofing is possible.
All of this together is far more than two weeks of work, of course. Which of these are most important to you? What else should be on the list?