I’m working toward a Kickstarter proposal that will cover what I’ll try to make two weeks of work. That will let me set a relatively modest funding goal, which seems wise for a first project. As I look at things more closely, a PDF 1.7 upgrade as such looks like the wrong way to go; I’m not seeing PDF 1.7 features that break JHOVE. What I’m seeing, rather, is an assortment of problems in the PDF module that can break files for reasons not closely tied to their version, and some features that would be very nice to add.
Here’s a first go-around on issues with JHOVE 1.8 which I’m considering addressing. I’m open to other issues as well. Comments on which of these are most important would help me to set up my project proposal.
• PDF files that open with Acrobat and with OS X Preview are being declared not well-formed or not valid because JHOVE is encountering one kind of PDF object when it expects another. Figuring this out may require some digging into the specs and implementation notes. Bug reports which I’ve added myself on SourceForge include Casting exceptions in PDF module, Problem with PDF annotation dictionaries, and PDF module doesn’t recognize all encryption algorithms.
• The PDF module recognizes PDF/X through version 3 but not 4. Adding a profile for PDF/X-4 looks simple.
• The PDF module recognizes PDF/A-1 but not 2 and 3. This could be a significant amount of work, possibly less if I incorporate a third-party library.
• The PDF module doesn’t recognize encryption algorithm 5, although it’s been around since PDF 1.5. The fix for this is easy.
• In general, PDF module error messages are less useful than they should be. More specific information and better logging would help in diagnosing any issues.
• The optional document requirements dictionary is a new feature in 1.7. This gives information on what an application needs to do to process the document correctly, which sounds like a useful preservation feature. Reporting a requirements property would be good.
• “Portable collections” of embedded files are a new feature in PDF 1.7. This sounds like a property worth reporting.
• AIFF files created by iTunes, and perhaps by other software, use the “ID3” chunk for metadata. The AIFF module knows nothing about it. Parsing this chunk and reporting the metadata might be useful. I’ve already written code that does this for another project.
• The UTF-8 module is at Unicode 6.0, and Unicode is at 6.2. Updating this is straightforward grunt work.
• There’s been a request to check if TIFF files share storage between tags, which is not allowed by the spec. This most often happens by design when two values, such as XResolution and YResolution, are known to have the same value, but could also indicate a corrupted file.
• There has been some discussion of checking whether TIFF tile and strip content goes outside file boundaries. There are limits on how rigorously that can be done, since lengths depend on compression methods that are outside JHOVE’s scope, but some idiot-proofing is possible.
All of this together is far more than two weeks of work, of course. Which of these are most important to you? What else should be on the list?
The disappearing format blues
Old formats sometimes fade into obscurity and can no longer be supported, even if they come from a big company like Microsoft. Chris Rusbridge has noted that Microsoft’s Open Specifications page only goes as far back as Office 97, and that PowerPoint 4.0 files can’t be opened with today’s Microsoft Office. Tony Hey at Microsoft has replied. (Hey is vice president of Microsoft Research Connections). The response was encouraging, particularly in suggesting that Microsoft might “participate in a ‘crowd source’ project working with archivists to create a public spec of these old file formats.”
There’s usually some kind of software around that can read old formats. A search turns doesn’t turn up a lot; there’s something called PowerPressed, which will wrap old PowerPoint files in a .exe application. It looks as if it should run on current Windows systems, but all I know is what that page says.
The situation shows the risk of using a format that isn’t publicly documented. Today this is less of a problem. I think it’s been shown that publishing format specs doesn’t lead to cannibalization of sales by competing software; the company that created the spec is in a position to produce the best implementation. The description of PDF is fully public, and Adobe still dominates the market for PostScript software. Publishing the spec has just made the pie bigger. There’s still quite a lot of software that uses unpublished proprietary specs, though, and it’s risky to rely on the long-term reliability of the files they produce.
1 Comment
Posted in commentary
Tagged Microsoft, preservation, standards