I’m working toward a Kickstarter proposal scoped to about two weeks of work. That will let me set a relatively modest funding goal, which seems wise for a first project. As I look at things more closely, a PDF 1.7 upgrade as such looks like the wrong way to go; I’m not seeing PDF 1.7 features that break JHOVE. What I’m seeing, rather, is an assortment of problems in the PDF module that can break files for reasons not closely tied to their version, and some features that would be very nice to add.
Here’s a first go-around on issues with JHOVE 1.8 which I’m considering addressing. I’m open to other issues as well. Comments on which of these are most important would help me to set up my project proposal.
• PDF files that open with Acrobat and with OS X Preview are being declared not well-formed or not valid because JHOVE is encountering one kind of PDF object when it expects another. Figuring this out may require some digging into the specs and implementation notes. Bug reports which I’ve added myself on SourceForge include “Casting exceptions in PDF module,” “Problem with PDF annotation dictionaries,” and “PDF module doesn’t recognize all encryption algorithms.”
• The PDF module recognizes PDF/X through version 3 but not 4. Adding a profile for PDF/X-4 looks simple.
• The PDF module recognizes PDF/A-1 but not 2 and 3. This could be a significant amount of work, possibly less if I incorporate a third-party library.
• The PDF module doesn’t recognize encryption algorithm 5, although it’s been around since PDF 1.5. The fix for this is easy.
• In general, PDF module error messages are less useful than they should be. More specific information and better logging would help in diagnosing any issues.
• The optional document requirements dictionary is a new feature in 1.7. This gives information on what an application needs to do to process the document correctly, which sounds like a useful preservation feature. Reporting a requirements property would be good.
• “Portable collections” of embedded files are a new feature in PDF 1.7. This sounds like a property worth reporting.
• AIFF files created by iTunes, and perhaps by other software, use the “ID3” chunk for metadata. The AIFF module knows nothing about it. Parsing this chunk and reporting the metadata might be useful. I’ve already written code that does this for another project.
• The UTF-8 module is at Unicode 6.0, and Unicode is at 6.2. Updating this is straightforward grunt work.
• There’s been a request to check if TIFF files share storage between tags, which is not allowed by the spec. This most often happens by design when two values, such as XResolution and YResolution, are known to have the same value, but could also indicate a corrupted file.
• There has been some discussion of checking whether TIFF tile and strip content goes outside file boundaries. There are limits on how rigorously that can be done, since lengths depend on compression methods that are outside JHOVE’s scope, but some idiot-proofing is possible.
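As a rough illustration of the two TIFF checks in the list above, here is a minimal sketch. It is not JHOVE code: the class TiffRangeCheck and its Range type are hypothetical names, and it assumes the offsets and byte counts have already been read out of the tags. It only compares byte ranges against each other and against the file length, which is about as far as a check can go without decoding the compressed data.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, not part of JHOVE.
final class TiffRangeCheck {

    /** A byte range claimed by one tag value, strip, or tile. */
    public static final class Range {
        public final String owner;
        public final long offset;
        public final long length;

        public Range(String owner, long offset, long length) {
            this.owner = owner;
            this.offset = offset;
            this.length = length;
        }

        long end() {
            return offset + length;
        }
    }

    /** Reports every range that starts or ends outside the file. */
    public static List<String> outOfBounds(List<Range> ranges, long fileLength) {
        List<String> problems = new ArrayList<>();
        for (Range r : ranges) {
            if (r.offset < 0 || r.length < 0 || r.end() > fileLength) {
                problems.add(r.owner + " extends outside the file");
            }
        }
        return problems;
    }

    /** Reports every non-empty range that overlaps one starting at or before it. */
    public static List<String> sharedStorage(List<Range> ranges) {
        List<Range> sorted = new ArrayList<>();
        for (Range r : ranges) {
            if (r.length > 0) {
                sorted.add(r);
            }
        }
        sorted.sort(Comparator.comparingLong((Range r) -> r.offset));

        List<String> problems = new ArrayList<>();
        Range furthest = null; // earlier range that reaches furthest into the file
        for (Range cur : sorted) {
            if (furthest != null && cur.offset < furthest.end()) {
                problems.add(cur.owner + " shares storage with " + furthest.owner);
            }
            if (furthest == null || cur.end() > furthest.end()) {
                furthest = cur;
            }
        }
        return problems;
    }
}
```

Whether shared storage is reported as an error or just as an informational message is a separate decision, since the XResolution/YResolution case is usually deliberate.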
All of this together is far more than two weeks of work, of course. Which of these are most important to you? What else should be on the list?
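For the AIFF item, the first step is just finding the chunk. The sketch below is not the code I mentioned having written for another project, and the class name AiffId3Finder is hypothetical. It assumes the standard AIFF layout (a FORM header, then chunks with a four-character ID, a big-endian 32-bit size, and a pad byte after odd-sized data) and assumes the chunk ID is written as "ID3 " with a trailing space. It returns the raw chunk bytes and leaves parsing the ID3 tag itself to other code.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of JHOVE's AIFF module.
final class AiffId3Finder {

    /** Returns the raw bytes of the first "ID3 " chunk, or null if no complete one is found. */
    public static byte[] findId3Chunk(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        String form = readId(data);
        data.readInt(); // overall FORM size, not needed for a simple scan
        String type = readId(data);
        if (!form.equals("FORM") || !(type.equals("AIFF") || type.equals("AIFC"))) {
            throw new IOException("Not an AIFF file");
        }
        try {
            while (true) {
                String id = readId(data);
                long size = data.readInt() & 0xFFFFFFFFL; // sizes are unsigned 32-bit
                if (id.equals("ID3 ")) { // assumed chunk ID, padded with a space
                    if (size > Integer.MAX_VALUE) {
                        throw new IOException("ID3 chunk too large");
                    }
                    byte[] body = new byte[(int) size];
                    data.readFully(body);
                    return body;
                }
                skipFully(data, size + (size & 1)); // odd-sized chunks carry a pad byte
            }
        } catch (EOFException e) {
            return null; // ran off the end of the file without a complete ID3 chunk
        }
    }

    private static String readId(DataInputStream data) throws IOException {
        byte[] id = new byte[4];
        data.readFully(id);
        return new String(id, StandardCharsets.US_ASCII);
    }

    private static void skipFully(DataInputStream data, long n) throws IOException {
        while (n > 0) {
            long skipped = data.skip(n);
            if (skipped <= 0) {
                // skip() may return 0 without being at EOF, so probe with a read
                if (data.read() < 0) {
                    throw new EOFException();
                }
                skipped = 1;
            }
            n -= skipped;
        }
    }
}
```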
Registry browser update
I’ve made some changes to the format registry browser since yesterday. Changes include a help page, ability to use the “/” (slash) character in searches (very helpful when searching MIME types), and links to the registry entries from search results (not working right for PRONOM).
I attempted to make the search fields persist through a session, but that isn’t working, even though it works on the local emulation. Hopefully I’ll figure that out.
Google App Engine is a pain to work with, even though it’s free and has a number of simplifying features. It’s good for getting a quick demo up, though.
To get started, you need to get an Eclipse plugin from Google. Then you need to create a Google web application project, which needs to be in just the structure they want. It needs to have a top-level directory called “war,” and that needs to have a file called WEB-INF/appengine-web.xml. If you’re starting a project from scratch, that’s not too heavy a requirement; other web application servers will just ignore that special file. But since I was working from an existing project, the differences were just enough that I had to create a separate Eclipse project for the Google version. Still, not too bad. The project is there and running. I don’t even need to run Ant; the plugin magically finds my classes. It also provides an emulation environment and simple uploading.
This morning I was working on a few enhancements on the main line when the Google version spontaneously rebuilt itself. The console reported:
Now there were errors in Java files which are used only in the GUI version and reference AWT and Swing classes. An example: “java.awt.Dimension is not supported by Google App Engine's Java runtime environment.” Fortunately, my code is clean enough that I could fix the problem by deleting a few classes and turning one into a stub. Still, such a pain. There certainly are web applications that use AWT for offscreen drawing, and they just won’t work with Google. There have been complaints about this.
The environment is good enough for its purpose, but I wouldn’t try to do serious work with it.