Monthly Archives: December 2010

Secrets of building JHOVE2

The current beta of JHOVE2 is rather tricky to build. With some help from Marisa Strong, I’ve managed to do it. Here’s a guide which may be helpful.

1. Download JHOVE2. If you have Mercurial, follow the instructions. Otherwise use the “Get Source” menu item to get the .gz file.

2. Get a current version of Maven if you don’t have one.

3. If you got the gzip file, expand it and the tarball which it contains. This will create a main directory.

4. cd main. The first recommendation is to run mvn compile, but this apparently requires an environment which isn’t released yet, so instead do

mvn assembly:assembly -DskipTests

5. cd into the target directory. This will contain a zip file. Unzip it in place.

6. The directory jhove2-2.0.0 was just created. cd into it. This contains the launch script. Run it from the command line with no arguments, and you’ll get a usage message if everything worked correctly.

To do stuff with JHOVE2, the user guide (PDF) is helpful.

New look

I’ve changed over this blog to WordPress’s “Garland” theme. It makes better use of screen width, and I think it looks nicer in general. The CSS may get some tweaking for better contrast of text against background.

Discussion of file format registries

Andy Jackson, blogging with the Open Planets Foundation, has an interesting post on where format registries should be going.

PDF/A-2 ratified

This time it’s from the PDF/A Competence Center, so I’m pretty sure it’s real: On November 30, the committee for ISO 19005 met in Ottawa and ratified Part 2 of ISO 19005, aka PDF/A. PDF/A is a restricted profile for PDF which is designed to guarantee long-term usability of conforming files.

The previous version, PDF/A-1, was based on PDF 1.4. This is based on ISO 32000-1, which is equivalent to PDF 1.7. Valid PDF/A-1 files are also valid under PDF/A-2.

ISO 19005:2005, or PDF/A-1, is available for purchase from ISO, but as of this writing the new one, which presumably will be ISO 19005:2010, isn’t being offered online yet.

I can’t make any promises about when JHOVE will support PDF/A-2, if ever. Any work I do on it is on my own time. Of course, if someone else wants to run with it, the source is there and I can answer questions.

Misadventures in XML

Around 6 PM yesterday, our SMIL file delivery broke. At first I figured it for a database connection problem, but the log entries were atypical. I soon determined that retrieval of the SMIL DTD was regularly failing. Most requests would get an error, and those that did succeed took over a minute.

There’s a basic flaw in XML DTDs and schemas (collectively called grammars). They’re identified by a URL, and by default any parser that validates documents by their grammar retrieves it from that URL. For popular ones, that means a lot of traffic. We’ve run into that problem with the JHOVE configuration schema, and that’s nowhere near the traffic a really popular schema must generate.

Knowing this, and also knowing that depending on an outside website’s staying up is a bad idea, we’ve made our own local copy of the SMIL DTD to reference. So I was extremely puzzled about why access to it had become so terrible. After much head-scratching, I discovered a bug in the code that kept the redirection to the local DTD from working; we had been going to the official URL, which lives on the W3C’s server, all along.

Presumably W3C is constantly hammered by requests for grammars which it originates, and presumably it’s fighting back by greatly lowering the priority of the worst offenders. Its server wasn’t blocking the requests altogether; that would have been easier to diagnose. The priority just got so low that most requests timed out.

Once I figured that out, I put in the fix to access the local DTD URL, and things are looking nicer now. Moving the fix to production will take a couple of days but should be routine.
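For anyone who wants to try the same kind of redirection, here is a minimal sketch in Java using the standard SAX EntityResolver hook. It is not our production code. The class name LocalDtdResolver, its register method, and the choice to key local copies by public identifier are all assumptions made for illustration, and a real setup would more likely read the local copy from a file than hold it in a string.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.EntityResolver;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

// Hypothetical helper: serves registered DTDs from local copies
// instead of letting the parser fetch them over the network.
class LocalDtdResolver implements EntityResolver {
    // Public identifier -> text of the local DTD copy (illustrative storage).
    private final Map<String, String> localCopies = new HashMap<>();
    // Counts how many lookups were answered locally.
    int hits = 0;

    void register(String publicId, String dtdText) {
        localCopies.put(publicId, dtdText);
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        String dtd = localCopies.get(publicId);
        if (dtd == null) {
            // Not registered: null tells the parser to fetch systemId itself.
            return null;
        }
        hits++;
        InputSource source = new InputSource(new StringReader(dtd));
        source.setPublicId(publicId);
        source.setSystemId(systemId);
        return source;
    }

    // Parses and validates xml, resolving external entities through resolver.
    static Document parse(String xml, EntityResolver resolver) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        builder.setEntityResolver(resolver);
        // Turn validation errors into exceptions instead of stderr messages.
        builder.setErrorHandler(new ErrorHandler() {
            public void warning(SAXParseException e) { }
            public void error(SAXParseException e) throws SAXParseException { throw e; }
            public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
        });
        return builder.parse(new InputSource(new StringReader(xml)));
    }
}
```

The detail that bit us is the fallback: when the resolver returns null, the parser quietly goes to the official URL, so a broken redirection looks exactly like a working one until the remote server slows down. Counting local hits, as the hits field does here, is one cheap way to notice.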

The problem is inherent in XML: The definition of grammars is tied to a specific Web location. Aside from the problem of heavy traffic to that location, this means the longevity of the grammar is tied to the longevity of the URL. It takes extra effort to make a local copy, and anyone starting out isn’t likely to encounter throttling right away, so the law of least effort says most people won’t bother to.

This got me wondering, as I started writing this post, why don’t parsers like Xerces cache grammars? It turns out that Xerces can cache grammars, though by default it doesn’t. As far as I can tell, this isn’t a well-known feature, and again the law of least effort implies that a lot of developers won’t take advantage of it. But it looks like a very useful thing. It should really be enabled by default, though I can understand why its implementers took the more cautious approach.
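Xerces has its own grammar-caching machinery, and I won’t try to reproduce its API from memory here. As a simpler stand-in that works with any JAXP parser, here is a sketch of a resolver that fetches each DTD once and reuses the text afterward. Note that it caches the raw DTD text, not the parsed grammar, so it saves the network round trip but not the parsing work. The class name CachingDtdResolver and the fetcher function are hypothetical, and the sketch is single-threaded.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

// Hypothetical stand-in for grammar caching: remembers DTD text by system ID.
class CachingDtdResolver implements EntityResolver {
    private final Map<String, String> cache = new HashMap<>();
    // Assumed to map a system ID (URL) to the DTD text, e.g. via an HTTP GET.
    private final Function<String, String> fetcher;
    // Counts how many times the fetcher was actually called.
    int fetches = 0;

    CachingDtdResolver(Function<String, String> fetcher) {
        this.fetcher = fetcher;
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        // Only call the fetcher the first time a system ID is seen.
        String text = cache.computeIfAbsent(systemId, id -> {
            fetches++;
            return fetcher.apply(id);
        });
        if (text == null) {
            // Fetcher had nothing: let the parser use its default behavior.
            return null;
        }
        InputSource source = new InputSource(new StringReader(text));
        source.setPublicId(publicId);
        source.setSystemId(systemId);
        return source;
    }

    // Parses xml with a fresh parser that shares the given resolver.
    static Document parse(String xml, EntityResolver resolver) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        builder.setEntityResolver(resolver);
        return builder.parse(new InputSource(new StringReader(xml)));
    }
}
```

Sharing one resolver instance across parsers is what makes the cache pay off. It also means a stale copy stays in use until the process restarts, which may be part of why caching is off by default.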

JHOVE2 goes to beta

The JHOVE2 team has announced a beta release:

This beta code release supports all the major technical objectives of the project, including a more sophisticated, modular architecture; signature-based file identification; policy-based assessment of objects; recursive characterization of objects comprising aggregate files and files arbitrarily nested in containers; and extensive configuration and reporting options. The release also continues to fill out the roster of supported formats, with modules for ICC color profiles, SGML, Shapefile, TIFF, UTF-8, WAVE, and XML.

The source code page provides the source as a Mercurial repository, or as a single download. The gzip download expands into a file called main-14e8a6102f63 and it isn’t at all obvious what to do with it. Chmoding it to an executable and running it doesn’t work. I’ve asked what this is supposed to be; I’ll update this post when I get a response.

Update: That’s a tarball. Adding the .tar extension and using tar -xvf works nicely.