File identification tools, part 9: JHOVE2

The story of JHOVE2 is a rather sad one, but I need to include it in this series. As the name suggests, it was supposed to be the next generation of JHOVE. Stephen Abrams, the creator of JHOVE (I only implemented the code), was still at Harvard, and so was I. I would have enjoyed working on it, getting things right that the first version got wrong. However, Stephen accepted a position with the California Digital Library (CDL), and that put an end to Harvard’s participation in the project. I thought about applying for a position in California but decided I didn’t want to move west. I was on the advisory board but didn’t really do much, and I had no involvement in the programming. I’m not saying I could have written JHOVE2 better, just explaining my relationship to the project.

The institutions that did work on it were CDL, Portico, and Stanford University. There were two problems with the project. The big one was insufficient funding; the money ran out before JHOVE2 could boast a set of modules comparable to JHOVE. A secondary problem was usability. It’s complex and difficult to work with. I think if I’d been working on the project, I could have helped to mitigate this. I did, after all, add a GUI to JHOVE when Stephen wasn’t looking.

JHOVE has some problems that needed fixing. It quits its analysis on the first error. It’s unforgiving on identification; a TIFF file with a validation error simply isn’t a TIFF file, as far as it’s concerned. Its architecture doesn’t readily accommodate multi-file documents. It deals with embedded formats only on a special-case basis (e.g., Exif metadata in non-TIFF files). Its profile identification is an afterthought. JHOVE2 provided better ways to deal with these issues. The developers wrote it from scratch, and it didn’t aim for any kind of compatibility with JHOVE.
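To make two of those complaints concrete, here is a minimal sketch of the alternative: identify by signature first, then validate separately and keep going after the first problem. This is not code from JHOVE or JHOVE2, and it isn't a real TIFF validator. The `Report` class, the `characterize` function, and the three header checks are all made up for illustration; the only facts borrowed from the TIFF format are the two 4-byte signatures and the 4-byte offset to the first image file directory that follows them.

```python
import struct
from dataclasses import dataclass, field
from typing import List, Optional

# TIFF signatures mapped to the struct byte-order flag they imply.
TIFF_BYTE_ORDER = {b"II*\x00": "<", b"MM\x00*": ">"}


@dataclass
class Report:
    """Hypothetical result object: identification and validity are separate."""
    format: Optional[str] = None          # what the signature says the file is
    errors: List[str] = field(default_factory=list)

    @property
    def valid(self) -> bool:
        return self.format is not None and not self.errors


def characterize(data: bytes) -> Report:
    report = Report()
    order = TIFF_BYTE_ORDER.get(data[:4])
    if order is None:
        return report                     # signature doesn't match: not identified
    report.format = "TIFF"                # identified, whatever validation finds
    if len(data) < 8:
        report.errors.append("header is truncated")
        return report                     # nothing further can be checked
    (ifd_offset,) = struct.unpack(order + "I", data[4:8])
    # Every check runs; none of them stops the analysis.
    if ifd_offset < 8:
        report.errors.append("first IFD offset points into the header")
    if ifd_offset % 2:
        report.errors.append("first IFD offset is not on a word boundary")
    if ifd_offset >= len(data):
        report.errors.append("first IFD offset is past the end of the file")
    return report
```

A file with a good signature and a bad offset comes back with `format` set to "TIFF", `valid` false, and every problem listed, rather than being rejected as "not a TIFF" at the first failure.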

JHOVE2 is available as open-source software under the BSD license. The source code is on Bitbucket. Version 2.1.0 requires Java 6 or higher and, if the SGML module is used, the OpenSP SGML parser. It supports the ARC, GZIP, ICC color profile, SGML, Shapefile, TIFF, UTF-8, WARC, WAVE, and XML formats. NetCDF has “third-party development underway.” Three of these formats (ARC, GZIP, and WARC) are package formats for holding other files, taking advantage of JHOVE2’s design for processing nested content. The Shapefile module is an example of processing multi-file documents. There’s also an “Identifier” module which runs the DROID 6 identifier. PDF was on the schedule but still isn’t supported. (PDF is tough.)
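The idea behind processing nested content can be sketched in a few lines: when a file turns out to be a container, unpack it and run the same characterization on what's inside, building a tree of results. The sketch below uses only GZIP, since Python's standard library can unpack it. It says nothing about how JHOVE2 actually implements this; `identify`, `characterize_nested`, the dictionary layout, and the `max_depth` guard are all assumptions made for the example, and the signature table is a toy stand-in for a real identifier such as DROID.

```python
import gzip
import zlib


def identify(data: bytes) -> str:
    """Toy signature check standing in for a real identification step."""
    if data[:2] == b"\x1f\x8b":
        return "GZIP"
    if data[:4] in (b"II*\x00", b"MM\x00*"):
        return "TIFF"
    if data[:5] == b"<?xml":
        return "XML"
    return "unknown"


def characterize_nested(data: bytes, depth: int = 0, max_depth: int = 8) -> dict:
    """Return a tree: one node per byte stream, children for unpacked content."""
    fmt = identify(data)
    node = {"format": fmt, "size": len(data), "children": []}
    if fmt == "GZIP" and depth < max_depth:
        try:
            inner = gzip.decompress(data)
        except (OSError, EOFError, zlib.error) as exc:
            # Record the problem on this node instead of abandoning the tree.
            node["error"] = str(exc)
        else:
            node["children"].append(
                characterize_nested(inner, depth + 1, max_depth)
            )
    return node
```

Calling `characterize_nested` on an XML document that has been gzipped twice yields a GZIP node containing a GZIP node containing an XML node. The depth limit is there so that a maliciously nested file can't recurse forever.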

The user guide gives an idea of the difficulty in using it. The installation section is over eleven pages long, and configuration is eight and a half pages. The assessment rule feature is powerful but the rule language is complex. Getting JHOVE2 to work in a production environment takes a serious commitment.

It’s not clear how widely used JHOVE2 is. I haven’t heard anything from libraries or archives that incorporate it into their production workflow. A query on Twitter resulted in several retweets but no responses. With a few more modules and some work on ease of use, it might have eclipsed JHOVE as it should have.

Update: The Bibliothèque Nationale de France mentions using JHOVE2 for characterizing Internet archive files.

Next: TBA. To read this series from the beginning, start here.
