Tag Archives: jhove2

File identification tools, part 9: JHOVE2

The story of JHOVE2 is a rather sad one, but I need to include it in this series. As the name suggests, it was supposed to be the next generation of JHOVE. Stephen Abrams, the creator of JHOVE (I only implemented the code), was still at Harvard, and so was I. I would have enjoyed working on it, getting things right that the first version got wrong. However, Stephen accepted a position with the California Digital Library (CDL), and that put an end to Harvard’s participation in the project. I thought about applying for a position in California but decided I didn’t want to move west. I was on the advisory board but didn’t really do much, and I had no involvement in the programming. I’m not saying I could have written JHOVE2 better, just explaining my relationship to the project.

The institutions that did work on it were CDL, Portico, and Stanford University. There were two problems with the project. The big one was insufficient funding; the money ran out before JHOVE2 could boast a set of modules comparable to JHOVE. A secondary problem was usability. It’s complex and difficult to work with. I think if I’d been working on the project, I could have helped to mitigate this. I did, after all, add a GUI to JHOVE when Stephen wasn’t looking.

JHOVE has some problems that needed fixing. It quits its analysis on the first error. It’s unforgiving on identification; a TIFF file with a validation error simply isn’t a TIFF file, as far as it’s concerned. Its architecture doesn’t readily accommodate multi-file documents. It deals with embedded formats only on a special-case basis (e.g., Exif metadata in non-TIFF files). Its profile identification is an afterthought. JHOVE2 provided better ways to deal with these issues. The developers wrote it from scratch, and it didn’t aim for any kind of compatibility with JHOVE.

A new home for JHOVE

Over a decade ago, the Harvard University Libraries took me on as a contractor to start work on JHOVE. Later I became an employee, and JHOVE formed an important part of my work. When I left Harvard, I asked for continued “custody” of JHOVE so I could keep maintaining it, and got it. Over time it became less of a priority for me; there’s only so much time you can devote to something when no one’s paying you to do it.

After a long period of discussion, the Open Preservation Foundation (formerly the Open Planets Foundation) has taken up support of JHOVE. In addition to picking up the open source software, it has resolved copyright issues in the documentation with Harvard. These were really over boilerplate that no one intended to enforce, but they were still an issue that had to be cleared.

Stephen Abrams, who was the real father of JHOVE, said, “We’re very pleased to see this transfer of stewardship responsibility for JHOVE to the OPF. It will ensure the continuity of maintenance, enhancement, and availability between the original JHOVE system and its successor JHOVE2, both key infrastructural components in wide use throughout the digital library community.”

JHOVE2 was originally supposed to be the successor to JHOVE, but it didn’t get enough funding to cover all the formats that JHOVE covers, so both are used, and the confusion of names is unfortunate. OPF has both in its portfolio. It doesn’t appear to have forked JHOVE to its GitHub repository yet, but I’m sure that’s coming soon.

My own GitHub repository for JHOVE should now be considered archival. Go forth and prosper, JHOVE.

JHOVE2 2.1.0

It’s been a long wait, but version 2.1.0 of JHOVE2 is now out! Sheila Morrissey writes:

Version 2.1.0 of JHOVE2 includes 3 new format modules, 1 new identifier module, 1 new displayer module, and several bug fixes and enhancements from the Issues page on the JHOVE2 wiki.

The new format modules included in this release are for the ARC, WARC, and GZIP formats.

The new Identifier module uses the UNIX “file” utility, giving JHOVE2 users the choice of employing either DROID or file for identification of file formats.

The new XSLDisplayer module (which extends XMLDisplayer) can do XSLT transformations on the XML output before displaying it.

This release also reflects a new milestone in the JHOVE2 development community. The new format and identifier modules are the contribution of developers from institutions (Bibliothèque nationale de France and NETARKIVET.DK) beyond the original project participants (California Digital Library, Portico, and Stanford University Libraries).

The release notes are available on the project site.

Congratulations to everyone who helped bring this release out!

Getting JHOVE2 to build

There’s a private beta, which should soon be public, of a digital preservation area on StackExchange.com. I took advantage of my invitation to it to ask about something that had stalled me a while ago when I tried to download and build JHOVE2. A quick reply told me that the needed change is simple, just one line in the pom.xml file. I can’t link to my question and the answer on Stack Exchange, since a login is required to view it, but it turns out this issue had already been brought up in a JHOVE2 ticket. The discussion indicates some confusion about whether the issue has been fixed in the main JHOVE2 repository, but Sheila Morrissey has a fork on Bitbucket with the fix.

The fix is to change the URL for “JBoss Repository” in pom.xml to the following:

<url>https://repository.jboss.org/nexus/content/repositories/thirdparty-releases/</url>

Kevin Clarke, who provided the answer, recommends building with the following command line to avoid error messages in the tests:

mvn -DskipTests=true install
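
For anyone who would rather script the change than edit pom.xml by hand, here is a minimal sketch. It assumes the <url> element comes after a <name>JBoss Repository</name> line inside the same <repository> block, which I haven't checked against every revision of the pom. The function name patch_jboss_url is just for illustration. It writes the patched file to standard output, so nothing is overwritten until you've looked at the result.

```shell
#!/bin/sh
# Hypothetical helper: print a copy of a pom.xml with the "JBoss Repository"
# URL replaced by the one given above. The original file is left untouched.
# Assumption: <url> follows <name>JBoss Repository</name> in the same block.
patch_jboss_url() {
    pom=${1:-pom.xml}
    new_url='https://repository.jboss.org/nexus/content/repositories/thirdparty-releases/'
    # Only touch <url> lines between the repository's name and its closing tag.
    sed "/<name>JBoss Repository<\/name>/,/<\/repository>/ s|<url>[^<]*</url>|<url>$new_url</url>|" "$pom"
}
```

Something like patch_jboss_url pom.xml > pom.xml.new, followed by a diff against the original and a mv, should do it. After that, run the mvn command above.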

Contributors to JHOVE2

The JHOVE2 project has issued a governance document (PDF) for contributors to the JHOVE2 project. Stephen Abrams writes that “we believe it important to enlist the efforts of the wider user community in future efforts. Working collectively, we can most effectively take advantage of opportunities to enhance and extend the utility of JHOVE2, especially in times of significant constraints on local institutional resources.”

Workshop on preservation and JHOVE2

A workshop on digital preservation and JHOVE2 will be held at FAO (Food and Agriculture Organization of the United Nations) in Rome, Italy on May 23-27. Presenters will include Stephen Abrams and Perry Willett from California Digital Library, Tom Cramer from Stanford, and Sheila Morrissey from Portico. Days 1 and 2 (on preservation) are free; there is a $300 fee for the JHOVE2 tutorial.

JHOVE2 2.0.0

JHOVE2 2.0.0 has been released. Supported formats are ICC Color Profile, SGML, Shapefile, TIFF, UTF-8, WAVE, and XML. The first three of these aren’t supported by the old JHOVE. There’s also a Zip module which validates files within a Zip repository, but not the Zip file itself. JHOVE2 can be downloaded in Zip or Gzip form, or from the Mercurial repository.

Congratulations to everyone who worked on this project!

JHOVE2 tutorial at IS&T Archiving

Forwarded from Stephen Abrams:

The JHOVE2 project team will be presenting a one day tutorial on the use of JHOVE2 at the IS&T Archiving conference on May 16.

http://www.imaging.org/ist/conferences/archiving/index.cfm

Description

JHOVE2 is an open source framework and application for next generation format-aware characterization of digital objects. Characterization is the process of deriving representation information about a formatted digital object that is indicative of its significant nature and useful for purposes of classification, analysis, and use in digital curation, preservation, and repository contexts. JHOVE2 builds on the success of the original JHOVE characterization tool by addressing known limitations and offering significant new functions, including: object-focused, rather than file-focused, characterization; signature-based file level identification using DROID; aggregate-level identification based on configurable file system naming conventions; rules-based assessment to support determinations of object acceptability in addition to validation conformity; and extensive user configuration options.

The 2011 release of JHOVE2 represents the availability of a significant new tool for digital preservation; this course will provide a broad overview of JHOVE2, as well as detailed information on its functionality, architecture, use in local workflows, and open source community.

Course Objectives:

This short course will give attendees both a broad conceptual overview and detailed information on JHOVE2, and equip them to use the open source tool in their local environments. Specifically, the course will:

  • Define the role of file characterization, including identification, feature extraction, validation, and assessment, in digital curation and preservation workflows.
  • Review the functionality of the JHOVE2 application, including the significant enhancements relative to JHOVE, and new capabilities based on object- and aggregate-level characterization
  • Detail the architecture, componentry, design patterns and Java API’s of the JHOVE2 framework, as well as the configuration options for plug-in modules, characterization strategies and results formatting
  • Demonstrate the use of JHOVE2’s new rule-based assessment capabilities, and integrating these into local workflows to determine object acceptability
  • Cover the community framework for the project, and how individual institutions can both contribute new format modules as well as resources to help extend and sustain the open source project.

Intended Audience:

This course is designed for technologists and practitioners (developers, managers, analysts and administrators) engaged in digital curation, preservation, and repository activities, and whose work is dependent on an understanding of the format and pertinent characteristics of digital assets.

Secrets of building JHOVE2

The current beta of JHOVE2 is rather tricky to build. With some help from Marisa Strong, I’ve managed to do it. Here’s a guide which may be helpful.

1. Download JHOVE2. If you have Mercurial, follow the instructions. Otherwise use the “Get Source” menu item to get the .gz file.

2. Get a current version of Maven if you don’t have one.

3. If you got the gzip file, expand it and the tarball which it contains. This will create a main directory.

4. cd main. The first recommendation is to run mvn compile, but this apparently requires an environment which isn’t released yet, so instead do

mvn assembly:assembly -DskipTests

5. cd into the target directory. This will have the file jhove2-2.0.0.zip. Unzip this in place.

6. The directory jhove2-2.0.0 was just created. cd into it. This contains the script jhove2.sh. Run this from the command line with no arguments, and you’ll get a usage message if everything worked correctly.
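
Steps 4 through 6 can be strung together in a small script. This is only a sketch: the function names run and build_jhove2 and the DRY_RUN switch are my own inventions, and the defaults (a source directory called main, version 2.0.0) are just the values from the steps above. With DRY_RUN=1 it prints the commands instead of running them, which is handy for checking what it would do before Maven starts downloading things.

```shell
#!/bin/sh
# Hypothetical wrapper around steps 4-6. Nothing here is part of JHOVE2 itself.

# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

build_jhove2() {
    src_dir=${1:-main}      # directory created by expanding the tarball
    version=${2:-2.0.0}     # version used in the zip and directory names
    run cd "$src_dir" &&
    run mvn assembly:assembly -DskipTests &&
    run cd target &&
    run unzip "jhove2-$version.zip" &&
    run cd "jhove2-$version" &&
    run ./jhove2.sh         # no arguments: should print the usage message
}
```

To preview the commands, set DRY_RUN=1 in the shell and call build_jhove2. To actually build, call build_jhove2 from the directory that contains main. Each step only runs if the one before it succeeded.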

To do stuff with JHOVE2, the user guide (PDF) is helpful.

JHOVE2 goes to beta

The JHOVE2 team has announced a beta release:

This beta code release supports all the major technical objectives of the project, including a more sophisticated, modular architecture; signature-based file identification; policy-based assessment of objects; recursive characterization of objects comprising aggregate files and files arbitrarily nested in containers; and extensive configuration and reporting options. The release also continues to fill out the roster of supported formats, with modules for ICC color profiles, SGML, Shapefile, TIFF, UTF-8, WAVE, and XML.

The source code page provides the source as a Mercurial repository, or as a single download. The gzip download expands into a file called main-14e8a6102f63 and it isn’t at all obvious what to do with it. Chmoding it to an executable and running it doesn’t work. I’ve asked what this is supposed to be; I’ll update this post when I get a response.

Update: That’s a tarball. Adding the .tar extension and using tar -xvf works nicely.
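
Spelled out as a script, the whole expansion looks roughly like this. It's a sketch that assumes the downloaded file's name ends in .gz; the function name expand_jhove2_download is made up for illustration. Renaming to .tar isn't strictly needed by tar, but it keeps things recognizable afterward.

```shell
#!/bin/sh
# Hypothetical helper: expand the gzip download and the tarball inside it.
# Assumption: the argument is a file whose name ends in .gz.
expand_jhove2_download() {
    archive=$1                  # the downloaded .gz file
    inner=${archive%.gz}        # name gunzip leaves behind, e.g. main-14e8a6102f63
    gunzip "$archive" &&
    mv "$inner" "$inner.tar" && # add the .tar extension, as in the update above
    tar -xvf "$inner.tar"       # unpack into the current directory
}
```

Call it with the name of the downloaded file, from the directory where you want the source tree to end up.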