Tag Archives: JHOVE

And … JHOVE 1.9b3

Lately I’ve been writing a user guide for JHOVE as part of an upcoming
book. This means going through all the features to see how they really
work, and this has turned up a number of bugs. Among the latest fixes
are: (1) If the AIFF module encountered a little-endian file, it
treated all subsequent files as little-endian whether they were or not.
(2) Certain errors in WAVE files threw an exception from the module
instead of reporting that the file wasn’t well-formed. (3) The XML
module’s “s” and “schema” parameters conflicted, with “schema” being
treated as both, and there was a problem with schema URIs containing
upper-case characters.
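
The first bug is a classic case of per-file state leaking from one file to the next. The following is a minimal sketch of the pattern and its fix, not JHOVE’s actual AIFF code; the class and method names are hypothetical and chosen only for illustration.

```java
// Hypothetical sketch of per-file parser state; not taken from JHOVE.
class PerFileStateSketch {
    // Byte order of the file currently being parsed.
    private boolean littleEndian;

    // Must be called at the start of every file, before any chunk is read.
    // Omitting this call reproduces the bug: one little-endian file would
    // leave the flag set for every file parsed afterwards.
    void resetForNewFile() {
        littleEndian = false;
    }

    // Called when the current file declares little-endian sample data.
    void noteLittleEndian() {
        littleEndian = true;
    }

    // Combines two bytes (0..255 each) into a 16-bit value,
    // honoring the current file's byte order.
    int readUnsignedShort(int b0, int b1) {
        return littleEndian ? (b1 << 8) | b0 : (b0 << 8) | b1;
    }
}
```

With the reset in place, a big-endian file parsed after a little-endian one is read with the default byte order again.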

Version 1.9b3 should fix all of these. Hopefully I won’t find anything
else that needs fixing soon, so we can finally have a 1.9 release. But
if there are any problems with this beta, please let me know!

JHOVE 1.9b2

JHOVE 1.9b2 is up, fixing issues with the configuration file. The code for editing the configuration file from the GUI was just completely broken, but I think it’s fixed now. I can’t imagine anyone was ever trying to add init strings to modules (none of the standard ones use one anyway) or add handlers using the GUI, or someone would already have noticed. But I couldn’t stand having it not fixed, so the new build is there.

Have some JHOVE with your turkey

The trouble with being self-employed is you’re still at work on the holidays.

I’ve given up on the Kickstarter project for JHOVE, since the lack of any feedback on what features users want suggests insufficient interest, but I’m continuing with the bug fixes. There’s a new JHOVE alpha release on SourceForge, mostly to fix some silly mistakes. Please report any problems.

Here’s the new portion of the release notes:

GENERAL

  1. Jhove.java and JhoveView.java now get their version information from JhoveBase.java. Before it was redundantly kept in three places, and sometimes they didn’t all get updated for a new release. Like in 1.8.
  2. ConfigWriter was in the package edu.harvard.hul.ois.jhove.viewer, which caused a NoClassDefFoundError if non-GUI configurations didn’t include JhoveViewer.jar in the classpath. It’s been moved to edu.harvard.hul.ois.jhove.
  3. Added script packagejhove.sh and made md5.pl part of the CVS repository to make packaging for delivery easier.

TIFF MODULE

  1. TIFF files that had strip or tile offsets but no corresponding byte counts were throwing an exception all the way to the top level. Now they’re correctly being reported as invalid.
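
The general idea behind this fix is to check that the offset and byte-count arrays are consistent before using them, and to report a problem instead of letting an exception escape. A rough sketch follows, assuming a hypothetical helper class that is not part of JHOVE:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; the class name and messages are illustrative only.
class StripCheckSketch {
    /**
     * Returns a list of problems; an empty list means the two arrays are
     * consistent. Either argument may be null when its tag is absent.
     */
    static List<String> check(long[] stripOffsets, long[] stripByteCounts) {
        List<String> problems = new ArrayList<>();
        if (stripOffsets != null && stripByteCounts == null) {
            // The case from the release note: offsets without byte counts.
            problems.add("StripOffsets present but StripByteCounts missing");
        } else if (stripOffsets != null
                && stripOffsets.length != stripByteCounts.length) {
            problems.add("StripOffsets and StripByteCounts differ in length");
        }
        return problems;
    }
}
```

A caller would treat a non-empty list as grounds for reporting the file as invalid. The same check would apply to tile offsets and tile byte counts.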

XML MODULE

  1. Cleaned up reporting of schemas. Added some small classes to replace the use of string arrays for information structures.

Preliminary JHOVE fix list

I’m working toward a Kickstarter proposal that will cover what I’ll try to make two weeks of work. That will let me set a relatively modest funding goal, which seems wise for a first project. As I look at things more closely, a PDF 1.7 upgrade as such looks like the wrong way to go; I’m not seeing PDF 1.7 features that break JHOVE. What I’m seeing, rather, is an assortment of problems in the PDF module that can break files for reasons not closely tied to their version, and some features that would be very nice to add.

Here’s a first go-around on issues with JHOVE 1.8 which I’m considering addressing. I’m open to other issues as well. Comments on which of these are most important would help me to set up my project proposal.

• PDF files that open with Acrobat and with OS X Preview are being declared not well-formed or not valid because JHOVE is encountering one kind of PDF object when it expects another. Figuring this out may require some digging into the specs and implementation notes. Bug reports which I’ve added myself on SourceForge include “Casting exceptions in PDF module,” “Problem with PDF annotation dictionaries,” and “PDF module doesn’t recognize all encryption algorithms.”

• The PDF module recognizes PDF/X through version 3 but not 4. Adding a profile for PDF/X-4 looks simple.

• The PDF module recognizes PDF/A-1 but not 2 and 3. This could be a significant amount of work, possibly less if I incorporate a third-party library.

• The PDF module doesn’t recognize encryption algorithm 5, although it’s been around since PDF 1.5. The fix for this is easy.

• In general, PDF module error messages are less useful than they should be. More specific information and better logging would help in diagnosing any issues.

• The optional document requirements dictionary is a new feature in 1.7. This gives information on what an application needs to do to process the document correctly, which sounds like a useful preservation feature. Reporting a requirements property would be good.

• “Portable collections” of embedded files are a new feature in PDF 1.7. This sounds like a property worth reporting.

• AIFF files created by iTunes, and perhaps by other software, use the “ID3” chunk for metadata. The AIFF module knows nothing about it. Parsing this chunk and reporting the metadata might be useful. I’ve already written code that does this for another project.

• The UTF-8 module is at Unicode 6.0, and Unicode is at 6.2. Updating this is straightforward grunt work.

• There’s been a request to check if TIFF files share storage between tags, which is not allowed by the spec. This most often happens by design when two values, such as XResolution and YResolution, are known to have the same value, but could also indicate a corrupted file.

• There has been some discussion of checking whether TIFF tile and strip content goes outside file boundaries. There are limits on how rigorously that can be done, since lengths depend on compression methods that are outside JHOVE’s scope, but some idiot-proofing is possible.
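
The two TIFF checks in the list above (shared storage between tags, and content running past the end of the file) both reduce to comparing declared byte ranges. Here is a minimal sketch under that assumption; the class, the `Range` type, and the message wording are hypothetical and not JHOVE code. It looks only at declared offsets and lengths, so it says nothing about whether compressed data decodes to the expected size.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of range checks on declared offsets and lengths.
class RangeCheckSketch {
    /** A half-open byte range [start, end) claimed by one tag value or strip. */
    static final class Range {
        final String label;
        final long start;
        final long end;

        Range(String label, long start, long length) {
            this.label = label;
            this.start = start;
            this.end = start + length;
        }
    }

    /** Reports ranges that overlap an earlier range or fall outside the file. */
    static List<String> check(List<Range> ranges, long fileLength) {
        List<String> problems = new ArrayList<>();
        List<Range> sorted = new ArrayList<>(ranges);
        sorted.sort(Comparator.comparingLong(r -> r.start));
        Range widest = null; // range seen so far that reaches furthest
        for (Range r : sorted) {
            if (r.start < 0 || r.end > fileLength) {
                problems.add(r.label + " extends outside the file");
            }
            if (widest != null && r.start < widest.end) {
                problems.add(r.label + " shares storage with " + widest.label);
            }
            if (widest == null || r.end > widest.end) {
                widest = r;
            }
        }
        return problems;
    }
}
```

For example, if XResolution and YResolution both point at the same 8 bytes at offset 100, the second one is reported as sharing storage with the first. Whether such a finding should be an error or only an informational message is a policy choice the sketch leaves open.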

All of this together is far more than two weeks of work, of course. Which of these are most important to you? What else should be on the list?

Expanding JHOVE

There are some significant improvements I’d like to make to JHOVE, to bring it up to date and improve its availability. The most important of these is to bring the PDF module up to version 1.7 (ISO 32000). I’ve done two releases since leaving Harvard, and download figures and feedback show there’s still significant interest. I’ve done that much to enhance my reputation, but I need to earn a living, and the PDF upgrade would be two or three weeks of solid work, so it has to be contingent on my getting compensated.

Features which look most important for JHOVE’s usual purposes include enhancements to Tagged PDF, Unicode file name references, new markup features, and dictionaries which support 3D artwork. I’m guessing there’s also interest in supporting PDF/A-2 and 3.

There’s probably no one institution right now willing to pay for the effort, but if it were possible to get a few hundred dollars from each of several institutions, it could work. One thought, of course, is Kickstarter, but I don’t know if institutional money can be funneled that way. Maybe it can and I just don’t know it. Alternatively, I can write application letters to the appropriate places, saying that I’ll do it if the amount pledged exceeds a certain threshold. No doubt it would take months for this to happen, but it seems possible in principle.

The idea could even be generalized to a library consortium for funding useful open source projects in return for support. Yes, I’m obviously thinking of how I can make money and I’m not apologizing for it. But the idea really could be useful. The SQLite consortium is a similar approach, focused on a single product.

Does anyone know of similar funding models that have worked, or alternative approaches that would achieve the result? Does the idea make sense or am I just blowing hot air?

JHOVE 1.8

I hadn’t heard any bug reports since 1.8 beta, which hopefully means it’s working smoothly for everyone, so I’ve now released JHOVE 1.8. Let me know ASAP if anything’s broken.

Release notes:

GENERAL

1. If JHOVE doesn’t find a configuration file, it creates a default one.

2. Generics widely added to clean up the code.

3. build.xml files fixed to force compilation to Java 1.5.

4. Shell script “jhove” no longer makes you figure out where JAVA_HOME is.

PDF MODULE

1. Several errors in checking for PDF/A compliance were corrected. Aside from fixing some outright bugs, the Contents key for non-text Annotations is no longer checked, as its presence is only recommended and not required.

2. Improved code by Hökan Svenson is now used for finding the trailer.

TIFF MODULE

1. TIFF tag 700 (XMP) now accepts field type 7 (UNDEFINED) as well as 1 (BYTE), on the basis of Adobe’s XMP spec, part 3.

2. If compression scheme 6 is used in a file, an InfoMessage will report that the file uses deprecated compression.
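
The two TIFF changes above amount to small rule checks. A sketch of what such rules might look like follows; the class, method names, and message text are hypothetical and do not reflect JHOVE’s internal structure.

```java
// Hypothetical sketch of the two TIFF rules; not JHOVE's actual code.
class TiffRuleSketch {
    static final int TAG_XMP = 700;
    static final int TYPE_BYTE = 1;
    static final int TYPE_UNDEFINED = 7;
    static final int COMPRESSION_OLD_JPEG = 6; // old-style JPEG, deprecated

    /** For tag 700, accepts BYTE or UNDEFINED; other tags are not checked here. */
    static boolean isAcceptableType(int tag, int fieldType) {
        if (tag == TAG_XMP) {
            return fieldType == TYPE_BYTE || fieldType == TYPE_UNDEFINED;
        }
        return true; // this sketch only covers the XMP tag
    }

    /** Returns an informational note for scheme 6, or null for other schemes. */
    static String compressionNote(int scheme) {
        return scheme == COMPRESSION_OLD_JPEG
                ? "File uses deprecated compression"
                : null;
    }
}
```

Returning a note instead of failing validation matches the release note’s description of an informational message: the file is still processed, and the deprecated scheme is only flagged.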

WAVE MODULE

1. The Originator Reference property, found in the Broadcast Wave Extension (BEXT) chunk, is now reported.

JHOVE 1.8b2

Oops… The Java 7 compiler on Ubuntu won’t build backwards-compatible classes, so JHOVE 1.8b1 wouldn’t run on earlier versions of Java. JHOVE 1.8b2 should fix the problem.

JHOVE 1.8 beta

A beta version of JHOVE 1.8 is now available for testing. Please report any problems. New stuff:

  • If JHOVE doesn’t find a configuration file, it creates a default one.
  • Generics widely added to clean up the code.
  • Several errors in checking for PDF/A compliance were corrected. Aside from
    fixing some outright bugs, the Contents key for non-text Annotations is
    no longer checked, as its presence is only recommended and not required.
  • Improved code by Hökan Svenson is now used for finding the trailer.
  • TIFF tag 700 (XMP) now accepts field type 7 (UNDEFINED) as well as 1
    (BYTE), on the basis of Adobe’s XMP spec, part 3.
  • If compression scheme 6 is used in a file, an InfoMessage will report
    that the file uses deprecated compression.
  • In WAVE files the Originator Reference property, found in the Broadcast Wave Extension
    (BEXT) chunk, is now reported.

JHOVE format notes

New on my business website: JHOVE format notes.