In 2004, the Harvard University Libraries engaged me as a contractor to write the code for JHOVE under Stephen Abrams’ direction. I stayed around as an employee for eight more years. I mention this because I might be biased about JHOVE: I know about its bugs, how hard it is to install, what design decisions could have been better, and how spotty my support for it has been. Still, people keep downloading it, using it, and saying good things about it, so I must have done something right. Do any programmers trust the code they wrote ten years ago?
The current home of JHOVE is on GitHub under the Open Preservation Foundation, which has taken over maintenance of it from me. Documentation is on the OPF website. I urge people not to download it from SourceForge; it’s out of date there, and there have been reports of questionable practices by SourceForge’s current management. The latest version as of this writing is 1.11.
JHOVE stands for “JSTOR/Harvard Object Validation Environment,” though neither JSTOR nor Harvard is directly involved with it any longer. It identifies and validates files in a small set of formats, so it’s not a general-purpose identification tool, but it does a fairly thorough job on the formats it knows. The formats it validates are AIFF, GIF, HTML, JPEG, JPEG 2000, PDF, TIFF, WAV, XML, ASCII, and UTF-8. If it doesn’t recognize a file as any of those formats, it calls it a “Bytestream.” You can use JHOVE as a GUI or command-line application, or as a Java library. If you’re going to use the library or otherwise do complicated things, I recommend downloading my payment-optional e-book, JHOVE Tips for Developers. Installation and configuration are tricky, so follow the instructions carefully and take your time.
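To give a rough idea of command-line use (the file names below are just placeholders for whatever you want to check), a typical run names a module with -m, an output handler with -h, and an output file with -o:

```
# Validate a TIFF file with the TIFF module and write an XML report.
jhove -m TIFF-hul -h XML -o report.xml scan.tif

# With no -m, JHOVE tries each configured module in turn; a file that
# no module accepts gets reported as a Bytestream.
jhove -h XML mystery.bin
```

Leave off -o and the report goes to standard output; the default handler produces plain text rather than XML.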
JHOVE shouldn’t be confused with JHOVE2, which has similar aims but a completely different code base, API, and user interface. It didn’t get as much funding as its creators hoped, so it doesn’t cover all the formats that JHOVE does.
Key concepts in JHOVE are “well-formed” and “valid.” When it’s allowed to run all its modules, it will always report that a file is a valid instance of something; if the file isn’t anything else, it’s a valid bytestream. This has confused some people; a valid bytestream is nothing more than a sequence of zero or more bytes. Everything is a valid bytestream.
The concept of well-formed and valid files comes from XML. A well-formed XML file obeys the syntactic rules; a valid one conforms to a schema or DTD. JHOVE applies this concept to other formats, but it’s generally not as good a fit. Roughly, a file which is “well-formed but not valid” has errors, but not ones that should prevent rendering.
JHOVE doesn’t examine every aspect of a file. It doesn’t examine the data streams within files or deal with encryption; it focuses on a file’s structure rather than its content. However, it’s very aggressive about what it does examine, so it sometimes declares a file not valid even though nearly all rendering software will process it correctly. When there’s a conflict between the format specification and generally accepted practice, it usually goes by the specification.
It checks for profiles within a format, such as PDF/A and TIFF/IT. It reports only full conformance to a profile, so if a file is intended to be PDF/A but fails any of the profile’s tests, JHOVE will simply not list PDF/A as a profile. It won’t tell you why the file fell short.
The PDF module has been the biggest adventure; PDF is really complicated, and its complexity has increased with each release of the format. Bugs continue to turn up, and the module covers PDF only through version 1.6. It needs to be updated for 1.7, which is equivalent to ISO 32000.
Sorry, I warned you that I’m JHOVE’s toughest critic. But I wouldn’t mind a chance to improve it a bit, through the funding mechanism I mentioned earlier in the blog.
Next: FITS. To read this series from the beginning, start here.
> Next: I’m open to suggestions.
It’s weird that you don’t think of MediaInfo when you talk about file identification tools.
A lot of people, including several libraries, think of it when they want a file identification tool, and they actually use it. It will also be included in FITS ( http://projects.iq.harvard.edu/fits/news/video-support-fits ).
I know you don’t like it (“I’m really not very impressed” at https://fileformats.wordpress.com/2013/04/11/video-md/ ), but silently skipping it is a bit unfair…
BTW, http://www.preforma-project.eu is funding preservation software development, including validation tools focusing on TIFF/PDF/Matroska/FFV1/PCM.
I’ll consider including it in the series, but responding to a request for suggestions with a complaint that it hasn’t already been included among the first four isn’t the best way to motivate me.
In my 2013 post, I was considering MediaInfo’s metadata model as one to adopt, and I didn’t think it a good choice because of its lack of consistency. That’s a separate question from how good it may be as an identification tool.