Tag Archives: JHOVE

JHOVE 1.10b2

I’ve put up JHOVE 1.10b2. It has a bit of optimization for the PDF module, though files with huge structure trees are still painfully slow.

JHOVE 1.10b1

I’ve put up a new beta version of JHOVE, 1.10b1, on SourceForge.

The major change since last time is the handling of structure trees in PDF files; this should keep JHOVE from hanging or running out of memory on some PDF files as it used to. Please report any problems soon.

A PDF question

A while back, I posted a question on superuser.com about a PDF issue that’s causing problems in JHOVE. So far it hasn’t gotten any answers, so I’m signal-boosting my own question here. Here’s what I asked:

The JHOVE parser for PDF, which I maintain, will sometimes find a non-dictionary object in a PDF’s Annots array. According to section 8.4.1 of the PDF spec, the Annots array holds “an array of annotation dictionaries.” In the case that I’m looking at right now, there’s a keyword of “Annot” instead of a dictionary. Is this an invalid PDF file, or is there a subtlety in the spec which I’ve overlooked?

Answering on stackoverflow.com is best, so other people can see the answer, but if you prefer to answer here, I’ll post or summarize any useful response, with attribution, as an answer over there.

JHOVE app for OS X

I’ve packaged up JHOVEView 1.9 as an OS X application. It’s the same as the regular JHOVEView, except that it’s a little prettier. You can download it on SourceForge as JHOVEView_OSX.zip.

Reaching out from L-space, part 2

(This is a continuation of Reaching out from L-Space.)

Let’s look more specifically at digital preservation. This is something that should be of interest to everyone, since we all have files that we want to keep around for a long time, such as photographs. Even so, it doesn’t get wide notice as an area of study outside libraries and archives. All the existing books about it are expensive academic volumes for specialists.

Efforts are being made. The Library of Congress has digitalpreservation.gov, which has a lot of information for the ordinary user. There’s the Personal Digital Archiving Conference, which is coming up shortly.

At PDA 2012, Mike Ashenfelder said in the keynote speech:

Today in 2012, most of the world’s leading cultural institutions are engaged in digital preservation of some sort, and we’re doing quite well after a decade. We have any number of meetings throughout the year — the ECDL, the JCDL, iPres, this — but despite this decade of institutional progress, we’ve neglected the general public, and that’s everybody.

Why hasn’t there been more of an effect from these efforts? One reason may be that they’re pitched at the wrong level, either too high or too low. Technical resources often aren’t user-friendly and are useful only to specialists. The Library of Congress’s efforts are aimed largely at end users, and it’s sometimes very basic and repetitive. A big issue is picking the right level to talk to. We need to engage non-library techies and not just stay inside L-space.

Let’s narrow the focus again and look at JHOVE. It’s a software tool that was developed at Harvard; the design was Stephen Abrams’, and I wrote most of the code. It identifies file formats, validates files, and extracts metadata. Its validation is strictly by the specification. Its error messages are often mysterious, and it doesn’t generally take into account the reality of what kinds of files are accepted. Postel’s law says, “Be conservative in what you do; be liberal in what you accept from others”; but JHOVE doesn’t follow this. As a validation tool, it does need to be on the conservative side, but it may go a bit too far.

JHOVE is useful for preservation specialists, but not so much for the general user. I haven’t tried to change its purpose; it has its user base and they know what to accept of it. There should also be tools, though, for a more general user base.

JHOVE leads to the issue of open source in general. As library software developers, we should be using and creating open-source code. We need to get input from users on what we’re doing. Bram de Werf wrote on the Open Planets Foundation blog:

You will read in most digital preservation survey reports that these same tools are not meeting the needs of the community. At conferences, you will hear complaints about the performance of the tools. BUT, most strikingly, when visiting the sites where these tools are downloadable for free, you will see no signs of an active user community reporting bugs and submitting feature requests. The forums are silent. The open source code is sometimes absent and there are neither community building approaches nor procedures in place for committing code to the open source project.

Creating a community where communication happens is a challenge. Users are shy about making requests and reporting bugs. I don’t have a lot of good answers here. With JHOVE, I’ve had limited success. There was an active community for a while; users not only reported bugs but often submitted working code that I just had to test and incorporate into the release. Now there’s less of that, perhaps because JHOVE has been around for a long time. An open source community requires proactive engagement; you can’t just create a project and expect input. Large projects like Mozilla manage to get a community; for smaller niche projects it’s harder.

Actually, the term “project” is a mistake if you think of it as getting a grant, creating some software, and being done with it. Community involvement needs to be ongoing. Some projects have come out of the development process with functioning code and then immediately died for lack of a community.

Let’s consider format repositories now. An important issue in preservation is figuring out the formats of mysterious files. Repositories with information about lots of different formats are a valuable tool for doing this. The most successful of these is PRONOM, from the UK National Archives. It has a lot of valuable information but also significant holes; the job is too big for one institution to keep up with.

To address this difficulty, there was a project called GDFR — the Global Digital Format Repository. Its idea was that there would be mirrored peer repositories at multiple institutions. This was undertaken by Harvard and OCLC. It never came to a successful finish; it was a very complex design, and there were some communication issues between OCLC and Harvard developers (including me).

A subsequent effort was UDFR, the Unified Digital Format Repository. This eliminated the complications of the mirrored design and delivered a functional website. It’s not a very useful site, though, because there isn’t a lot of format information on it. It wasn’t able to develop the critically necessary community.

A different approach was a project called “Just Solve the Problem.” Rather than developing new software, it uses a wiki. It started with a one-month crowdsourced effort to put together information on as many formats as possible, with pointers to detailed technical information on other sites rather than trying to include it all in the repository. It’s hard to say for sure yet, but this may prove to be a more effective way to create a viable repository.

The basic point here is that preservation outreach needs to be at people’s own level. So what am I doing about it? Well, I have an e-book coming out in April, called Files that Last. It’s aimed at “everygeek”; it assumes more than casual computer knowledge, but not specialization on the reader’s part. It addresses the issues with a focus on practical use. But so much for my book plug.

To recap: L-space is a subspace of “Worldspace,” and we need to reach out to it. We need to engage, and engage in, user communities. Software developers for the library need to reach a broad range of people. We need to start by understanding the knowledge they already have and address them at their level, in their language. We have to help them do things their way, but better.

Future paths for JHOVE

With the next SPRUCE Hackathon coming up, I’m thinking of possible ways to improve JHOVE that I might present there. The home page says, “This hackathon will therefore focus on unifying our community’s approach to characterisation by coordinating existing toolsets and improving their capabilities.” So aside from the general goal of improving JHOVE, coordination is a key point.

I’d posted earlier on some possible enhancements. These are all still possibilities. The focus on coordination brings up other things that could be done. In general, the API hasn’t been given as much thought as the command line interface, and it could be improved without a huge amount of effort. Here are a few thoughts:

  • The API currently requires creating an output stream, such as an XML or text file. It should be possible to call JHOVE and get back an in-memory object. The RepInfo object already serves this purpose; it’s mostly a matter of writing a new method that returns it instead of writing a stream.
  • The caller has the choice of running one module or all the modules in the configuration file and can’t change their order. It might improve efficiency if the caller could specify a list indicating the modules to try and the order in which they should be applied. For instance, a caller might use DROID to get the signature and use this information to pick the module that JHOVE should run first.
  • There’s currently no provision for selecting which output items to generate, except for a few ad hoc options. Would a way to do this, eliminating items that are unwanted, be helpful?
  • Would any additional output handlers, such as JSON, be useful?

I’d welcome any thoughts on which of these, or what other changes, would help JHOVE to coordinate with other applications.

JHOVE statistics

Here are a few statistics on JHOVE, taken from SourceForge. The period I checked is from January 1, 2012, through January 29, 2013.

Total downloads, all files: 3,081
Downloads for Windows: 2,160
Linux: 350
Macintosh: 294

Top 5 countries:
United States: 831
Germany: 316
Spain: 235
France: 184
Canada: 129

Releases of JHOVE since I left Harvard: 2

Total income from JHOVE since I left Harvard: $12.70 (from sales of JHOVE Tips for Developers)

New E-booklet: JHOVE Tips for Developers

My new E-booklet, JHOVE Tips for Developers, is now for sale on Smashwords.com. This was in part a trial run for publishing Files that Last, but anyone who integrates JHOVE with other software will find it useful. The chapters are:

  1. JHOVE Basics: A readable guide to installing, configuring, and running JHOVE, with information about each of the modules.
  2. The JHOVE API: Necessary information for integrating the JHOVE JAR into an application.
  3. Custom output: How to create a new output handler, for producing output in a different format or for better integration with an embedding application.
  4. Modules: Some supplemental information to the online tutorial on writing a module.

It’s a “name your own price” book. If you work with JHOVE and will have a use for the booklet, or if you just want to support JHOVE development, I hope you’ll buy it and pay a price you consider reasonable.

JHOVE Tips for Developers: Call for proofreaders

As a practice run for publishing Files that Last on Smashwords, I’ve put together a small but hopefully useful e-booklet, JHOVE Tips for Developers, which I’m planning to put up there on a “choose your own price” basis. This will help me work out the process of creating the book on a small scale, and maybe it will buy me a Whopper and fries.

For a book of this sort I obviously can’t afford paid proofreading, but I’m hoping one or two people might give it a looking over before I submit the book. You can get the draft as a PDF here.

I’d offer you a free copy in return, but you can get that anyway. What I can do is offer people who give useful feedback credit in the book, as well as my personal thanks.

JHOVE 1.9

I’ve put up JHOVE 1.9 on the SourceForge site today. I think it’s the
least buggy version ever. Please let me know if I’m wrong.

Release notes:

GENERAL

  1. Jhove.java and JhoveView.java now get their version information from
    JhoveBase.java. Before it was redundantly kept in three places, and
    sometimes they didn’t all get updated for a new release. Like in 1.8.
  2. ConfigWriter was in the package edu.harvard.hul.ois.jhove.viewer, which
    caused a NoClassDefFoundError if non-GUI configurations didn’t include
    JhoveViewer.jar in the classpath. It’s been moved to
    edu.harvard.hul.ois.jhove.
  3. Added script packagejhove.sh and made md5.pl part of the CVS repository
    to make packaging for delivery easier.
  4. jhove.bat now simply uses the Java command rather than requiring
    the user to set up the Java path.
  5. JhoveView.jar and jhove (the top level shell script) are now forced
    by ant to be executable so there are no mistakes.
  6. Warning message given on invalid buffer size string, and minimum
    buffer size is 1024.
  7. Configuration file code for adding handlers and giving init strings
    to modules was an awful mess that never could have worked. Major repairs done.

AIFF MODULE

  1. If an AIFF file was found to be little-endian, the module instance
    would stay in little-endian mode for all subsequent files. This
    has been fixed.

TIFF MODULE

  1. TIFF files that had strip or tile offsets but no corresponding byte
    counts were throwing an exception all the way to the top level. Now
    they’re correctly being reported as invalid.

XML MODULE

  1. Cleaned up reporting of schemas, Added some small classes to replace
    the use of string arrays for information structures. Made URI comparison
    for local schema parameter case-independent. Resolved conflict between
    “s” and “schema” parameters.

WAVE MODULE

  1. Some uncaught exceptions caused the module to throw all the way
    back to JhoveBase and not report any result for certain defective
    files. These now report the file as not well-formed.