Category Archives: News

JHOVE app for OS X

I’ve packaged up JHOVEView 1.9 as an OS X application. It’s the same as the regular JHOVEView, except that it’s a little prettier. You can download it on SourceForge as JHOVEView_OSX.zip.

Optimizing FITS

January’s mostly over, and I’ve only posted three times to this blog. Files that Last has been keeping me busy. My posting should pick up again before long, once I get a draft out to first readers.

One thing I’ve been looking at, with an eye to the upcoming SPRUCE Hackathon, is what can be done with FITS. I’ve written up the results of some profiling experiments and quick attempts at optimization. FITS puts together a lot of tools for extracting file metadata, but there have been some complaints that it’s not as fast as it might be. The first results were surprising: the easiest way to get a small improvement was to factor out the initialization of namespace URIs for parsing XML. You wouldn’t think that would make any detectable difference, but the initialization of URIs in Xerces is surprisingly slow.
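
To give a rough idea of what "factoring out the initialization" means, here is a minimal Java sketch. It is not the FITS code: java.net.URI stands in for whatever Xerces object is actually involved, and the class name, method names, and namespace string are all hypothetical.

```java
import java.net.URI;
import java.util.List;

/** Sketch: build a namespace URI object once instead of once per document. */
final class NamespaceConstants {
    // Hypothetical namespace; created a single time when the class is initialized.
    static final URI FITS_NS = URI.create("http://example.org/ns/fits-output");

    private NamespaceConstants() {}

    /** Slow pattern: the expected URI is re-created on every loop iteration. */
    static int countMatchesSlow(List<String> documentNamespaces) {
        int matches = 0;
        for (String ns : documentNamespaces) {
            URI expected = URI.create("http://example.org/ns/fits-output");
            if (expected.equals(URI.create(ns))) {
                matches++;
            }
        }
        return matches;
    }

    /** Faster pattern: the expected URI is the shared constant. */
    static int countMatchesFast(List<String> documentNamespaces) {
        int matches = 0;
        for (String ns : documentNamespaces) {
            if (FITS_NS.equals(URI.create(ns))) {
                matches++;
            }
        }
        return matches;
    }
}
```

Both methods return the same count; the only difference is how often the expected URI gets constructed.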

Another possibility to explore is improving the connection between FITS and JHOVE. Even though JHOVE is intended for use as a callable library, among other things, it’s designed to write to an output file. Some simple changes would let it provide an in-memory response without writing a file, which would be more useful to an application like FITS.
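
The general idea can be sketched in a few lines of Java. This does not use the JHOVE API; writeReport is a hypothetical stand-in for an output handler. The point is only that code which writes to a PrintWriter can be pointed at memory as easily as at a file.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

/** Sketch: a report routine that writes to any PrintWriter, so the caller picks file or memory. */
final class InMemoryReport {
    private InMemoryReport() {}

    /** Hypothetical stand-in for a tool's output handler. */
    static void writeReport(String fileName, String status, PrintWriter out) {
        out.println("File: " + fileName);
        out.println("Status: " + status);
        out.flush();
    }

    /** Captures the report in a String instead of writing a file. */
    static String reportAsString(String fileName, String status) {
        StringWriter buffer = new StringWriter();
        writeReport(fileName, status, new PrintWriter(buffer));
        return buffer.toString();
    }
}
```

An embedding application could then parse or forward the returned string directly, with no temporary file to create and clean up.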

A file format wiki

Last November Jason Scott and Dan Tobias led a one-month intensive “Just Solve the Problem” group effort, bringing in numerous people in the digital preservation world, to crowdsource information about file formats. By the end of the month there was a lot of information, but of course only so much can be done in a short time. After November, updates went largely, but not completely, quiet.

This wiki has now become a permanent one, with a new URL. Here’s the announcement.

In a recent article in the Code4Lib Journal, I discussed the shortcomings of past approaches to building a file format registry. GDFR and UDFR were funded for a limited amount of time and had very ambitious designs, and they weren’t able to keep going. PRONOM has been more successful but also has trouble keeping up. The archiveteam.org format wiki uses existing tools and dispenses with formal structuring beyond what a wiki provides, and it could prove more viable in the long run. It’s also uneven and perhaps always will be, but it can keep improving as long as there are contributors.

New E-booklet: JHOVE Tips for Developers

My new E-booklet, JHOVE Tips for Developers, is now for sale on Smashwords.com. This was in part a trial run for publishing Files that Last, but anyone who integrates JHOVE with other software will find it useful. The chapters are:

  1. JHOVE Basics: A readable guide to installing, configuring, and running JHOVE, with information about each of the modules.
  2. The JHOVE API: Necessary information for integrating the JHOVE JAR into an application.
  3. Custom output: How to create a new output handler, for producing output in a different format or for better integration with an embedding application.
  4. Modules: Some supplemental information to the online tutorial on writing a module.

It’s a “name your own price” book. If you work with JHOVE and will have a use for the booklet, or if you just want to support JHOVE development, I hope you’ll buy it and pay a price you consider reasonable.

JHOVE 1.9

I’ve put up JHOVE 1.9 on the SourceForge site today. I think it’s the
least buggy version ever. Please let me know if I’m wrong.

Release notes:

GENERAL

  1. Jhove.java and JhoveView.java now get their version information from
    JhoveBase.java. Before it was redundantly kept in three places, and
    sometimes they didn’t all get updated for a new release. Like in 1.8.
  2. ConfigWriter was in the package edu.harvard.hul.ois.jhove.viewer, which
    caused a NoClassDefFoundError if non-GUI configurations didn’t include
    JhoveViewer.jar in the classpath. It’s been moved to
    edu.harvard.hul.ois.jhove.
  3. Added script packagejhove.sh and made md5.pl part of the CVS repository
    to make packaging for delivery easier.
  4. jhove.bat now simply uses the Java command rather than requiring
    the user to set up the Java path.
  5. JhoveView.jar and jhove (the top level shell script) are now forced
    by ant to be executable so there are no mistakes.
  6. Warning message given on invalid buffer size string, and minimum
    buffer size is 1024.
  7. Configuration file code for adding handlers and giving init strings
    to modules was an awful mess that never could have worked. Major repairs done.
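
As an illustration of item 6, a buffer size check along these lines might look like the following Java sketch. The class and method names are hypothetical, and falling back to a caller-supplied default on an invalid string is an assumption; the note above only says that a warning is given.

```java
/** Sketch: parse a buffer size option, warn on bad input, and enforce a minimum. */
final class BufferSizeOption {
    static final int MIN_BUFFER_SIZE = 1024;

    private BufferSizeOption() {}

    /**
     * Returns the parsed size, never less than MIN_BUFFER_SIZE.
     * The text argument is assumed to be non-null.
     */
    static int parse(String text, int fallback) {
        int size;
        try {
            size = Integer.parseInt(text.trim());
        } catch (NumberFormatException e) {
            // Assumed behavior: warn and use the fallback value.
            System.err.println("Warning: invalid buffer size '" + text + "', using " + fallback);
            size = fallback;
        }
        return Math.max(size, MIN_BUFFER_SIZE);
    }
}
```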

AIFF MODULE

  1. If an AIFF file was found to be little-endian, the module instance
    would stay in little-endian mode for all subsequent files. This
    has been fixed.
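
This kind of bug is easy to reproduce in any parser object that is reused across files. The Java sketch below is not the AIFF module's code; the class is hypothetical, and the boolean argument stands in for however the module detects little-endian data.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Sketch: a reusable parser that resets its byte-order state for every file. */
final class EndianAwareParser {
    private ByteOrder order = ByteOrder.BIG_ENDIAN;

    /** Reads the first 32-bit integer from the data (at least 4 bytes expected). */
    int parse(byte[] data, boolean littleEndianDetected) {
        // Reset first. Leaving this line out makes one little-endian
        // file switch the instance to little-endian for good.
        order = ByteOrder.BIG_ENDIAN;
        if (littleEndianDetected) {
            order = ByteOrder.LITTLE_ENDIAN;
        }
        return ByteBuffer.wrap(data).order(order).getInt();
    }
}
```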

TIFF MODULE

  1. TIFF files that had strip or tile offsets but no corresponding byte
    counts were throwing an exception all the way to the top level. Now
    they’re correctly being reported as invalid.

XML MODULE

  1. Cleaned up reporting of schemas. Added some small classes to replace
    the use of string arrays for information structures. Made URI comparison
    for local schema parameter case-independent. Resolved conflict between
    “s” and “schema” parameters.

WAVE MODULE

  1. Some uncaught exceptions caused the module to throw all the way
    back to JhoveBase and not report any result for certain defective
    files. These now report the file as not well-formed.
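
The pattern behind this fix, shown here as a Java sketch rather than the WAVE module's actual code, is to catch exceptions at the module boundary and turn them into a result. The class, the enum, and the simplified "RIFF" header check are all hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Sketch: report defective input as not well-formed instead of throwing. */
final class SafeChunkCheck {
    enum Result { WELL_FORMED, NOT_WELL_FORMED }

    private SafeChunkCheck() {}

    static Result check(byte[] data) {
        try {
            ByteBuffer buf = ByteBuffer.wrap(data);
            byte[] tag = new byte[4];
            buf.get(tag);   // throws BufferUnderflowException on a truncated file
            buf.getInt();   // chunk size field; also throws if it is missing
            String id = new String(tag, StandardCharsets.US_ASCII);
            return "RIFF".equals(id) ? Result.WELL_FORMED : Result.NOT_WELL_FORMED;
        } catch (RuntimeException e) {
            // A defective file becomes a result, not a crash in the caller.
            return Result.NOT_WELL_FORMED;
        }
    }
}
```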

Kickstarter launch: Files That Last

It’s started! Today I’m launching a Kickstarter campaign to help fund the completion and publication of my e-book, Files That Last. Rather than repeat everything I’ve said on the Kickstarter page and the homepage for the book, I’ll say just enough to convince you, as someone who cares about formats and digital preservation, that it’s worth looking at those pages and considering helping to fund the book and spread the word.

So far there isn’t, as far as I know, a book to promote and explain digital preservation to people who understand computers but aren’t part of the library and archiving world. That’s where I’m aiming this book. If you look at the Library of Congress’s personal archiving pages, that gives you some idea of what I’m aiming at, though I’m also addressing nonprofit organizations and businesses. It’s not a book for programmers, but it will have enough technical detail to give an understanding of how formats, metadata, and media affect the longevity of files and how to make best use of them.

If you pledge $10, you’ll get an electronic copy of the book when it’s done (DRM-free, naturally). For just $100, you can use it as a classroom text and distribute it to up to 50 students!

If you want brief, regular updates on the project, add this URL to your RSS reader.

I’m counting on your support to help make this happen, whether you pledge money, spread the word, or both. I’m excited about getting the book out, and I think you will be too when you see it.

And … JHOVE 1.9b3

Lately I’ve been writing a user guide for JHOVE as part of an upcoming
book. This means going through all the features to see how they really
work, and this has turned up a number of bugs. Among the latest fixes
are: (1) If the AIFF module encounters a little-endian file, it
treats all subsequent files as little-endian whether they are or not.
(2) Certain errors in WAVE files throw an exception from the module
instead of reporting that the file isn’t well-formed. (3) The XML
module’s “s” and “schema” parameters conflicted, with “schema” being
treated as both, and there was a problem with schema URIs with
upper-case characters.

Version 1.9b3 should fix all of these. Hopefully I won’t find anything
else that needs fixing soon, so we can finally have a 1.9 release. But
if there are any problems with this beta, please let me know!

JHOVE 1.9b2

JHOVE 1.9b2 is up, fixing issues with the configuration file. The code for editing the configuration file from the GUI was just completely broken, but I think it’s fixed now. I can’t imagine anyone was ever trying to add init strings to modules (none of the standard ones use one anyway) or add handlers using the GUI, or someone would already have noticed. But I couldn’t stand having it not fixed, so the new build is there.

Format registry browser online

In an effort to promote interest in my format registry browser, I’ve built a Java web application out of it and put it up on Google App Engine at regbrowser.appspot.com. It lets you search PRONOM, UDFR, and the DBPedia structured summaries of format articles, by name, MIME Type, creator, and extension. It uses SPARQL Linked Data queries to obtain data.
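
For anyone unfamiliar with SPARQL, a name search could be built roughly like the Java sketch below. This is not the browser's code: the query shape is an assumption that uses only the standard rdfs:label property, the class and method names are hypothetical, and the endpoint is whatever URL the caller supplies. It only builds the request; it doesn't send it.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/** Sketch: build a SPARQL name search and the GET URL that would carry it. */
final class FormatQuery {
    private FormatQuery() {}

    /** Builds a case-insensitive label search (an assumed query shape). */
    static String byLabel(String name) {
        // Escape backslashes and quotes so the name is a safe string literal.
        String escaped = name.replace("\\", "\\\\").replace("\"", "\\\"");
        return "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
             + "SELECT ?format ?label WHERE {\n"
             + "  ?format rdfs:label ?label .\n"
             + "  FILTER(CONTAINS(LCASE(STR(?label)), LCASE(\"" + escaped + "\")))\n"
             + "} LIMIT 20";
    }

    /** The SPARQL protocol accepts the query text in a "query" GET parameter. */
    static String requestUrl(String endpoint, String query) {
        try {
            return endpoint + "?query=" + URLEncoder.encode(query, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
    }
}
```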

It’s still in a rough form; the point is to show what it can do and hopefully get some interest in putting money into further development. Obvious improvements, which I may do shortly, would include checkboxes for which repositories to search and retention of text fields when returning to the search page.

UDFR times out a lot. If you get a timeout error, trying again has a good chance of working.
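
If you're querying a flaky endpoint from your own code, the "try again" advice can be automated. Here's a minimal Java sketch with a hypothetical helper; it retries only on SocketTimeoutException and doesn't wait between attempts, which a real client probably should.

```java
import java.net.SocketTimeoutException;
import java.util.concurrent.Callable;

/** Sketch: retry a request a limited number of times when it times out. */
final class Retry {
    private Retry() {}

    static <T> T onTimeout(Callable<T> request, int maxAttempts) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        SocketTimeoutException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return request.call();
            } catch (SocketTimeoutException e) {
                last = e; // remember the timeout and try again
            }
        }
        throw last; // every attempt timed out
    }
}
```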