Monthly Archives: January 2013

JHOVE statistics

Here are a few statistics on JHOVE, taken from SourceForge. The period I checked is from January 1, 2012, through January 29, 2013.

Total downloads, all files: 3,081
Downloads for Windows: 2,160
Linux: 350
Macintosh: 294

Top 5 countries:
United States: 831
Germany: 316
Spain: 235
France: 184
Canada: 129

Releases of JHOVE since I left Harvard: 2

Total income from JHOVE since I left Harvard: $12.70 (from sales of JHOVE Tips for Developers)

Optimizing FITS

January’s mostly over, and I’ve only posted three times to this blog. Files that Last has been keeping me busy. My posting should pick up again before long, once I get a draft out to first readers.

One thing I’ve been looking at, with an eye to the upcoming SPRUCE Hackathon, is what can be done with FITS. I’ve written up the results of some profiling experiments and quick attempts at optimization. FITS puts together a lot of tools for extracting file metadata, but there have been some complaints that it’s not as fast as it might be. The first results were surprising: the easiest way to get a small improvement was to factor out the initialization of namespace URIs for parsing XML. You wouldn’t think that would make any detectable difference, but the initialization of URIs in Xerces is surprisingly slow.
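As a rough illustration of what "factoring out the initialization" means, here is a minimal sketch. It is not the actual FITS code: the class name, the method names, and the example.org namespace are all made up for illustration, and java.net.URI stands in for whatever namespace object the XML library builds. The idea is simply to build the object once and reuse it instead of rebuilding it on every call.

```java
import java.net.URI;

public class NamespaceCache {
    // Hypothetical namespace URI, used only for this illustration.
    private static final String OUTPUT_NS = "http://example.org/ns/output";

    // Built once when the class is loaded and shared by every caller.
    private static final URI CACHED_NS = URI.create(OUTPUT_NS);

    // Before: every call parses the string and builds a new object.
    static URI namespacePerCall() {
        return URI.create(OUTPUT_NS);
    }

    // After: every call returns the same pre-built object.
    static URI namespaceCached() {
        return CACHED_NS;
    }

    public static void main(String[] args) {
        URI a = namespaceCached();
        URI b = namespaceCached();
        System.out.println("same object reused: " + (a == b));
        System.out.println("equal to per-call value: " + a.equals(namespacePerCall()));
    }
}
```

Whether this helps in a given program depends on how often the code path runs; a profiler is the only reliable way to tell.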

Another possibility to explore is improving the connection between FITS and JHOVE. Even though JHOVE is intended for use as a callable library, among other things, it’s designed to write to an output file. Some simple changes would let it provide an in-memory response without writing a file, which would be more useful to an application like FITS.
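The general shape of such a change can be sketched without touching JHOVE itself. The code below is a hypothetical stand-in, not the JHOVE API: `writeReport` and `reportAsString` are invented names, and the report format is made up. It assumes only that the reporting code writes to a `java.io.Writer`, so the caller can pass a `StringWriter` to get the result in memory or a `FileWriter` to keep the old file-based behavior.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

public class InMemoryReport {
    // Hypothetical reporting routine: it writes to whatever Writer it is given,
    // so it does not care whether the destination is a file or a buffer.
    static void writeReport(String fileName, long size, Writer out) throws IOException {
        out.write("file=" + fileName + "\n");
        out.write("size=" + size + "\n");
    }

    // Convenience wrapper for an embedding application: collect the report
    // in a StringWriter and hand it back as a String, with no file involved.
    static String reportAsString(String fileName, long size) {
        StringWriter buffer = new StringWriter();
        try {
            writeReport(fileName, size, buffer);
        } catch (IOException e) {
            // StringWriter does not fail in practice; rethrow unchecked just in case.
            throw new UncheckedIOException(e);
        }
        return buffer.toString();
    }

    public static void main(String[] args) {
        System.out.print(reportAsString("example.tif", 1024L));
    }
}
```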

A file format wiki

Last November, Jason Scott and Dan Tobias led a one-month intensive “Just Solve the Problem” group effort, bringing in numerous people in the digital preservation world, to crowdsource information about file formats. By the end of the month there was a lot of information, but of course only so much can be done in a short time. After November, updates went largely, but not completely, quiet.

This wiki has now become a permanent one, with a new URL. Here’s the announcement.

In a recent article in the Code4Lib Journal, I discussed the shortcomings of past approaches to building a file format registry. GDFR and UDFR were funded for a limited amount of time and had very ambitious designs, and they weren’t able to keep going. PRONOM has been more successful but also has trouble keeping up. The archiveteam.org format wiki uses existing tools and dispenses with formal structuring beyond what a wiki provides, and it could prove more viable in the long run. It’s also uneven and perhaps always will be, but it can keep improving as long as there are contributors.

“Digital forensics”

Now and then I see talk about “digital forensics.” It’s never clear what it’s supposed to mean. “Forensic” means “belonging to, used in, or suitable to courts of judicature or to public discussion and debate.” In popular usage, it’s generally applied to criminal investigations, especially in the phrase “forensic medicine.”

Some activities could be called digital forensics, where digital methods help to resolve contentious issues. For instance, textual analysis might shed light on an author’s identity. Digital techniques can even solve crimes. Too often, though, the term is getting stretched beyond meaningfulness, to the point that routine curation practices are called “forensics.”

No doubt it feels glamorous to think of oneself as the CSI of libraries, but let’s not get carried away with buzzwords.

New E-booklet: JHOVE Tips for Developers

My new E-booklet, JHOVE Tips for Developers, is now for sale on Smashwords.com. This was in part a trial run for publishing Files that Last, but anyone who integrates JHOVE with other software will find it useful. The chapters are:

  1. JHOVE Basics: A readable guide to installing, configuring, and running JHOVE, with information about each of the modules.
  2. The JHOVE API: Necessary information for integrating the JHOVE JAR into an application.
  3. Custom output: How to create a new output handler, for producing output in a different format or for better integration with an embedding application.
  4. Modules: Some supplemental information to the online tutorial on writing a module.

It’s a “name your own price” book. If you work with JHOVE and will have a use for the booklet, or if you just want to support JHOVE development, I hope you’ll buy it and pay a price you consider reasonable.