Optimizing FITS

January’s mostly over, and I’ve only posted three times to this blog. Files that Last has been keeping me busy. My posting should pick up again before long, once I get a draft out to first readers.

With an eye to the upcoming SPRUCE Hackathon, I’ve been looking at what can be done with FITS, and I’ve written up the results of some profiling experiments and quick attempts at optimization. FITS puts together a lot of tools for extracting file metadata, but there have been some complaints that it’s not as fast as it might be. The first results were surprising: the easiest way to get a small improvement was to factor out the initialization of namespace URIs for parsing XML, so that it happens once rather than repeatedly. You wouldn’t think that would make any detectable difference, but the initialization of URIs in Xerces is surprisingly slow.
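Here is a minimal sketch of that kind of change, assuming the cost comes from building a namespace object on every parse. It is not the actual FITS code: the class names, method names, and the example.org URIs are hypothetical, and java.net.URI stands in for whatever object the XML library really constructs.

```java
import java.net.URI;

/** Hypothetical holder: each namespace URI is built once, when the class loads. */
final class XmlNamespaces {
    // Placeholder URIs for illustration only, not the real FITS namespaces.
    static final URI OUTPUT_NS = URI.create("http://example.org/ns/tool-output");
    static final URI SCHEMA_INSTANCE_NS = URI.create("http://www.w3.org/2001/XMLSchema-instance");

    private XmlNamespaces() { }
}

/** Hypothetical parser helper showing the before and after versions. */
final class OutputParser {
    /** Before: a new URI object is constructed on every call. */
    static URI namespaceBefore() {
        return URI.create("http://example.org/ns/tool-output");
    }

    /** After: every call reuses the constant that was built once. */
    static URI namespaceAfter() {
        return XmlNamespaces.OUTPUT_NS;
    }
}
```

Both methods return equal values, so callers don’t change; the only difference is how often the construction work is done.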

Another possibility to explore is improving the connection between FITS and JHOVE. Even though JHOVE is intended for use as a callable library, among other things, it’s designed to write to an output file. Some simple changes would let it provide an in-memory response without writing a file, which would be more useful to an application like FITS.
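As a rough illustration of the idea, and not a description of JHOVE’s real API, a reporting routine that accepts any java.io.Writer lets the caller decide whether the result goes to a file or stays in memory. The class, the method names, and the report format below are all hypothetical.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;

/** Hypothetical reporter; it does not reproduce JHOVE's classes or output format. */
final class ReportWriter {
    /** Writes the report to any Writer, so the caller chooses file or memory. */
    static void writeReport(String fileName, String status, Writer out) throws IOException {
        out.write("<report file=\"" + fileName + "\" status=\"" + status + "\"/>");
        out.flush();
    }

    /** Convenience wrapper: returns the report as a String, with no file on disk. */
    static String reportAsString(String fileName, String status) {
        StringWriter buffer = new StringWriter();
        try {
            writeReport(fileName, status, buffer);
        } catch (IOException e) {
            // StringWriter performs no real I/O, so this is not expected to happen.
            throw new IllegalStateException(e);
        }
        return buffer.toString();
    }
}
```

An embedding application could call reportAsString directly, while a command-line tool could still pass a FileWriter to writeReport.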

A file format wiki

Last November Jason Scott and Dan Tobias led a one-month intensive “Just Solve the Problem” group effort, bringing in numerous people in the digital preservation world, to crowdsource information about file formats. By the end of the month there was a lot of information, but of course only so much can be done in a short time. After November, updates went largely, but not completely, quiet.

This wiki has now become a permanent one, with a new URL. Here’s the announcement.

In a recent article in the Code4Lib Journal, I discussed the shortcomings of past approaches to building a file format registry. GDFR and UDFR were funded for a limited amount of time and had very ambitious designs, and they weren’t able to keep going. PRONOM has been more successful but also has trouble keeping up. The archiveteam.org format wiki uses existing tools and dispenses with formal structuring beyond what a wiki provides, and it could prove more viable in the long run. It’s also uneven and perhaps always will be, but it can keep improving as long as there are contributors.

“Digital forensics”

Now and then I see talk about “digital forensics.” It’s never clear what it’s supposed to mean. “Forensic” means “belonging to, used in, or suitable to courts of judicature or to public discussion and debate.” In popular usage, it’s generally applied to criminal investigations, especially in the phrase “forensic medicine.”

Some activities could be called digital forensics, where digital methods help to resolve contentious issues. For instance, textual analysis might shed light on an author’s identity. Digital techniques can even solve crimes. Too often, though, the term is getting stretched beyond meaningfulness, to the point that routine curation practices are called “forensics.”

No doubt it feels glamorous to think of oneself as the CSI of libraries, but let’s not get carried away with buzzwords.

New E-booklet: JHOVE Tips for Developers

My new E-booklet, JHOVE Tips for Developers, is now for sale on Smashwords.com. This was in part a trial run for publishing Files that Last, but anyone who integrates JHOVE with other software will find it useful. The chapters are:

  1. JHOVE Basics: A readable guide to installing, configuring, and running JHOVE, with information about each of the modules.
  2. The JHOVE API: Necessary information for integrating the JHOVE JAR into an application.
  3. Custom output: How to create a new output handler, for producing output in a different format or for better integration with an embedding application.
  4. Modules: Some supplemental information to the online tutorial on writing a module.

It’s a “name your own price” book. If you work with JHOVE and will have a use for the booklet, or if you just want to support JHOVE development, I hope you’ll buy it and pay a price you consider reasonable.

A preservation hazard in OpenOffice

While playing with OpenOffice in my research for Files that Last, I came across a preservation risk. I copied an image from a website and pasted it into a text document, then looked at the resulting XML. The image data wasn’t anywhere in content.xml or anywhere else in the overall ZIP document. Instead, I found this:


<draw:image
xlink:href="http://plan-b-for-openoffice.org/resources/images/x180x60_3_get.png.pagespeed.ic.fjV0teeVb_.png"
xlink:type="simple"
xlink:show="embed"
xlink:actuate="onLoad"/>

The source for the image is on the Web. This means that if the URL stops working, the document loses the image. That’s a poor plan for long-term storage.

The way to avoid this is to use Edit > Paste special and paste the image as a bitmap. It can be a pain to remember to do this. You may be able to catch images that are pasted by reference, since there can be a brief delay while just a box with the URL is displayed before the image comes up.
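Another way to catch these after the fact is to scan content.xml for images whose link points outside the document. The following is a minimal sketch under some assumptions: the caller has already extracted content.xml as a UTF-8 string, only http and https links count as external, and the class and method names are hypothetical. The two namespace URIs are the standard ones for the ODF draw prefix and for XLink.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical checker for images that are linked rather than embedded. */
final class LinkedImageFinder {
    static final String DRAW_NS = "urn:oasis:names:tc:opendocument:xmlns:drawing:1.0";
    static final String XLINK_NS = "http://www.w3.org/1999/xlink";

    /** Returns the href of every draw:image that points to a web URL. */
    static List<String> findRemoteImages(String contentXml) {
        Document doc;
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed for the *NS lookups below
            doc = factory.newDocumentBuilder().parse(
                    new ByteArrayInputStream(contentXml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("content.xml could not be parsed", e);
        }
        List<String> remote = new ArrayList<>();
        NodeList images = doc.getElementsByTagNameNS(DRAW_NS, "image");
        for (int i = 0; i < images.getLength(); i++) {
            // getAttributeNS returns an empty string when the attribute is absent.
            String href = ((Element) images.item(i)).getAttributeNS(XLINK_NS, "href");
            if (href.startsWith("http://") || href.startsWith("https://")) {
                remote.add(href);
            }
        }
        return remote;
    }
}
```

An embedded image normally has an href that is a path inside the package, so it is not reported. The content.xml entry itself can be read from the document with java.util.zip.ZipFile.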

Sneaky little preservation hazards like this (and the earlier one mentioned with Adobe Illustrator files) are the kind of thing you’ll find when Files that Last comes out.

JHOVE Tips for Developers: Call for proofreaders

As a practice run for publishing Files that Last on Smashwords, I’ve put together a small but hopefully useful e-booklet, JHOVE Tips for Developers, which I’m planning to put up there on a “choose your own price” basis. This will help me work out the process of creating the book on a small scale, and maybe it will buy me a Whopper and fries.

For a book of this sort I obviously can’t afford paid proofreading, but I’m hoping one or two people might give it a looking over before I submit the book. You can get the draft as a PDF here.

I’d offer you a free copy in return, but you can get that anyway. What I can do is offer people who give useful feedback credit in the book, as well as my personal thanks.

When is a PDF not a PDF?

Yesterday I was doing some experiments with Adobe Illustrator. According to some web sites, the CS5 version saves its files as PDF, though with the extension .AI. When you save a file, though, the options dialog has a checkbox labeled “Create PDF Compatible File.” I unchecked it and saved the file, then opened it in JHOVE. JHOVE says it’s perfectly good PDF, and indeed PDF/A. Then I tried opening it in Preview, and this is what it looked like:

[Screenshot: the file says over and over that it was saved without PDF content.]

If you don’t actually look at the file but trust the mere fact that it’s a PDF, you might put it into a repository and find out later on that it’s worthless as a PDF. What’s happening is that PDF can embed any kind of content, and this one embeds its native PGF data. Any PDF reader can open the file, but only an application that understands PGF can use its actual content. Anyone putting PDF into a repository should be aware of this risk.

It’s outside the scope of JHOVE to check whether embedded content is acceptable to PDF/A, so the claim that it’s correct PDF/A is probably spurious. It is, however, definitely legal PDF.

This type of situation helps to show why PDF/A-3 is a bad idea.

JHOVE 1.9

I’ve put up JHOVE 1.9 on the SourceForge site today. I think it’s the
least buggy version ever. Please let me know if I’m wrong.

Release notes:

GENERAL

  1. Jhove.java and JhoveView.java now get their version information from
    JhoveBase.java. Before it was redundantly kept in three places, and
    sometimes they didn’t all get updated for a new release. Like in 1.8.
  2. ConfigWriter was in the package edu.harvard.hul.ois.jhove.viewer, which
    caused a NoClassDefFoundError if non-GUI configurations didn’t include
    JhoveViewer.jar in the classpath. It’s been moved to
    edu.harvard.hul.ois.jhove.
  3. Added script packagejhove.sh and made md5.pl part of the CVS repository
    to make packaging for delivery easier.
  4. jhove.bat now simply uses the Java command rather than requiring
    the user to set up the Java path.
  5. JhoveView.jar and jhove (the top level shell script) are now forced
    by ant to be executable so there are no mistakes.
  6. Warning message given on invalid buffer size string, and minimum
    buffer size is 1024.
  7. Configuration file code for adding handlers and giving init strings
    to modules was an awful mess that never could have worked. Major repairs done.

AIFF MODULE

  1. If an AIFF file was found to be little-endian, the module instance
    would stay in little-endian mode for all subsequent files. This
    has been fixed.

TIFF MODULE

  1. TIFF files that had strip or tile offsets but no corresponding byte
    counts were throwing an exception all the way to the top level. Now
    they’re correctly being reported as invalid.

XML MODULE

  1. Cleaned up reporting of schemas. Added some small classes to replace
    the use of string arrays for information structures. Made URI comparison
    for local schema parameter case-independent. Resolved conflict between
    “s” and “schema” parameters.

WAVE MODULE

  1. Some uncaught exceptions caused the module to throw all the way
    back to JhoveBase and not report any result for certain defective
    files. These now report the file as not well-formed.
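
The AIFF, TIFF, and WAVE fixes above share a pattern: clear any per-file state before each file, and turn parsing exceptions into a reported result rather than letting them escape. The following is only an illustrative sketch of that pattern. The class, its toy header rule, and the status strings are hypothetical and are not taken from JHOVE’s source.

```java
/** Hypothetical module; the names and the toy header rule are not JHOVE's. */
final class SketchModule {
    private boolean littleEndian;

    /** Checks one file and always returns a status instead of throwing. */
    String check(byte[] data) {
        littleEndian = false; // reset per-file state so it cannot leak to the next file
        try {
            parse(data);
            return "Well-Formed";
        } catch (RuntimeException e) {
            // A defective file becomes a reported result, not an uncaught exception.
            return "Not well-formed: " + e.getMessage();
        }
    }

    private void parse(byte[] data) {
        if (data.length < 4) {
            throw new IllegalArgumentException("file too short for a header");
        }
        // Toy rule for the sketch: a leading 'L' byte marks a little-endian file.
        if (data[0] == 'L') {
            littleEndian = true;
        }
    }

    boolean isLittleEndian() {
        return littleEndian;
    }
}
```

Without the reset at the top of check, a little-endian file would leave the instance in little-endian mode for every file that followed.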

Digital preservation song

My daily update on the Files that Last blog includes a new song about digital preservation. It’s to promote my Kickstarter campaign for Files that Last and shares the book’s title, but you might find it fun in its own right. Naturally there’s a WAVE file in addition to the MP3. Links are appreciated.

Kickstarter launch: Files That Last

It’s started! Today I’m launching a Kickstarter campaign to help fund the completion and publication of my e-book, Files That Last. Rather than repeat everything I’ve said on the Kickstarter page and the homepage for the book, I’ll say just enough to convince you, as someone who cares about formats and digital preservation, that it’s worth looking at those pages and considering helping to fund the book and spread the word.

So far there isn’t, as far as I know, a book to promote and explain digital preservation to people who understand computers but aren’t part of the library and archiving world. That’s where I’m aiming this book. If you look at the Library of Congress’s personal archiving pages, that gives you some idea of what I’m aiming at, though I’m also addressing nonprofit organizations and businesses. It’s not a book for programmers, but it will have enough technical detail to give an understanding of how formats, metadata, and media affect the longevity of files and how to make best use of them.

If you pledge $10, you’ll get an electronic copy of the book when it’s done (DRM-free, naturally). For just $100, you can use it as a classroom text and distribute it to up to 50 students!

If you want brief, regular updates on the project, add this URL to your RSS reader.

I’m counting on your support to help make this happen, whether you pledge money, spread the word, or both. I’m excited about getting the book out, and I think you will be too when you see it.