Monthly Archives: November 2012

The disappearing format blues

Old formats sometimes fade into obscurity and can no longer be supported, even if they come from a big company like Microsoft. Chris Rusbridge has noted that Microsoft’s Open Specifications page only goes as far back as Office 97, and that PowerPoint 4.0 files can’t be opened with today’s Microsoft Office. Tony Hey at Microsoft has replied. (Hey is vice president of Microsoft Research Connections). The response was encouraging, particularly in suggesting that Microsoft might “participate in a ‘crowd source’ project working with archivists to create a public spec of these old file formats.”

There’s usually some kind of software around that can read old formats. A search doesn’t turn up a lot; there’s something called PowerPressed, which will wrap old PowerPoint files in a .exe application. It looks as if it should run on current Windows systems, but all I know is what that page says.

The situation shows the risk of using a format that isn’t publicly documented. Today this is less of a problem. I think it’s been shown that publishing format specs doesn’t lead to cannibalization of sales by competing software; the company that created the spec is in a position to produce the best implementation. The description of PDF is fully public, and Adobe still dominates the market for PDF software. Publishing the spec has just made the pie bigger. There’s still quite a lot of software that uses unpublished proprietary specs, though, and it’s risky to count on the long-term readability of the files they produce.

Registry browser update

I’ve made some changes to the format registry browser since yesterday. Changes include a help page, ability to use the “/” (slash) character in searches (very helpful when searching MIME types), and links to the registry entries from search results (not working right for PRONOM).

I attempted to make the search fields persist through a session, but that isn’t working, even though it works on the local emulation. Hopefully I’ll figure that out.

Google App Engine is a pain to work with, even though it’s free and has a number of simplifying features. It’s good for getting a quick demo up, though.

To get started, you need to get an Eclipse plugin from Google. Then you need to create a Google web application project, which needs to be in just the structure they want. It needs to have a top-level directory called “war,” and that needs to have a file called WEB-INF/appengine-web.xml. If you’re starting a project from scratch, that’s not too heavy a requirement; other web application servers will just ignore that special file. But since I was working from an existing project, the differences were just enough that I had to create a separate Eclipse project for the Google version. Still, not too bad. The project is there and running. I don’t even need to run Ant; the plugin magically finds my classes. It also provides an emulation environment and simple uploading.

This morning I was working on a few enhancements on the main line when the Google version spontaneously rebuilt itself. The console reported:


DataNucleus Enhancer (version 3.1.0.m2) : Enhancement of classes
DataNucleus Enhancer completed with success for 0 classes. Timings : input=401 ms, enhance=0 ms, total=401 ms. Consult the log for full details
DataNucleus Enhancer completed and no classes were enhanced. Consult the log for full details

Now there were errors in Java files which are used only in the GUI version and reference AWT and Swing classes. An example: “java.awt.Dimension is not supported by Google App Engine's Java runtime environment.” Fortunately, my code is clean enough that I could fix the problem by deleting a few classes and turning one into a stub. Still, such a pain. There certainly are web applications that use AWT for offscreen drawing, and they just won’t work with Google. There have been complaints about this.

The environment is good enough for its purpose, but I wouldn’t try to do serious work with it.

Format registry browser online

In an effort to promote interest in my format registry browser, I’ve built a Java web application out of it and put it up on Google App Engine at regbrowser.appspot.com. It lets you search PRONOM, UDFR, and the DBPedia structured summaries of format articles, by name, MIME Type, creator, and extension. It uses SPARQL Linked Data queries to obtain data.
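As a rough illustration of what a SPARQL lookup by file extension could look like, here is a minimal sketch against DBpedia’s public endpoint. The class name, the use of the `dbp:extension` property, and the shape of the query are assumptions made for this example; they are not taken from the browser’s actual code.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.Locale;

class FormatQuerySketch {
    // DBpedia's public SPARQL endpoint.
    static final String ENDPOINT = "http://dbpedia.org/sparql";

    /** Builds a query for resources whose extension property contains the given text. */
    static String buildQuery(String extension) {
        // Escape backslashes and quotes so the value stays inside the string literal.
        String safe = extension.toLowerCase(Locale.ROOT)
                .replace("\\", "\\\\")
                .replace("\"", "\\\"");
        // Assumption: formats are described with dbp:extension and rdfs:label.
        return "PREFIX dbp: <http://dbpedia.org/property/>\n"
             + "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
             + "SELECT ?format ?label WHERE {\n"
             + "  ?format dbp:extension ?ext ;\n"
             + "          rdfs:label ?label .\n"
             + "  FILTER (CONTAINS(LCASE(STR(?ext)), \"" + safe + "\"))\n"
             + "  FILTER (LANG(?label) = \"en\")\n"
             + "} LIMIT 20";
    }

    /** Wraps the query in a GET request URL for the endpoint. */
    static String buildRequestUrl(String extension) {
        try {
            return ENDPOINT + "?query=" + URLEncoder.encode(buildQuery(extension), "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }
}
```

Calling `FormatQuerySketch.buildRequestUrl("tif")` yields a URL that can be fetched with any HTTP client; PRONOM and UDFR would need their own endpoints and vocabularies.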

It’s still in a rough form; the point is to show what it can do and hopefully get some interest in putting money into further development. Obvious improvements, which I may do shortly, would include checkboxes for which repositories to search and retention of text fields when returning to the search page.

UDFR times out a lot. If you get a timeout error, trying again has a good chance of working.

Have some JHOVE with your turkey

The trouble with being self-employed is you’re still at work on the holidays.

I’ve given up on the Kickstarter project for JHOVE, since the lack of any feedback on what features users want suggests insufficient interest, but I’m continuing with the bug fixes. There’s a new JHOVE alpha release on SourceForge, mostly to fix some silly mistakes. Please report any problems.

Here’s the new portion of the release notes:

GENERAL

  1. Jhove.java and JhoveView.java now get their version information from JhoveBase.java. Before it was redundantly kept in three places, and sometimes they didn’t all get updated for a new release. Like in 1.8.
  2. ConfigWriter was in the package edu.harvard.hul.ois.jhove.viewer, which caused a NoClassDefFoundError if non-GUI configurations didn’t include JhoveViewer.jar in the classpath. It’s been moved to edu.harvard.hul.ois.jhove.
  3. Added script packagejhove.sh and made md5.pl part of the CVS repository to make packaging for delivery easier.

TIFF MODULE

  1. TIFF files that had strip or tile offsets but no corresponding byte counts were throwing an exception all the way to the top level. Now they’re correctly being reported as invalid.

XML MODULE

  1. Cleaned up reporting of schemas. Added some small classes to replace the use of string arrays for information structures.

Preliminary JHOVE fix list

I’m working toward a Kickstarter proposal that will cover what I’ll try to make two weeks of work. That will let me set a relatively modest funding goal, which seems wise for a first project. As I look at things more closely, a PDF 1.7 upgrade as such looks like the wrong way to go; I’m not seeing PDF 1.7 features that break JHOVE. What I’m seeing, rather, is an assortment of problems in the PDF module that can break files for reasons not closely tied to their version, and some features that would be very nice to add.

Here’s a first go-around on issues with JHOVE 1.8 which I’m considering addressing. I’m open to other issues as well. Comments on which of these are most important would help me to set up my project proposal.

• PDF files that open with Acrobat and with OS X Preview are being declared not well-formed or not valid because JHOVE is encountering one kind of PDF object when it expects another. Figuring this out may require some digging into the specs and implementation notes. Bug reports which I’ve added myself on SourceForge include Casting exceptions in PDF module, Problem with PDF annotation dictionaries, and PDF module doesn’t recognize all encryption algorithms.

• The PDF module recognizes PDF/X through version 3 but not 4. Adding a profile for PDF/X-4 looks simple.

• The PDF module recognizes PDF/A-1 but not 2 and 3. This could be a significant amount of work, possibly less if I incorporate a third-party library.

• The PDF module doesn’t recognize encryption algorithm 5, although it’s been around since PDF 1.5. The fix for this is easy.

• In general, PDF module error messages are less useful than they should be. More specific information and better logging would help in diagnosing any issues.

• The optional document requirements dictionary is a new feature in 1.7. This gives information on what an application needs to do to process the document correctly, which sounds like a useful preservation feature. Reporting a requirements property would be good.

• “Portable collections” of embedded files are a new feature in PDF 1.7. This sounds like a property worth reporting.

• AIFF files created by iTunes, and perhaps by other software, use the “ID3” chunk for metadata. The AIFF module knows nothing about it. Parsing this chunk and reporting the metadata might be useful. I’ve already written code that does this for another project.

• The UTF-8 module is at Unicode 6.0, and Unicode is at 6.2. Updating this is straightforward grunt work.

• There’s been a request to check if TIFF files share storage between tags, which is not allowed by the spec. This most often happens by design when two values, such as XResolution and YResolution, are known to have the same value, but could also indicate a corrupted file.

• There has been some discussion of checking whether TIFF tile and strip content goes outside file boundaries. There are limits on how rigorously that can be done, since lengths depend on compression methods that are outside JHOVE’s scope, but some idiot-proofing is possible.
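The two TIFF checks at the end of the list come down to interval arithmetic on (offset, length) pairs. Here is a minimal sketch, assuming hypothetical names (`TiffRangeChecks`, `ByteRange`) that are not part of JHOVE. It uses only the byte counts declared in the file, since decompressed lengths are out of scope.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class TiffRangeChecks {
    /** A span of bytes claimed by a tag value, strip, or tile (hypothetical helper). */
    static final class ByteRange {
        final String owner;   // e.g. "XResolution" or "strip 0"
        final long offset;
        final long length;

        ByteRange(String owner, long offset, long length) {
            this.owner = owner;
            this.offset = offset;
            this.length = length;
        }

        long end() {
            return offset + length;
        }
    }

    /** Returns messages for ranges that leave the file or overlap an earlier range. */
    static List<String> findProblems(List<ByteRange> ranges, long fileSize) {
        List<String> problems = new ArrayList<>();
        List<ByteRange> sorted = new ArrayList<>(ranges);
        sorted.sort(Comparator.comparingLong(r -> r.offset));

        ByteRange furthest = null;  // range with the greatest end seen so far
        for (ByteRange r : sorted) {
            if (r.offset < 0 || r.length < 0 || r.end() > fileSize) {
                problems.add(r.owner + " extends outside the file");
            }
            // After sorting by offset, an overlap exists if this range
            // starts before the furthest end reached by any earlier range.
            if (furthest != null && r.length > 0 && r.offset < furthest.end()) {
                problems.add(r.owner + " shares storage with " + furthest.owner);
            }
            if (furthest == null || r.end() > furthest.end()) {
                furthest = r;
            }
        }
        return problems;
    }
}
```

Returning messages rather than throwing lets the caller decide whether shared storage is an error or only worth a warning, which matters because it often happens by design.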

All of this together is far more than two weeks of work, of course. Which of these are most important to you? What else should be on the list?

Worldwide file ID hackathon

What happens when you get a bunch of developers from all over the world together on the Internet for one day of intensive work? A lot! For one thing, there’s the “Louis Wu’s birthday” effect; this “24-hour hackathon” was more like 48 hours. (In Larry Niven’s Ringworld, Wu makes his birthday party last 48 hours by hopping from time zone to time zone with teleporters.) We didn’t have teleporters, so we made do with Twitter, IRC, and Google Hangouts. People in Australia started, and things wound down on the US west coast or maybe Hawaii.

Several things were happening, but the two most notable from my perspective were the Format Corpus project and the fork of FITS.

I watched the Format Corpus project with interest, though I didn’t participate in it. This is an openly licensed set of small example files in a wide variety of formats, as well as signature information. It could have a lot of uses; I’ll need to incorporate it into JHOVE testing.

People had been talking in advance of the hackathon about the need to improve the efficiency of FITS, a meta-tool developed by Harvard’s OIS (now LTS) to run various validation tools together on files. Internal ingest was and is the main purpose of FITS, but it was put up as open source and has been used in other places. I’d never worked on FITS proper at OIS (though I wrote parts of OTS-Schemas, which was broken out of FITS), but I’m familiar with the OIS style of coding, so I forked it on GitHub and started looking at it. When Randy Stern at Harvard expressed concerns that the fork would create confusion (though I’d put a clear disclaimer from the beginning that it wasn’t the official version), I renamed it to OpenFITS.

The work is summarized on the hackathon wiki. The results are unclear at this point, but just opening the code up to more eyes could produce long-term benefits. The very first file I tested FITS on turned up a bug in JHOVE, and I wound up doing more work improving JHOVE than FITS. One potentially significant improvement that I added was the ability to specify local copies of any XML schema. By default JHOVE has to get the schema from the Web, which slows processing down when you’re validating a lot of XML files that use the same schema. It’s necessary to do local configuration to take advantage of this, since every installation could need different schemas. The code is checked in but not available in a build yet.
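The general technique for using local schema copies can be sketched with a SAX `EntityResolver` that maps schema URIs to files on disk. This is an illustration only, not the code checked in to JHOVE. The class name `LocalSchemaResolver` and its `addLocalCopy` method are made up for the example, and whether a given parser consults the resolver for schema documents depends on the parser and how it is configured.

```java
import java.io.File;
import java.util.HashMap;
import java.util.Map;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

class LocalSchemaResolver implements EntityResolver {
    // Maps a schema URI, as it appears in the XML file, to a local copy.
    private final Map<String, File> localCopies = new HashMap<>();

    /** Registers a local file to use in place of the given schema URI. */
    void addLocalCopy(String schemaUri, File localFile) {
        localCopies.put(schemaUri, localFile);
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        File local = (systemId == null) ? null : localCopies.get(systemId);
        if (local == null || !local.isFile()) {
            // Returning null tells the parser to resolve the entity itself,
            // which normally means fetching it from the network.
            return null;
        }
        InputSource source = new InputSource(local.toURI().toString());
        source.setPublicId(publicId);
        return source;
    }
}
```

A resolver like this would be installed with `XMLReader.setEntityResolver` before parsing. The mapping has to come from local configuration, since each installation knows which schemas it sees most often.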

It was thrilling to get to work with such an enthusiastic crowd from so many different places and, in a single 48-hour day, to see other people picking up my work and running it. I think there are already two or three third-generation forks of OpenFITS, including a Debian-Ubuntu package.

Expanding JHOVE

There are some significant improvements I’d like to make to JHOVE, to bring it up to date and improve its availability. The most important of these is to bring the PDF module up to version 1.7 (ISO 32000). I’ve done two releases since leaving Harvard, and download figures and feedback show there’s still significant interest. I’ve done that much to enhance my reputation, but I need to earn a living, and the PDF upgrade would be two or three weeks of solid work, so it has to be contingent on my getting compensated.

Features which look most important for JHOVE’s usual purposes include enhancements to Tagged PDF, Unicode file name references, new markup features, and dictionaries which support 3D artwork. I’m guessing there’s also interest in supporting PDF/A-2 and 3.

There’s probably no one institution right now willing to pay for the effort, but if it were possible to get a few hundred dollars from each of several institutions, it could work. One thought, of course, is Kickstarter, but I don’t know if institutional money can be funneled that way. Maybe it can and I just don’t know it. Alternatively, I can write application letters to the appropriate places, saying that I’ll do it if the amount pledged exceeds a certain threshold. No doubt it would take months for this to happen, but it seems possible in principle.

The idea could even be generalized to a library consortium for funding useful open source projects in return for support. Yes, I’m obviously thinking of how I can make money and I’m not apologizing for it. But the idea really could be useful. The SQLite consortium is a similar approach, focused on a single product.

Does anyone know of similar funding models that have worked, or alternative approaches that would achieve the result? Does the idea make sense or am I just blowing hot air?

Notes on Friday’s Hackathon

The information on just how Friday’s CURATEcamp 24 hour worldwide file id hackathon will work has been tricky for me to find, so here’s a summary for participants who read this blog:

Twitter: Hashtag #fileidhack
IRC: Server is irc.oftc.net, channel is #openarchives

The information is on the main wiki page for the hackathon, but it’s a little hard to spot with everything else that’s there.

See some of you there!

Embracing the chaos of formats

We often think of formats in terms of specifications and standards, and this can be a useful thing. If you want to know exactly what the Latin-1 encoding is, you can look at the ISO-8859-1 standard and it will tell you. However, this isn’t always a reliable guide to what’s out there. Someone noticed that ISO-8859 reserves lots of control codes that are rarely used and put additional printing characters there. This got codified as well, as Windows 1252 (which Microsoft falsely claims as an ANSI standard), but there are many ad hoc or obscure encodings which are hard or impossible to find references for.
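To make the Latin-1 versus Windows-1252 difference concrete, here is a small sketch that decodes the same bytes both ways. The class and method names are made up for the example; the charset names are the standard ones known to the Java runtime.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingAmbiguity {
    /** Decodes the bytes with the given charset. */
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    /** True if any byte falls in 0x80-0x9F, where the two encodings disagree. */
    static boolean hasAmbiguousBytes(byte[] bytes) {
        for (byte b : bytes) {
            int value = b & 0xFF;
            if (value >= 0x80 && value <= 0x9F) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // 0x93 and 0x94 are curly quotes in Windows-1252, control codes in ISO-8859-1.
        byte[] data = { (byte) 0x93, 'H', 'i', (byte) 0x94 };
        String latin1 = decode(data, StandardCharsets.ISO_8859_1);
        String cp1252 = decode(data, Charset.forName("windows-1252"));
        System.out.printf("ISO-8859-1:   U+%04X%n", (int) latin1.charAt(0));  // U+0093
        System.out.printf("windows-1252: U+%04X%n", (int) cp1252.charAt(0));  // U+201C
    }
}
```

A file labeled Latin-1 that contains bytes in that range is very likely Windows-1252 in practice, which is exactly the kind of gap between spec and usage at issue here.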

Earth’s official authorities refused to grant the Klingons a place in Unicode for their characters; nonetheless, there is an unofficial registry that uses part of the Unicode Private Use Area for Klingon and other constructed scripts. Is it official Unicode? No. If you use code points F8D0-F8FF, will others recognize them as Klingon characters? Sometimes.
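Software can tell that a code point is in the Private Use Area, but not what it is supposed to mean. A minimal sketch, with a made-up class name and the F8D0-F8FF range hard-coded from the unofficial registry mentioned above:

```java
class PrivateUseCheck {
    /** True if Unicode classifies the code point as private use. */
    static boolean isPrivateUse(int codePoint) {
        return Character.getType(codePoint) == Character.PRIVATE_USE;
    }

    /** True if the code point is in the range the unofficial registry assigns to Klingon. */
    static boolean inUnofficialKlingonRange(int codePoint) {
        return codePoint >= 0xF8D0 && codePoint <= 0xF8FF;
    }

    /** Describes a code point the way a cautious characterization tool might. */
    static String describe(int codePoint) {
        if (!isPrivateUse(codePoint)) {
            return "standard Unicode";
        }
        // The range only suggests an interpretation; another convention
        // could use the same code points for something else entirely.
        return inUnofficialKlingonRange(codePoint)
                ? "private use, possibly Klingon"
                : "private use, meaning unknown";
    }
}
```

The word “possibly” is the point: the standard guarantees nothing about private-use code points, so any interpretation depends on context.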

I’ve written about the TIFF situation before. The TIFF 6.0 spec is an insufficient guide to today’s real-life TIFF. You have to go through scattered tech notes to understand how it’s really used.

Understanding situations like these requires understanding that formats don’t flow unchanged from the minds of their designers to their implementation in the world’s computers. People change things to meet their needs. This makes them more useful for some purposes; at the same time, it makes them more confusing. The only alternative would be to create a format police force with the power to arrest and punish innovators.

The situation is analogous to natural language. You can insist that anything that disagrees with the grammar books is wrong, but if everybody talks that way, there ain’t no stoppin’ it. At the same time, the grammar books put a brake on unnecessary change, keeping the language from breaking down into a thousand mutually unintelligible dialects.

Digital preservationists have to look at the actual usage of formats, not just their official specifications. This doesn’t mean that they should accept every deviation, but they need to acknowledge changes that have become de facto standards. Context matters; an archive of nineteenth-century literature doesn’t have to be concerned with Klingon characters, but an archive of science fiction fan literature had better take them into account. Even an occasional scholarly paper might have a word or two in the pIqaD script.

This proliferation of variants is a big part of why centralized registries of format information don’t work. Not only is there too much information, it keeps changing. The best we can hope for is a coordinated way of finding our way through a chaotic body of information.

JHOVE 1.8

I hadn’t heard any bug reports since 1.8 beta, which hopefully means it’s working smoothly for everyone, so I’ve now released JHOVE 1.8. Let me know ASAP if anything’s broken.

Release notes:

GENERAL

1. If JHOVE doesn’t find a configuration file, it creates a default one.

2. Generics widely added to clean up the code.

3. build.xml files fixed to force compilation to Java 1.5.

4. Shell script “jhove” no longer makes you figure out where JAVA_HOME is.

PDF MODULE

1. Several errors in checking for PDF/A compliance were corrected. Aside from fixing some outright bugs, the Contents key for non-text Annotations is no longer checked, as its presence is only recommended and not required.

2. Improved code by Håkan Svenson is now used for finding the trailer.

TIFF MODULE

1. TIFF tag 700 (XMP) now accepts field type 7 (UNDEFINED) as well as 1 (BYTE), on the basis of Adobe’s XMP spec, part 3.

2. If compression scheme 6 is used in a file, an InfoMessage will report that the file uses deprecated compression.

WAVE MODULE

1. The Originator Reference property, found in the Broadcast Wave Extension (BEXT) chunk, is now reported.