Monthly Archives: September 2012

HTML5 schedule

The HTML Working Group Chairs and the Protocols and Formats WG Chair have proposed a plan for making HTML5 a Recommendation by the end of 2014. Features would be postponed to subsequent releases as necessary.

Accomplishing this, of course, requires that the proposal be accepted first.

SPRUCE Awards: signal boost and self-promotion

Applications for SPRUCE Awards are now open.

SPRUCE will make awards of up to £5k available for further developing the practical digital preservation outcomes and/or digital preservation business cases that were begun at SPRUCE events. Applications from others may also be considered, but in that case, please discuss your proposal with SPRUCE before submission. A total fund of £60k is available for these awards, which will be allocated in a series of funding calls throughout the life of the SPRUCE Project.

The current (open) call is primarily for attendees of the SPRUCE Mashup London.

Applications must be submitted by 5 PM (GMT, I suppose) on October 10, 2012.

The self-promotion part: Awards are made to teams affiliated with institutions, but they are permitted to use outside help, since in-house developers may already be fully committed. As an independent developer with expertise in file formats and digital preservation, I’d like it known that I’m available for contract work on a SPRUCE project. My business home page describes my background and skills. Paul Wheatley has told me this is a possibility, so I’m not just coming out of the blue with this offer.

My schedule may change, of course, but if you contact me on a project I’ll keep you updated on my status, and I’ll follow through in full on any commitment I make.

Format registry browser updated

I’ve posted an updated version of my file format registry browser (Zip file). It’s still very experimental, but this one takes several steps along the long path to becoming a useful tool.

The biggest news is that, thanks to David Underdown’s input, it now talks to PRONOM. Preserv2 is probably a lost cause, since it appears no current work is being done on it and its useful results have been folded back into PRONOM. This version tries to prettify results containing URIs and “@en” tags; if you don’t like that, you can turn it off with a check box. The search fields have changed, and all of them now do something with all registries. The logging level can now be controlled from the config file (src/com/mcgath/regbrowser/config.properties). In some future version I’ll use a less buried config file.
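For the curious, here’s a minimal sketch in Java of how a properties-based logging setting can be read and applied. The key name logLevel and the logger name are my illustrative assumptions, not necessarily what the browser actually uses:

```java
import java.io.InputStream;
import java.util.Properties;
import java.util.logging.Level;
import java.util.logging.Logger;

// Illustrative sketch only: the key name "logLevel" and the logger name
// are assumptions, not the browser's actual configuration keys.
public class LogConfigDemo {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        InputStream in = LogConfigDemo.class.getResourceAsStream(
                "/com/mcgath/regbrowser/config.properties");
        if (in != null) {
            try {
                props.load(in);
            } finally {
                in.close();
            }
        }
        // Fall back to INFO if the property is missing.
        Level level = Level.parse(props.getProperty("logLevel", "INFO"));
        Logger.getLogger("com.mcgath.regbrowser").setLevel(level);
        System.out.println("Logging level: " + level);
    }
}
```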

Here’s my post on the first release.

Administrative note

Due to high levels of spam, I’ve changed the comment settings so that you’re required to give a name and email address or be logged in. WordPress’s filters are very good about catching the spam, but I still have to empty it out, and the “lista de email” spammer has been pouring huge amounts of junk into my spam box. Hopefully this won’t inconvenience any legitimate commenters and will inconvenience spammers.

I won’t do anything with your address, and as far as I know WordPress doesn’t either.

Format conformity

By design JHOVE measures strict conformity to file format specifications. I’ve never been convinced this is the best way to measure a file’s viability or even correctness, but it’s what JHOVE does, and I’d just create confusion if I changed it now.

In general, the published specification is the best measure of a file’s correctness, but there are clearly exceptions, and correctness isn’t the same as viability for preservation. Let’s look at the rather extreme case of TIFF.

The current official specification of TIFF is Revision 6.0, dated June 3, 1992. The format hasn’t changed a byte in over 20 years — except that it has.

The specification says about value offsets in IFDs: “The Value is expected to begin on a word boundary; the corresponding Value Offset will thus be an even number.” This is a dead letter today. Much TIFF generation software freely writes values on any byte boundary, and just about all currently used readers accept them. JHOVE initially didn’t accept files with odd byte alignment as well-formed, but after numerous complaints it added a configuration option to allow them.
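To make the rule concrete, here’s a rough sketch of a check for odd value offsets in a TIFF’s first IFD. It assumes a little-endian (“II”) classic TIFF and omits much that a real validator must handle: big-endian files, chained IFDs, and error handling.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

// Sketch: flag odd value offsets in the first IFD of a little-endian
// classic TIFF. Real validators must also handle "MM" byte order,
// multiple IFDs, and malformed files.
public class OddOffsetCheck {

    public static void main(String[] args) throws IOException {
        RandomAccessFile f = new RandomAccessFile(args[0], "r");
        try {
            f.seek(4);                       // skip "II" and the magic 42
            long ifdOffset = readU32(f);
            f.seek(ifdOffset);
            int entries = readU16(f);
            for (int i = 0; i < entries; i++) {
                int tag = readU16(f);
                int type = readU16(f);
                long count = readU32(f);
                long value = readU32(f);
                // Values of four bytes or fewer are stored inline in the
                // entry; only larger values are actually offsets.
                if (typeSize(type) * count > 4 && value % 2 != 0) {
                    System.out.println("Tag " + tag + ": odd offset " + value);
                }
            }
        } finally {
            f.close();
        }
    }

    private static int readU16(RandomAccessFile f) throws IOException {
        int b0 = f.read(), b1 = f.read();
        return b0 | (b1 << 8);
    }

    private static long readU32(RandomAccessFile f) throws IOException {
        long b0 = f.read(), b1 = f.read(), b2 = f.read(), b3 = f.read();
        return b0 | (b1 << 8) | (b2 << 16) | (b3 << 24);
    }

    // Byte sizes of TIFF 6.0 field types 1-12 (BYTE through DOUBLE).
    private static int typeSize(int type) {
        int[] sizes = {0, 1, 1, 2, 4, 8, 1, 1, 2, 4, 8, 4, 8};
        return (type >= 1 && type <= 12) ? sizes[type] : 1;
    }
}
```

Run a check like this against the output of much current software and it will flag plenty of odd offsets, which is exactly the point.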

Over the years a body of apocrypha has grown around TIFF. Some comes from Adobe, some not. The titles of the ones from Adobe don’t clearly mark them as revisions to TIFF, but they are. The “Adobe PageMaker® 6.0 TIFF Technical Notes,” September 14, 1995, define the important concept of SubIFD, among other changes. The “Adobe Photoshop® TIFF Technical Notes,” March 22, 2002, define new tags and forms of compression. The “Adobe Photoshop® TIFF Technical Note 3,” April 8, 2005, adds new floating point types. The last one isn’t available, as far as I can tell, on Adobe’s own website, but it’s canonical.

Then there’s material that circulated without official Adobe approval. The JPEG compression defined in the 2002 tech notes, for instance, began as a 1995 draft note that was already in wide use long before Adobe officially accepted it.

What’s the best measure of a TIFF file? That it corresponds strictly to TIFF 6.0? To 6.0 plus a scattered set of tech notes? Or that it’s processed correctly by LibTiff, a freely available and very widely used C library? To answer the question, we have to specify: Best for what? If we’re talking about the best chance of preservation, what scenarios are we envisioning?

One scenario amounts to a desert-island situation in which you have a specification, some files that you need to render, and a computer. You don’t have any software to go by. In this case, conformity to the spec is what you need, but it’s a rather unlikely scenario. If all existing TIFF readers disappear, things have probably gone so far that no one will be motivated to write a new one.

It’s more likely that people a few decades in the future will scramble to find software, or entire old computers, that can read obsolete formats. That doesn’t necessarily mean today’s software, but what we can read today is a pretty good guide to what will be readable in the future. Insisting on conformity to the spec may be erring on the safe side, but if it excludes a large body of valuable files, it’s not a good choice.

Rather than insisting solely on conformity to a published standard, we should judge preservation-worthiness by balancing two risks: accepting files that will cause reading problems down the road, and rejecting files that won’t. Multiple factors come into consideration, of which the spec is just one.

Format registry browser available for download

I’ve made my experimental format registry browser available for download. It requires Java 5 or higher and a GUI environment; Ant is needed only if you want to make changes. Currently it queries DBPedia and UDFR. It’s been tested on Mac OS X and Ubuntu.

And another JHOVE build

There’s now a build of JHOVE with some more changes, incorporating new code for finding the PDF trailer and several fixes in PDF/A checking. The full build is a pain to do, so I’ve uploaded a zip file that contains just the revised bin directory.

To use it, make a copy of JHOVE 1.7 (don’t blow away your old one!) and replace the bin directory with the bin directory from the zip file. Please give feedback on any problems encountered; this is definitely not a stable release.

Test version of JHOVE

I’ve put a new test build of the GUI version of JHOVE on SourceForge. This addresses one of the most persistent problems: the configuration file. If JHOVE can’t find the expected configuration file, it now creates a default version.
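In rough outline, the fallback works something like the sketch below. The default contents here are placeholders, not JHOVE’s actual default configuration:

```java
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;

// Sketch of a "create a default config if none exists" fallback. The
// contents written here are placeholders, not JHOVE's real defaults.
public class ConfigFallback {

    static File ensureConfig(File configFile) throws IOException {
        if (!configFile.exists()) {
            FileWriter w = new FileWriter(configFile);
            try {
                w.write("<?xml version=\"1.0\"?>\n");
                w.write("<jhoveConfig>\n");
                w.write("  <!-- minimal defaults written on first run -->\n");
                w.write("</jhoveConfig>\n");
            } finally {
                w.close();
            }
        }
        return configFile;
    }
}
```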

I’ve tested this on a Mac and an Ubuntu box, but not on Windows, which is the toughest case because of its different and changing file system conventions. I’d greatly appreciate feedback on whether it works correctly on Windows, and on which Windows version you tested.

Defining the file format registry problem

My previous post on format registries, which started out as a lament on the incomplete state of UDFR, resulted in an excellent discussion. Along the way I came upon Chris Rusbridge’s post pointing out that finding a solution doesn’t do much good if you don’t know what problem you’re trying to solve. This links to a post by Paul Wheatley on the same subject. Paul links back to this blog, nicely closing the circle.

So what are we trying to do? A really complete digital format registry sounds like a great idea, but what practical problem is it trying to solve? We know it’s got something to do with digital preservation. If we have a file, we need to know what format it’s in and what we can do about it. If it’s in a well-known format such as PDF or TIFF, there’s no real problem; it’s easy enough to find out all you need to know. It’s the obscure formats that need one-stop documentation. If you find a file called “importantdata.zxcv” and a simple dump doesn’t make sense of it, you need to know where to look. You need answers to questions like: “What format is it in?” “What category of information does it contain?” “How do I extract information from this file?” “How do I convert it with as little loss as possible into a better supported format?”

I have a 1994 O’Reilly book called Encyclopedia of Graphics File Formats. If old formats are a concern of yours, I seriously suggest searching for a copy. (Update: It turns out the book is available on fileformat.info!) It covers about a hundred different formats, generally in enough detail to give you a good start at implementing a reader. There are names which are still familiar: TIFF, GIF, JPEG. Many others aren’t even memories except to a few people. DKB? FaceSaver?

With some formats the authors just admit defeat in getting information. The case of Harvard Graphics (apparently no connection to Harvard University) is particularly telling. The book tells us:

Software Publishing, the originator of the Harvard Graphics format, considers this format to be proprietary. Although we wish this were not the case, we can hardly use our standard argument — that documenting and publicizing file formats make sales by seeding the aftermarket. Harvard Graphics has been the top, or one of the top, sellers in the crowded and cutthroat MS-DOS business graphics market, and has remained so despite the lack of cooperation of Software Publishing with external developers.

While we would be happy to provide information about the format if it were available, we have failed to find any during our research for this book, so it appears that Software Publishing has so far been successful in their efforts to restrict information flow from their organization.

This was once a widely used format, so if you’re handed an archive to turn into a useful form, you might get a Harvard Graphics file. How do you recognize it as one? That isn’t obvious. A little searching reveals you can still get a free viewer for older versions of Windows, but nothing is mentioned about converting it to other formats. Even knowing there’s software available isn’t helpful till you can determine that a file is Harvard Graphics.

If you have a file — it’s Harvard Graphics, but you don’t know that — what do you want from a registry? First, you want a clue about how to recognize it. An extension or a signature, perhaps. When you get that, you want to know what kind of data the file might hold: In this case, it’s presentation graphics. Then you want to know how to rescue the data. Knowing that the viewer exists would be a start. Knowing that technical information isn’t available (if that’s still true) would save fruitless searching.

Information like this is scattered and dynamic. If the Harvard Graphics spec isn’t publicly available now, it’s still possible for its proprietors to relent and publish it. The notion of one central source of wisdom on formats is an impossibility. What’s needed is a way to find the expertise, not to compile it all in one place.

We need to concentrate not on a centralized stockpile of information but on a common language for talking about formats. PRONOM uses one ontology. UDFR uses another. DBPedia doesn’t have an applicable standard. What I envision is any number of local repositories of formats, all capable of delivering information in the same way. The ones from the big institutions would carry the most trust, and they’d often share each other’s information. Specialists would fill in the gaps by telling us about obscure formats like uRay and Inset PIX, or they’d provide updates about JPEG2000 and EPUB more regularly than the big generalists can. The job of the big institutions is to standardize the language so we aren’t overwhelmed by heterogeneous data.
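To make that concrete, “delivering information in the same way” might look, in miniature, like a shared interface that every repository implements, whether it’s a national archive or a lone specialist. Every name here is invented for illustration:

```java
import java.util.List;

// Hypothetical common interface: big institutions and specialists alike
// would answer the same queries in the same form, so results from
// different repositories could be pooled and compared.
public interface FormatRegistry {

    /** Candidate formats for a file extension such as "tif". */
    List<FormatRecord> findByExtension(String extension);

    /** Candidate formats matching a file's leading byte signature. */
    List<FormatRecord> findBySignature(byte[] leadingBytes);
}

// A deliberately minimal record: what the format is, what kind of data it
// holds, and where to look next.
class FormatRecord {
    String id;             // a registry-scoped identifier, e.g. a PRONOM-style PUID
    String name;           // e.g. "TIFF" or "Harvard Graphics"
    String category;       // e.g. "raster image" or "presentation graphics"
    List<String> specUrls; // where the technical documentation lives, if anywhere
    List<String> toolUrls; // known viewers and converters
}
```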

Let’s look again at those questions I mentioned, as they could apply to this scenario.

What format is it in? The common language needs a way to ask this question. Given a file extension, or the hex representation of the first four bytes of the file, you’d like a candidate format, and there might be more than one. You’d like to be able to search across a set of repositories for possible answers.
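As a tiny illustration of the lookup key, here’s how a client might compute the hex representation of a file’s first four bytes; a little-endian TIFF, for instance, yields 49492A00. The registry query itself is left abstract:

```java
import java.io.FileInputStream;
import java.io.IOException;

// Read a file's first four bytes and render them as hex, suitable as a
// signature key for a registry query.
public class LeadingBytes {
    public static void main(String[] args) throws IOException {
        FileInputStream in = new FileInputStream(args[0]);
        try {
            StringBuilder hex = new StringBuilder();
            for (int i = 0; i < 4; i++) {
                int b = in.read();
                if (b < 0) break;              // file shorter than four bytes
                hex.append(String.format("%02X", b));
            }
            System.out.println(hex);           // e.g. "49492A00"
        } finally {
            in.close();
        }
    }
}
```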

What category of information does it contain? When you get an answer about the format, it should tell you briefly what it’s for. If you got multiple answers in your first query, this might help to narrow it down.

How do I extract information? Now you want to get some amount of information, maybe just enough to tell you whether it’s worth pursuing the task or not. The registry will hopefully give you information on the technical spec or on available tools.

How do I convert it? When you decide that the file has valuable information but isn’t sufficiently accessible as it stands, you need to look for conversion software. A central registry has to be cautious about what it recommends. A plurality of voices can offer more options (and, to be sure, more risk).
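Putting the hypothetical interface from above to work, a client might pool candidates from several repositories before deciding which answers to trust; again, this is illustration, not a real API:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical client of the FormatRegistry interface sketched earlier:
// query every repository and pool the candidates.
public class RegistryClient {

    static List<FormatRecord> identify(List<FormatRegistry> repos,
                                       String extension, byte[] leadingBytes) {
        List<FormatRecord> candidates = new ArrayList<FormatRecord>();
        for (FormatRegistry repo : repos) {
            candidates.addAll(repo.findByExtension(extension));
            candidates.addAll(repo.findBySignature(leadingBytes));
        }
        // Duplicates across repositories would need merging by id, and
        // answers from more trusted sources could be weighted more heavily.
        return candidates;
    }
}
```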

This vision is what I’d like to call ODFR — the Open Digital Format Registry — even though it wouldn’t be a single registry at all.