Category Archives: commentary

Defining the file format registry problem

My previous post on format registries, which started out as a lament on the incomplete state of UDFR, resulted in an excellent discussion. Along the way I came upon Chris Rusbridge’s post pointing out that finding a solution doesn’t do much good if you don’t know what problem you’re trying to solve. His post links to one by Paul Wheatley on the same subject, and Paul links back to this blog, nicely closing the circle.

So what are we trying to do? A really complete digital format registry sounds like a great idea, but what practical problem is it trying to solve? We know it’s got something to do with digital preservation. If we have a file, we need to know what format it’s in and what we can do about it. If it’s in a well-known format such as PDF or TIFF, there’s no real problem; it’s easy enough to find out all you need to know. It’s the obscure formats that need one-stop documentation. If you find a file called “importantdata.zxcv” and a simple dump doesn’t make sense of it, you need to know where to look. You need answers to questions like: “What format is it in?” “What category of information does it contain?” “How do I extract information from this file?” “How do I convert it with as little loss as possible into a better supported format?”

I have a 1994 O’Reilly book called Encyclopedia of Graphics File Formats. If old formats are a concern of yours, I seriously suggest searching for a copy. (Update: It turns out the book is available on fileformat.info!) It covers about a hundred different formats, generally in enough detail to give you a good start at implementing a reader. There are names which are still familiar: TIFF, GIF, JPEG. Many others aren’t even memories except to a few people. DKB? FaceSaver?

With some formats the authors just admit defeat in getting information. The case of Harvard Graphics (apparently no connection to Harvard University) is particularly telling. The book tells us:

Software Publishing, the originator of the Harvard Graphics format, considers this format to be proprietary. Although we wish this were not the case, we can hardly use our standard argument — that documenting and publicizing file formats make sales by seeding the aftermarket. Harvard Graphics has been the top, or one of the top, sellers in the crowded and cutthroat MS-DOS business graphics market, and has remained so despite the lack of cooperation of Software Publishing with external developers.

While we would be happy to provide information about the format if it were available, we have failed to find any during our research for this book, so it appears that Software Publishing has so far been successful in their efforts to restrict information flow from their organization.

This was once a widely used format, so if you’re handed an archive to turn into a useful form, you might get a Harvard Graphics file. How do you recognize it as one? That isn’t obvious. A little searching reveals you can still get a free viewer for older versions of Windows, but nothing is mentioned about converting it to other formats. Even knowing there’s software available isn’t helpful till you can determine that a file is Harvard Graphics.

If you have a file — it’s Harvard Graphics, but you don’t know that — what do you want from a registry? First, you want a clue about how to recognize it. An extension or a signature, perhaps. When you get that, you want to know what kind of data the file might hold: In this case, it’s presentation graphics. Then you want to know how to rescue the data. Knowing that the viewer exists would be a start. Knowing that technical information isn’t available (if that’s still true) would save fruitless searching.

Information like this is scattered and dynamic. If the Harvard Graphics spec isn’t publicly available now, it’s still possible for its proprietors to relent and publish it. The notion of one central source of wisdom on formats is an impossibility. What’s needed is a way to find the expertise, not to compile it all in one place.

We need to concentrate not on a centralized stockpile of information but on a common language for talking about formats. PRONOM uses one ontology. UDFR uses another. DBPedia doesn’t have an applicable standard. What I envision is any number of local repositories of formats, all capable of delivering information in the same way. The ones from the big institutions would carry the most trust, and they’d often share each other’s information. Specialists would fill in the gaps by telling us about obscure formats like uRay and Inset PIX, or they’d provide updates about JPEG2000 and EPub more regularly than the big generalists can. The job of the big institutions is to standardize the language so we aren’t overwhelmed by heterogeneous data.

Let’s look again at those questions I mentioned, as they could apply to this scenario.

What format is it in? The common language needs a way to ask this question. Given a file extension, or the hex representation of the first four bytes of the file, you’d like a candidate format, and there might be more than one. You’d like to be able to search across a set of repositories for possible answers.
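To make that concrete, here is a minimal sketch of answering the “what format is it?” question from the first four bytes of a file. The class name and the signature table are my own illustrative choices, not any real registry’s data or API; a real registry would answer over the common language described above rather than from a hard-coded map.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative signature lookup: map the first bytes of a file to
// candidate format names. The entries below are a tiny sample.
public class SignatureLookup {
    private static final Map<String, String> SIGNATURES = new LinkedHashMap<>();
    static {
        SIGNATURES.put("25504446", "PDF");                  // "%PDF"
        SIGNATURES.put("49492A00", "TIFF (little-endian)"); // "II*\0"
        SIGNATURES.put("4D4D002A", "TIFF (big-endian)");    // "MM\0*"
        SIGNATURES.put("47494638", "GIF");                  // "GIF8"
    }

    /** Return every format whose signature matches the file's first bytes. */
    public static List<String> candidates(byte[] header) {
        StringBuilder hex = new StringBuilder();
        for (int i = 0; i < Math.min(4, header.length); i++) {
            hex.append(String.format("%02X", header[i] & 0xFF));
        }
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, String> e : SIGNATURES.entrySet()) {
            if (hex.toString().startsWith(e.getKey())) {
                result.add(e.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        byte[] pdfHeader = {0x25, 0x50, 0x44, 0x46};
        System.out.println(candidates(pdfHeader)); // prints [PDF]
    }
}
```

Note that the method returns a list, not a single answer: as the question above says, an extension or signature can match more than one candidate format, and it’s the registry’s other fields that narrow things down.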

What category of information does it contain? When you get an answer about the format, it should tell you briefly what it’s for. If you got multiple answers in your first query, this might help to narrow it down.

How do I extract information? Now you want to get at the data, maybe just enough of it to tell you whether the task is worth pursuing. With luck, the registry will point you to the technical spec or to available tools.

How do I convert it? When you decide that the file has valuable information but isn’t sufficiently accessible as it stands, you need to look for conversion software. A central registry has to be cautious about what it recommends. A plurality of voices can offer more options (and, to be sure, more risk).

This vision is what I’d like to call ODFR — the Open Digital Format Registry — even though it wouldn’t be a single registry at all.

The state of file format registries

Looking through UDFR is like walking through a ghost town that still shows many signs of its former promise. The UDFR Final Report (PDF) helps to explain this; it’s a very sad story of a brilliant idea that encountered tons of problems with deadlines and staffing. What’s there is hard to use and, as far as I can tell, isn’t getting used much. I don’t see any signs of recent updates.

The website is challenging for the inexperienced user, but this wouldn’t matter so much if it exposed its raw information so developers could write front ends for specific needs. Chris Prom wrote that “it is a great day for practical approaches to electronic records because all kinds of useful tools and services can and will be developed from the UDFR knowledge base.” But I just can’t see how. I wrote to Stephen Abrams a while back about problems I was encountering (including my inability to log in with Firefox; I’ve since found I can log in with Safari), and his reply gave the sense that the project team had exhausted its resources and funding just in putting the repository up on the Web.

The source code is supposed to be on GitHub, but all I can find there is four repositories: three are forks of third-party code, and the fourth holds just some OWL ontology files.

If it were possible to access the raw data by RESTful URLs, even that would be something. So far I haven’t found a way to do that.

In fairness, I have to admit I was part of the failure of UDFR’s predecessor GDFR. The scope of the project was too ambitious, and communication between the Harvard and OCLC developers was a problem.

The most successful format registry out there is PRONOM, and DROID provides programmatic access to its data. GDFR and UDFR, with “global” and “unified” in their names, both grew from a desire for a registry that everyone could participate in. PRONOM accepts contributions, but it’s owned by the UK National Archives, which bothers some people; even so, it’s the most useful registry there is. The PRONOM site itself expresses the hope that UDFR “will support the requirements of a larger digital preservation community,” and it still would be great if that could happen.

Occasionally some people have discussed the idea of an open wiki for file format information. This would allow more free-form updates than the registries, and if combined with the concept of the semantic wiki, could also be a source of formalized data. I’m inclined to believe that’s the best way to implement an open repository.

The two faces of HTML5

The question “What is HTML5?” has gotten more complicated. While W3C continues work on a full specification of HTML5, the Web Hypertext Application Technology Working Group (WHATWG) is pursuing a “living standard” approach that is frequently updated. Both groups reassure us that this doesn’t constitute a rift, but it will certainly make things tricky when the fine points of the standard(s) have to be resolved. Ian Hickson has gone into some detail on the W3C site about the relationship between the WHATWG HTML living standard and the W3C HTML5 specification.

The WHATWG “HTML Living Standard” site significantly has no version number.

Considering that HTML5 is already widely implemented even though it won’t be finalized till the year after next, it’s unlikely this will add any further confusion. By the time it becomes a W3C Recommendation, many implementers will doubtless have moved beyond it to new features.

The horrible state of Java image processing

A while back I posted on the painfully poor choices in creating thumbnails of JPEG2000 files. Since then I’ve come to realize that support for image file processing in Java is even worse than I’d thought. Now I’m trying to make thumbnails from TIFF files. At first I went with JAI, even though it hasn’t been supported for five years and relies on implementation-dependent classes. I’d done this before successfully, but now I’m trying to do it in an EJB under JBoss. This runs into a NoClassDefFoundError trying to get com.sun.image.codec.jpeg.JPEGCodec. A web search suggests there’s some obscure trick necessary to access com.sun.image, but I couldn’t figure it out. It occurred to me that for what I’m doing, javax.imageio should be sufficient to do the job. It can read an image file, standard Java classes can scale the BufferedImage it produces, and then it can write the scaled image to a file.
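That read-scale-write approach can be sketched in a few lines of standard Java. This is a minimal illustration with class and method names of my own choosing, and of course it only works for formats the installed ImageIO readers and writers actually support.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

// Sketch of the javax.imageio approach: read an image, scale the
// resulting BufferedImage with Graphics2D, and write it back out.
public class Thumbnailer {

    /** Scale an image to the given dimensions using bilinear interpolation. */
    public static BufferedImage scale(BufferedImage src, int width, int height) {
        BufferedImage dest = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = dest.createGraphics();
        g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                           RenderingHints.VALUE_INTERPOLATION_BILINEAR);
        g.drawImage(src, 0, 0, width, height, null);
        g.dispose();
        return dest;
    }

    public static void main(String[] args) {
        // In real use the source would come from ImageIO.read(new File(...))
        // and the result would go to ImageIO.write(thumb, "png", new File(...));
        // a synthetic image stands in here so the sketch is self-contained.
        BufferedImage src = new BufferedImage(600, 400, BufferedImage.TYPE_INT_RGB);
        BufferedImage thumb = scale(src, 150, 100);
        System.out.println(thumb.getWidth() + "x" + thumb.getHeight()); // prints 150x100
    }
}
```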

Only one trouble: javax.imageio knows nothing about TIFF. A search on imageio and TIFF leads to suggestions to use JAI.

Really, what kind of language is that poor in dealing with common image formats?

Scalzi on DRM

Mostly it’s technogeeks like us who get passionate about file format issues—Word vs. Open Office, Latin-1 vs. Unicode, unrestricted PDF vs. PDF/A. But when issues like digital rights management (DRM) come in, a lot more people will weigh in. This week quite a lot of attention has come to the format in which John Scalzi’s new novel, Redshirts, was issued. Scalzi wrote in his blog:

As noted in the FAQ I just put up, Redshirts is going to be released as an eBook here in the US without digital rights management software (DRM), meaning that when you buy it you can pretty much do what you want with it. Tor, my publisher, announced that all their eBooks would be released DRM-free by the end of July; I support this and asked Redshirts be released DRM-free from release date, so I think it might be the first official DRM-free release from Tor, which is in itself the first major publisher imprint to forgo DRM. In that way, Redshirts is a bit of a canary in a coal mine for major publishers.

However, some things went wrong. Several e-book sale sites issued Redshirts with DRM, against his express wishes. Tor and Macmillan quickly went after those sites, and most or all of them have either dropped the book or switched to offering it DRM-free.

In April Scalzi wrote: “As an author, I haven’t seen any particular advantage to DRM-laden eBooks; DRM hasn’t stopped my books from being out there on the dark side of the Internet. Meanwhile, the people who do spend money to support me and my writing have been penalized for playing by the rules.”

From the standpoint of preservation, the big problem with DRM e-books is that they will inevitably become unreadable in not too many years. Publishers will switch to new, incompatible DRM schemes or completely drop support for their older e-books. They can’t keep actively supporting old technology forever. I have no objection to it for enforcing limited access such as library loans, but if you buy a product with DRM, you’re really just leasing it for an unknown period of time.

I’ll be ordering the book shortly, and I’m waiting for the day when we can say of DRM in books for sale: “It’s dead, Jim.”

JPEG2000 thumbnails

I’ve been trying to find software for batch generation of thumbnails for JPEG2000 images. So far this is what I’ve looked at:

Kakadu is commercial software that looked hopeful at first, but the licensing is confusing. The description of the “Non-commercial, Named User Licence” says it “can only be purchased by individuals, Academic Institutions, not-for-profit organizations and libraries which do not gain financially by using this software,” but the license itself doesn’t say anything about licensing to institutions, only individuals. Our attempts to get a clarification have gotten no response. If they ignore us when we want to buy something, that doesn’t bode well for support.

OpenJPEG has its supporters, but its command-line tools can’t create JPEG, GIF, or PNG, and they can’t create images of a specified size. There are C functions which may or may not be directly callable, but their documentation is really scanty.

ImageMagick didn’t seem appealing at first because of its command-line orientation, but it may be the best option. JMagick provides a JNI connection. The documentation indicates it can generate images of a specified size and format, which is what we need.

If anybody reading this has other suggestions, let me know.

The Lib-Ray project

Just last weekend I got my first Blu-ray disc and found that it came with a warning that if I didn’t have the latest software updates on my player, it might not play. (It did play; the disc is far older than my player.) This annoyed me enough that I’m glad to hear of an open-source, non-DRM alternative to Blu-ray in the works. Lib-Ray is a project to create a high-definition video standard with “no DRM,” “no region codes,” “no secrets,” and “no limits.” There’s a Kickstarter page looking for funding for the project.

According to the current specification, Lib-Ray uses the Matroska (MKV) container format.

Creating a mass market for Lib-Ray player boxes sounds like a long shot, but it’s easy enough to imagine open-source software being developed and distributed that would let any modern computer play the discs. This could be a boon to anyone who wants to distribute high-quality video discs without DRM.

Some articles on Lib-Ray: