My previous post on format registries, which started out as a lament on the incomplete state of UDFR, resulted in an excellent discussion. Along the way I came upon Chris Rusbridge’s post pointing out that finding a solution doesn’t do much good if you don’t know what problem you’re trying to solve. This links to a post by Paul Wheatley on the same subject. Paul links back to this blog, nicely closing the circle.
So what are we trying to do? A really complete digital format registry sounds like a great idea, but what practical problem is it trying to solve? We know it’s got something to do with digital preservation. If we have a file, we need to know what format it’s in and what we can do about it. If it’s in a well-known format such as PDF or TIFF, there’s no real problem; it’s easy enough to find out all you need to know. It’s the obscure formats that need one-stop documentation. If you find a file called “importantdata.zxcv” and a simple dump doesn’t make sense of it, you need to know where to look. You need answers to questions like: “What format is it in?” “What category of information does it contain?” “How do I extract information from this file?” “How do I convert it with as little loss as possible into a better supported format?”
I have a 1994 O’Reilly book called Encyclopedia of Graphics File Formats. If old formats are a concern of yours, I seriously suggest searching for a copy. (Update: It turns out the book is available on fileformat.info!) It covers about a hundred different formats, generally in enough detail to give you a good start at implementing a reader. There are names which are still familiar: TIFF, GIF, JPEG. Many others aren’t even memories except to a few people. DKB? FaceSaver?
With some formats the authors just admit defeat in getting information. The case of Harvard Graphics (apparently no connection to Harvard University) is particularly telling. The book tells us:
Software Publishing, the originator of the Harvard Graphics format, considers this format to be proprietary. Although we wish this were not the case, we can hardly use our standard argument — that documenting and publicizing file formats make sales by seeding the aftermarket. Harvard Graphics has been the top, or one of the top, sellers in the crowded and cutthroat MS-DOS business graphics market, and has remained so despite the lack of cooperation of Software Publishing with external developers.
While we would be happy to provide information about the format if it were available, we have failed to find any during our research for this book, so it appears that Software Publishing has so far been successful in their efforts to restrict information flow from their organization.
This was once a widely used format, so if you’re handed an archive to turn into a useful form, you might get a Harvard Graphics file. How do you recognize it as one? That isn’t obvious. A little searching reveals you can still get a free viewer for older versions of Windows, but nothing is mentioned about converting it to other formats. Even knowing there’s software available isn’t helpful until you can determine that a file is Harvard Graphics.
If you have a file — it’s Harvard Graphics, but you don’t know that — what do you want from a registry? First, you want a clue about how to recognize it. An extension or a signature, perhaps. When you get that, you want to know what kind of data the file might hold: In this case, it’s presentation graphics. Then you want to know how to rescue the data. Knowing that the viewer exists would be a start. Knowing that technical information isn’t available (if that’s still true) would save fruitless searching.
Information like this is scattered and dynamic. If the Harvard Graphics spec isn’t publicly available now, it’s still possible for its proprietors to relent and publish it. The notion of one central source of wisdom on formats is an impossibility. What’s needed is a way to find the expertise, not to compile it all in one place.
We need to concentrate not on a centralized stockpile of information but on a common language for talking about formats. PRONOM uses one ontology. UDFR uses another. DBPedia doesn’t have an applicable standard. What I envision is any number of local repositories of formats, all capable of delivering information in the same way. The ones from the big institutions would carry the most trust, and they’d often share each other’s information. Specialists would fill in the gaps by telling us about obscure formats like uRay and Inset PIX, or they’d provide updates about JPEG2000 and EPub more regularly than the big generalists can. The job of the big institutions is to standardize the language so we aren’t overwhelmed by heterogeneous data.
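To make the idea concrete, here is a minimal sketch of what a shared record shape might let a client do: combine answers about one format from several repositories, letting the most trusted source win field by field while specialists fill gaps the big institutions leave. The field names and the trust-weighting scheme are invented for illustration; the point is the common structure, not these particular choices.

```python
# Sketch: merging records for one format from several repositories that all
# speak the same "common language". Field names and trust values are
# hypothetical, invented for this example.

def merge_records(records):
    """Combine per-repository records for a single format, letting values
    from more trusted sources overwrite those from less trusted ones."""
    merged = {}
    for rec in sorted(records, key=lambda r: r["trust"]):  # lowest trust first
        for key, value in rec.items():
            if key not in ("source", "trust") and value is not None:
                merged[key] = value  # later (higher-trust) values overwrite
    return merged

# A big institution knows the basics but not the rescue path...
big_institution = {"source": "national-library", "trust": 0.9,
                   "name": "Harvard Graphics",
                   "category": "presentation graphics",
                   "spec_available": False}

# ...while a specialist repository knows about the surviving viewer.
specialist = {"source": "format-specialist", "trust": 0.5,
              "name": "Harvard Graphics",
              "category": "presentation graphics",
              "viewer": "free viewer for older Windows versions",
              "spec_available": False}

print(merge_records([big_institution, specialist]))
```

The design choice worth noticing is that nothing here is centralized: any repository that emits records in the agreed shape can join, and trust weighting is a client-side policy rather than a property of the registry.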
Let’s look again at those questions I mentioned, as they could apply to this scenario.
What format is it in? The common language needs a way to ask this question. Given a file extension, or the hex representation of the first four bytes of the file, you’d like a candidate format, and there might be more than one. You’d like to be able to search across a set of repositories for possible answers.
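The question above can be sketched as a tiny lookup: given a filename and the first few bytes, return every candidate format whose extension or signature matches. The registry entries below are illustrative stand-ins, not authoritative signature data, and a real implementation would query a set of remote repositories rather than a hard-coded list.

```python
# Sketch of signature-and-extension format identification.
# These registry entries are illustrative examples, not authoritative data.
REGISTRY = [
    {"format": "PDF", "extensions": [".pdf"], "signature": b"%PDF"},
    {"format": "TIFF (little-endian)", "extensions": [".tif", ".tiff"],
     "signature": b"II*\x00"},
    {"format": "GIF", "extensions": [".gif"], "signature": b"GIF8"},
]

def identify(filename, first_bytes):
    """Return all candidate formats matching the extension or leading bytes."""
    candidates = []
    for entry in REGISTRY:
        ext_match = any(filename.lower().endswith(e) for e in entry["extensions"])
        sig_match = first_bytes.startswith(entry["signature"])
        if ext_match or sig_match:
            candidates.append(entry["format"])
    return candidates

print(identify("report.pdf", b"%PDF-1.4"))   # ['PDF']
print(identify("mystery.zxcv", b"GIF89a"))   # ['GIF'] -- the extension lies, the signature wins
```

Note that the function deliberately returns a list, not a single answer: ambiguity is the normal case, and narrowing it down is what the later questions are for.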
What category of information does it contain? When you get an answer about the format, it should tell you briefly what it’s for. If you got multiple answers in your first query, this might help to narrow it down.
How do I extract information? Now you want to get some amount of information, maybe just enough to tell you whether it’s worth pursuing the task or not. The registry will hopefully give you information on the technical spec or on available tools.
How do I convert it? When you decide that the file has valuable information but isn’t sufficiently accessible as it stands, you need to look for conversion software. A central registry has to be cautious about what it recommends. A plurality of voices can offer more options (and, to be sure, more risk).
This vision is what I’d like to call ODFR — the Open Digital Format Registry — even though it wouldn’t be a single registry at all.
If you’ll forgive me playing Devil’s Advocate a little, what’s wrong with what we’ve got? We’ve got identification tools we can use and improve in order to address the first question, and Google for the rest! I looked up ‘Inset PIX’ and the top hit was a nice detailed format spec. So what’s the problem? On top of that, the scenario you’ve outlined is an almost entirely manual process, so what’s the advantage of having a machine-readable data model?
If we do make this work, but it’s not centralised, how do we use it? What queries do we launch where?
Sorry for being a bore about this, but I think we need some solid examples of how it will make things better in order to encourage people to invest in it.
I can think of two answers to your question. First, the preservationist doesn’t generally start knowing what format something is in. If you do, it makes things much easier. More typically, though, you start with a file and have to figure out what it is from things like its extension and signature.
If this were the only problem, then a tool like DROID plus a search engine might be enough, but the second issue is the quality of what search engines actually turn up. It’s difficult to get the granularity that’s needed. I’ve found search engines have gotten less and less useful over the years as they try to guess what you really want rather than giving you control over the search. There’s also a tendency to weight new documents heavily over older ones, making it hard to get information about older versions of a format.
I didn’t really talk about how the information in the hypothetical common language would be produced. It could be manual or rely on data mining of documents that are reasonably regular to start with. The question is how to make it worth the effort for anyone to do it and how to herd the cats in something like the same direction. The effort has to be low and the benefit in visibility enough to justify it. One possible lever would be to work on a format-specific ontology for DBPedia, with some reasonable commonality with existing format-related ontologies. Common, or at least easily mapped, terminology is a key.
I’m working on a proof-of-concept demo of querying a couple of linked data sources, DBPedia and PRONOM (or maybe P2; the PRONOM SPARQL endpoint doesn’t seem to be responding) to start with. I’ll make it available as a Java application to help get an idea of what sort of queries might be possible.
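As a rough sketch of the kind of query such a demo might issue, here is one way to build a SPARQL request against the public DBPedia endpoint. The property names used in the query (`dbo:extension`, `rdfs:label`) are assumptions about the DBPedia ontology rather than verified mappings, and the query is only constructed here, not actually sent over the network.

```python
# Sketch of a SPARQL query a format-lookup demo might send to DBPedia.
# The dbo:extension property is an assumption about the DBPedia ontology,
# not a verified mapping; a real client would check the available predicates.
from urllib.parse import urlencode

ENDPOINT = "https://dbpedia.org/sparql"  # public DBPedia SPARQL endpoint

def build_query(extension):
    """Build a SPARQL query for resources associated with a file extension."""
    return f"""
        PREFIX dbo: <http://dbpedia.org/ontology/>
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        SELECT ?format ?label WHERE {{
            ?format dbo:extension "{extension}" ;
                    rdfs:label ?label .
            FILTER (lang(?label) = "en")
        }} LIMIT 10
    """

def request_url(extension):
    """The URL a client could GET to run the query (not executed here)."""
    return ENDPOINT + "?" + urlencode({"query": build_query(extension),
                                       "format": "json"})

print(request_url(".tiff")[:60] + "...")
```

The same query-building code could target PRONOM’s endpoint (or a mirror of it) simply by swapping the endpoint URL and prefixes, which is exactly the appeal of a common query language across repositories.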
For those following this discussion, I just want to point you all to Gary’s follow-up article: https://fileformats.wordpress.com/2012/09/06/sparql/
Your post seemed to be assuming something about what we are trying to achieve with digital preservation. Answers to some of your questions would be easier to come up with if we had a common goal that we were aiming for in the field.
For example, are we trying to maintain an ability to interact with objects? Or are we trying to maintain an ability to interact with some (often seemingly random) portion of the content of objects?
Until we answer that fundamental question I suspect we will keep going around and around in circles about the format identification and format registries issue.
If you are interested, I have written about some of these things in the comments here: http://www.openplanetsfoundation.org/comment/371 and in my post that discusses how potentially unimportant format validation and identification is here: http://digitalcontinuity.org/post/20400819609/incorporating-emulation-into-a-business-as-usual
Thanks for the thought provoking post.