
The disappearing format blues

Old formats sometimes fade into obscurity and lose support, even when they come from a company as big as Microsoft. Chris Rusbridge has noted that Microsoft’s Open Specifications page only goes as far back as Office 97, and that PowerPoint 4.0 files can’t be opened with today’s Microsoft Office. Tony Hey, vice president of Microsoft Research Connections, has replied. The response was encouraging, particularly in suggesting that Microsoft might “participate in a ‘crowd source’ project working with archivists to create a public spec of these old file formats.”

There’s usually some kind of software around that can read old formats. In this case, a search doesn’t turn up a lot; there’s something called PowerPressed, which will wrap old PowerPoint files in a .exe application. It looks as if it should run on current Windows systems, but all I know is what that page says.

The situation shows the risk of using a format that isn’t publicly documented. Today this is less of a problem. I think it’s been shown that publishing format specs doesn’t lead to cannibalization of sales by competing software; the company that created the spec is in a position to produce the best implementation. The specifications for PDF and PostScript are fully public, and Adobe still dominates the market for software that supports them. Publishing the spec has just made the pie bigger. There’s still quite a lot of software that uses unpublished proprietary specs, though, and it’s risky to count on the long-term usability of the files it produces.

Registry browser update

I’ve made some changes to the format registry browser since yesterday. They include a help page, the ability to use the “/” (slash) character in searches (very helpful when searching MIME types), and links to the registry entries from search results (not yet working right for PRONOM).

I attempted to make the search fields persist through a session, but that isn’t working, even though it works in the local emulation environment. Hopefully I’ll figure that out.

Google App Engine is a pain to work with, even though it’s free and has a number of simplifying features. It’s good for getting a quick demo up, though.

To get started, you need to get an Eclipse plugin from Google. Then you need to create a Google web application project, which needs to be in just the structure they want. It needs to have a top-level directory called “war,” and that needs to have a file called WEB-INF/appengine-web.xml. If you’re starting a project from scratch, that’s not too heavy a requirement; other web application servers will just ignore that special file. But since I was working from an existing project, the differences were just enough that I had to create a separate Eclipse project for the Google version. Still, not too bad. The project is there and running. I don’t even need to run Ant; the plugin magically finds my classes. It also provides an emulation environment and simple uploading.
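
For reference, here’s a minimal sketch of what that special file looks like; the application ID below is a placeholder, and a real project may need more settings than this:

    <?xml version="1.0" encoding="utf-8"?>
    <!-- war/WEB-INF/appengine-web.xml: App Engine's deployment descriptor.
         "registry-browser-demo" is a placeholder application ID. -->
    <appengine-web-app xmlns="http://appengine.google.com/ns/1.0">
      <application>registry-browser-demo</application>
      <version>1</version>
    </appengine-web-app>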

This morning I was working on a few enhancements on the main line when the Google version spontaneously rebuilt itself. The console reported:


DataNucleus Enhancer (version 3.1.0.m2) : Enhancement of classes
DataNucleus Enhancer completed with success for 0 classes. Timings : input=401 ms, enhance=0 ms, total=401 ms. Consult the log for full details
DataNucleus Enhancer completed and no classes were enhanced. Consult the log for full details

Now there were errors in Java files that are used only in the GUI version and reference AWT and Swing classes. An example: “java.awt.Dimension is not supported by Google App Engine's Java runtime environment.” Fortunately, my code is clean enough that I could fix the problem by deleting a few classes and turning one into a stub. Still, such a pain. There certainly are web applications that use AWT for offscreen drawing, and they just won’t work with Google. There have been complaints about this.
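
To illustrate the stub approach (the class here is hypothetical, not my actual code): keep the methods that other code calls, but give them empty bodies so nothing touches the banned packages.

    // Hypothetical stub standing in for a Swing-based class, so the rest
    // of the code compiles on App Engine, which rejects java.awt and
    // javax.swing. The GUI build has a real implementation.
    public class StatusWindow {
        /** In the GUI build this updates a Swing dialog; here, a no-op. */
        public void showStatus(String message) {
            // Intentionally empty: no AWT/Swing allowed in this build.
        }
    }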

The environment is good enough for its purpose, but I wouldn’t try to do serious work with it.

Worldwide file ID hackathon

What happens when you get a bunch of developers from all over the world together on the Internet for one day of intensive work? A lot! For one thing, there’s the “Louis Wu’s birthday” effect; this “24-hour hackathon” was more like 48 hours. (In Larry Niven’s Ringworld, Wu makes his birthday party last 48 hours by hopping from time zone to time zone with teleporters.) We didn’t have teleporters, so we made do with Twitter, IRC, and Google Hangouts. People in Australia started, and things wound down on the US west coast, or maybe Hawaii.

Several things were happening, but the two most notable from my perspective were the Format Corpus project and the fork of FITS.

I watched the Format Corpus project with interest, though I didn’t participate in it. This is an openly licensed set of small example files in a wide variety of formats, as well as signature information. It could have a lot of uses; I’ll need to incorporate it into JHOVE testing.

People had been talking in advance of the hackathon about the need to improve the efficiency of FITS, a meta-tool developed by Harvard’s OIS (now LTS) to run various validation tools together on files. Internal ingest was and is the main purpose of FITS, but it was released as open source and has been used in other places. I’d never worked on FITS proper at OIS (though I wrote parts of OTS-Schemas, which was broken out of FITS), but I’m familiar with the OIS style of coding, so I forked it on GitHub and started looking at it. When Randy Stern at Harvard expressed concern that the fork would create confusion (though I’d included a clear disclaimer from the beginning that it wasn’t the official version), I renamed it OpenFITS.

The work is summarized on the hackathon wiki. The results are unclear at this point, but just opening the code up to more eyes could produce long-term benefits. The very first file I tested FITS on turned up a bug in JHOVE, and I wound up doing more work improving JHOVE than FITS. One potentially significant improvement I added was the ability to specify local copies of any XML schema. Without it, if you’re validating a lot of XML files that use the same schema, JHOVE has to fetch the schema from the Web over and over, slowing the processing down. Taking advantage of this requires local configuration, since every installation could need different schemas. The code is checked in but not available in a build yet.

It was thrilling to get to work with such an enthusiastic crowd from so many different places and, in a single 48-hour day, to see other people picking up my work and running with it. I think there are already two or three third-generation forks of OpenFITS, including a Debian-Ubuntu package.

Embracing the chaos of formats

We often think of formats in terms of specifications and standards, and this can be useful. If you want to know exactly what the Latin-1 encoding is, you can look at the ISO 8859-1 standard and it will tell you. However, the standards aren’t always a reliable guide to what’s out there. Someone noticed that ISO 8859 reserves a block of rarely used control codes and put additional printing characters there. This got codified as well, as Windows 1252 (which Microsoft falsely claims as an ANSI standard), but there are many ad hoc or obscure encodings for which references are hard or impossible to find.

Earth’s official authorities refused to grant the Klingons a place in Unicode for their characters; nonetheless, there is an unofficial registry that uses part of the Unicode Private Use Area for Klingon and other constructed scripts. Is it official Unicode? No. If you use code points F8D0-F8FF, will others recognize them as Klingon characters? Sometimes.

I’ve written about the TIFF situation before. The TIFF 6.0 spec is an insufficient guide to today’s real-life TIFF. You have to go through scattered tech notes to understand how it’s really used.

Understanding situations like these requires understanding that formats don’t flow unchanged from the minds of their designers to their implementation in the world’s computers. People change things to meet their needs. This makes them more useful for some purposes; at the same time, it makes them more confusing. The only alternative would be to create a format police force with the power to arrest and punish innovators.

The situation is analogous to natural language. You can insist that anything that disagrees with the grammar books is wrong, but if everybody talks that way, there ain’t no stoppin’ it. At the same time, the grammar books put a brake on unnecessary change, keeping the language from breaking down into a thousand mutually unintelligible dialects.

Digital preservationists have to look at the actual usage of formats, not just their official specifications. This doesn’t mean they should accept every deviation, but they need to acknowledge changes that have become de facto standards. Context matters; an archive of nineteenth-century literature doesn’t have to be concerned with Klingon characters, but an archive of science fiction fan literature had better take them into account. Even an occasional scholarly paper might have a word or two in the pIqaD script.

This proliferation of variants is a big part of why centralized registries of format information don’t work. Not only is there too much information, it keeps changing. The best we can hope for is a coordinated way of finding our way through a chaotic body of information.

“Just solve the problem” month begins

Today is the start of a month which some digital preservationists have declared “Just Solve the Problem” month. I’ve already expressed a mixture of skepticism and hope for this; throwing resources pell-mell at a computer problem rarely works, but some good is bound to come of the effort. We will not come out of November with “the problem” solved, but there will be new resources, such as this page of links to format information. (This blog is included in the list.)

I’m working on a list of plain text formats, expanding on my earlier post on the subject. It will appear on garymcgath.com, hopefully within the next week. I’ve also started a page on the wiki on tools, seeded with a relevant subset of the list on my own site and restricted to locally runnable applications.

Between this and the CURATEcamp hackathon on November 16, lots of interesting stuff is happening in preservation this month.

PDF/A-3

The latest version of PDF/A, a subset of PDF suitable for long-term archiving, is now available as ISO standard 19005-3:2012. According to the PDF/A Association Newsletter, “there is only one new feature with PDF/A-3, namely that any source format can be embedded in a PDF/A file.”

This strikes me as a really bad idea. The whole point of PDF/A is to restrict content to a known, self-contained set of options. The new version provides a back door that allows literally anything. The intent, according to the article, is to let archivists save documents in their original format as well as their PDF representation. Certainly saving the originals is a good archiving practice, but it should be done in an archival package, not in a PDF format designed for archiving.

Mission creep afflicts projects of all kinds, and this is a case in point.

A field guide to “plain text”

In some ways, plain text is the best preservation format. It’s simple and easily identified. It’s resilient when damaged; if a file is half corrupted, the other half is still readable. There’s just one little problem: What exactly is plain text?

ASCII is OK for English, if you don’t have any accented characters, typographic quotes, or fancy punctuation. It doesn’t work very well for any other language. It even has problems outside the US, such as the lack of a pound sterling symbol; there’s a reason some people prefer the name US-ASCII. You’ll often find that supposed “ASCII” text has characters outside the 7-bit range, just enough of them to throw you off. Once this happens, it can be very hard to tell what encoding you’ve got.

Even if text looks like ASCII and doesn’t have any high bits set, it could be one of the other encodings of the ISO 646 family. These haven’t been used much since ISO 8859 came out in the late eighties, but you can still run into old text documents that use them. Since all the members of the family are seven-bit codes and differ from ASCII in just a few characters, it’s easy to mistake, say, a French ISO 646 file for ASCII and turn all the accented e’s into curly braces. (I won’t get into prehistoric codes like EBCDIC, which at least can’t be mistaken for anything else.)

The ISO 8859 encodings have the same problem, pushed to the 8-bit level. If you’re in the US or western Europe and come upon 8-bit text which doesn’t work as UTF-8, you’re likely to assume it’s ISO 8859-1, aka Latin-1. There are, however, over a dozen variants of 8859. Some are very different from Latin-1 in the codes above 127, but others differ in only a few characters; ISO 8859-9 (Latin-5, or “Turkish Latin-1”) and ISO 8859-15 (Latin-9) are both very close to Latin-1. Microsoft added to the confusion with the Windows 1252 encoding, which turns some control codes in Latin-1 into printing characters. It used to be common to claim 1252 was an ANSI standard, even though it never was.
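
The Windows 1252 difference is easy to see in code. This little Java sketch decodes the byte 0x93 both ways; both charset names are standard in Java:

    import java.nio.charset.Charset;

    // Decode one byte under two encodings that are often confused.
    public class CodePageDemo {
        public static void main(String[] args) {
            byte[] data = { (byte) 0x93 };
            String cp1252 = new String(data, Charset.forName("windows-1252"));
            String latin1 = new String(data, Charset.forName("ISO-8859-1"));
            // Prints U+201C, a left double quotation mark, for Windows 1252
            System.out.printf("windows-1252: U+%04X%n", (int) cp1252.charAt(0));
            // Prints U+0093, an unprintable C1 control code, for Latin-1
            System.out.printf("ISO-8859-1:   U+%04X%n", (int) latin1.charAt(0));
        }
    }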

UTF-8, even without a byte order mark (BOM), has a good chance of being recognized without a lot of false positives; if a text file has characters with the high bit set and an attempt to decode it as UTF-8 doesn’t result in errors, it most likely is UTF-8. (I’m not discussing UTF-16 and UTF-32 here because they don’t look at all ASCII-like.) Even so, some ISO 8859 files can look like good UTF-8 and vice versa.
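
That test is easy to automate. Here’s a minimal Java sketch; the key point is configuring the decoder to report malformed input instead of silently substituting replacement characters, which is the default:

    import java.nio.ByteBuffer;
    import java.nio.charset.CharacterCodingException;
    import java.nio.charset.Charset;
    import java.nio.charset.CodingErrorAction;

    // Rough UTF-8 sniff: a strict decode either succeeds or throws.
    public class Utf8Check {
        public static boolean looksLikeUtf8(byte[] data) {
            try {
                Charset.forName("UTF-8").newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
                return true;   // decoded cleanly; very likely UTF-8
            } catch (CharacterCodingException e) {
                return false;  // invalid sequence; some other encoding
            }
        }
    }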

So plain text is really simple — or maybe not.

Unicode

Words: Gary McGath, Copyright 2003
Music: Shel Silverstein, “The Unicorn”

A long time ago, on the old machines,
There were more kinds of characters than you’ve ever seen.
Nobody could tell just which set they had to load,
They wished that somehow they could have one kind of code.

   There was US-ASCII, simplified Chinese,
   Arabic and Hebrew and Vietnamese,
   And Latin-1 and Latin-2, but don’t feel snowed;
   We’ll put them all together into Unicode.

The users saw this Babel and it made them blue,
So a big consortium said, “This is what we’ll do:
We will take this pile of sets and give each one its place,
Using sixteen bits or thirty-two, we’ve lots of space

   For the US-ASCII, simplified Chinese,
   Arabic and Hebrew and Vietnamese,
   And Latin-1 and Latin-2, we’ll let them load
   In a big set of characters called Unicode.

The Klingons arrived when they heard the call,
And they saw the sets of characters, both big and small.
They said to the consortium, “Here’s what we want:
Just a little bit of space for the Klingon font.”

   “You’ve got US-ASCII, simplified Chinese,
   Arabic and Hebrew and Vietnamese,
   And Latin-1 and Latin-2, but we’ll explode
   You if you don’t put Klingon characters in Unicode.”

The Unicode Consortium just shook their heads,
Though the looks that they were getting caused a sense of dread.
“The set that we’ve assembled is for use on Earth,
And a foreign planet is the Klingons’ place of birth.”

   We’ve got US-ASCII, simplified Chinese,
   Arabic and Hebrew and Vietnamese,
   And Latin-1 and Latin-2, but you can’t goad
   Us into putting Klingon characters in Unicode.

The Klingons grew as angry as a minotaur;
They went back to their spaceship and declared a war.
Three hundred years ago this happened, but they say
That’s why the Klingons still despise the Earth today.

   We’ve got US-ASCII, simplified Chinese,
   Tellarite and Vulcan and Vietnamese,
   And Latin-1 and Latin-2, but we’ll be blowed
   If we’ll put the Klingon language into Unicode.

Preservation in the geek mainstream

Digital preservation issues are gaining notice in the geek mainstream, the large body of people who are computer-savvy but don’t live in the library-archive niche. Today we have an article in The Register, “British library tracks rise and fall of file formats.” It cites the British Library’s Andy Jackson, supporting the view that file formats remain usable for many years, even if they’re no longer the latest thing.

The Register article is short but nicely done. It naturally skips over issues which Andy’s original article deals with, like just how you reliably determine the formats of files. What’s significant is that it shows that the long-term usability of files isn’t just the concern of a few specialists.

The URI namespace problem

Tying XML schemas to URIs was the worst mistake in the history of XML. Once you publish a schema URI and people start using it, you can’t change it without major disruption.

URIs aren’t permanent. Domains can disappear or change hands. Even subdomains can vanish with organizational changes. When I was at Harvard, I repeatedly reminded people that hul.harvard.edu can’t go away with the deprecation of the name “Harvard University Library/Libraries,” since it houses schemas for JHOVE and other applications. Time will tell whether it stays.

Strictly speaking, a URI is a Uniform Resource Identifier and has no obligation to correspond to a web page; the W3C says a URI used as a schema identifier is only a name. In practice, though, treating it as a URL may be the only way to locate the XSD. When a URI uses the http scheme, it’s an invitation to use it as a URL.

Even if a domain doesn’t go away, it can be burdened with schema requests beyond its hosting capacity. The Harvard Library has been trying to get people to upgrade to the current version of JHOVE, which uses an entity resolver, but the last I checked, its server was still being hit heavily by three sites that hadn’t upgraded. Those users don’t pay anything, so there’s no money to put into more server capacity.

The best solution available is for software to resolve schema names to local copies (e.g. with Java’s EntityResolver). This solution often doesn’t occur to people until there’s a problem, though, and by then there may be lots of copies of the old software out in the field.
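
As a sketch of that approach (the schema URI and local path here are made-up examples), a resolver just maps known system IDs to local files and lets everything else fall through:

    import java.io.File;
    import org.xml.sax.EntityResolver;
    import org.xml.sax.InputSource;

    // Resolves one known schema URI to a local copy instead of the Web.
    // The URI and file path are placeholders for illustration.
    public class LocalSchemaResolver implements EntityResolver {
        public InputSource resolveEntity(String publicId, String systemId) {
            if ("http://example.org/ns/metadata.xsd".equals(systemId)) {
                File local = new File("/etc/myapp/schemas/metadata.xsd");
                return new InputSource(local.toURI().toString());
            }
            return null;  // unknown ID: let the parser resolve it normally
        }
    }

Install it with setEntityResolver on the SAX reader or DocumentBuilder, and the parser never needs the network for that schema.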

For archival storage, keeping a copy of any needed schema files should be a requirement. Resources inevitably disappear from the Web, including schemas. My impression is that a lot of digital archives don’t have such a rule and blithely assume that the resources will be available on the Web forever. This is a risk that could be eliminated at virtually zero cost, and it should be.

It’s legitimate to stop making a URI usable as a URL, though it may be rude. W3C’s Namespaces in XML 1.0 says: “The namespace name, to serve its intended purpose, SHOULD have the characteristics of uniqueness and persistence. It is not a goal that it be directly usable for retrieval of a schema (if any exists).” (Emphasis added) That implies that any correct application really should do its own URI resolution.

One thing that isn’t legitimate, but I’ve occasionally seen, is replacing a schema with a new and incompatible version under the same URI. That can cause serious trouble for files that use the old schema. A new version of a schema needs to have a new URI.

The schema situation creates problems for hosting sites, applications, and archives. It’s vital to remember that you can’t count on the URI’s being a valid URL in the long term.

If you’ve got one of those old versions of JHOVE (1.5 and older, I think), please upgrade. The new versions are a lot less buggy anyway.

Format conformity

By design JHOVE measures strict conformity to file format specifications. I’ve never been convinced this is the best way to measure a file’s viability or even correctness, but it’s what JHOVE does, and I’d just create confusion if I changed it now.

In general, the published specification is the best measure of a file’s correctness, but there are clearly exceptions, and correctness isn’t the same as viability for preservation. Let’s look at the rather extreme case of TIFF.

The current official specification of TIFF is Revision 6.0, dated June 3, 1992. The format hasn’t changed a byte in over 20 years — except that it has.

The specification says about value offsets in IFDs: “The Value is expected to begin on a word boundary; the corresponding Value Offset will thus be an even number.” This is a dead letter today. Much TIFF generation software freely writes values on any byte boundary, and just about all currently used readers accept them. JHOVE initially didn’t accept files with odd byte alignment as well-formed, but after numerous complaints it added a configuration option to allow them.
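
Here’s a rough Java sketch of what checking that rule involves (an illustration, not JHOVE’s actual code, and simplified to the first IFD only):

    import java.io.IOException;
    import java.io.RandomAccessFile;

    // Walk the first IFD of a TIFF file and report value offsets that
    // violate the spec's word-alignment rule. Real TIFFs chain multiple
    // IFDs and need far more validation than this.
    public class TiffOffsetCheck {
        // Byte size of each TIFF field type, indexed by type code 1-12.
        private static final int[] TYPE_SIZE =
            { 0, 1, 1, 2, 4, 8, 1, 1, 2, 4, 8, 4, 8 };

        public static void main(String[] args) throws IOException {
            try (RandomAccessFile in = new RandomAccessFile(args[0], "r")) {
                boolean little = in.readUnsignedByte() == 'I'; // "II" = Intel order
                in.skipBytes(3);                  // rest of the 4-byte header
                in.seek(readU32(in, little));     // jump to the first IFD
                int entries = readU16(in, little);
                for (int i = 0; i < entries; i++) {
                    int tag = readU16(in, little);
                    int type = readU16(in, little);
                    long count = readU32(in, little);
                    long value = readU32(in, little);
                    long size = (type >= 1 && type <= 12)
                        ? TYPE_SIZE[type] * count : 0;
                    // The last field is an offset only when the value
                    // doesn't fit in 4 bytes; then the spec wants it even.
                    if (size > 4 && value % 2 != 0)
                        System.out.printf("Tag %d: odd value offset %d%n",
                                          tag, value);
                }
            }
        }

        static int readU16(RandomAccessFile in, boolean little)
                throws IOException {
            int a = in.readUnsignedByte(), b = in.readUnsignedByte();
            return little ? (b << 8) | a : (a << 8) | b;
        }

        static long readU32(RandomAccessFile in, boolean little)
                throws IOException {
            long lo = readU16(in, little), hi = readU16(in, little);
            return little ? (hi << 16) | lo : (lo << 16) | hi;
        }
    }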

Over the years a body of apocrypha has grown around TIFF. Some comes from Adobe, some not. The titles of the ones from Adobe don’t clearly mark them as revisions to TIFF, but they are. The “Adobe PageMaker® 6.0 TIFF Technical Notes,” September 14, 1995, define the important concept of SubIFD, among other changes. The “Adobe Photoshop® TIFF Technical Notes,” March 22, 2002, define new tags and forms of compression. The “Adobe Photoshop® TIFF Technical Note 3,” April 8, 2005, adds new floating point types. The last one isn’t available, as far as I can tell, on Adobe’s own website, but it’s canonical.

Then there’s material that never had official Adobe approval, or got it only after the fact. The JPEG compression defined in the 2002 tech notes, for instance, is an official acceptance of a 1995 draft note that had already gained wide use.

What’s the best measure of a TIFF file? That it corresponds strictly to TIFF 6.0? To 6.0 plus a scattered set of tech notes? Or that it’s processed correctly by LibTiff, a freely available and very widely used C library? To answer the question, we have to specify: Best for what? If we’re talking about the best chance of preservation, what scenarios are we envisioning?

One scenario amounts to a desert-island situation in which you have a specification, some files that you need to render, and a computer. You don’t have any software to go by. In this case, conformity to the spec is what you need, but it’s a rather unlikely scenario. If all existing TIFF readers disappear, things have probably gone so far that no one will be motivated to write a new one.

It’s more likely that people a few decades in the future will scramble to find software, or entire old computers, that can read obsolete formats. This doesn’t necessarily mean today’s software, but what we can read today can be a pretty good guide to what will be readable in the future. Insisting on conformity to the spec may be erring on the safe side, but if it excludes a large body of valuable files, it’s not a good choice.

Rather than insisting solely on conformity to a published standard, we need to measure preservation-worthiness by balancing two risks: accepting files that will cause reading problems down the road, and rejecting files that won’t. Multiple factors come into consideration, of which the spec is just one.