PDF/A-3

The latest version of PDF/A, a subset of PDF suitable for long-term archiving, is now available as ISO standard 19005-3:2012. According to the PDF/A Association Newsletter, “there is only one new feature with PDF/A-3, namely that any source format can be embedded in a PDF/A file.”

This strikes me as a really bad idea. The whole point of PDF/A is to restrict content to a known, self-contained set of options. The new version provides a back door that allows literally anything. The intent, according to the article, is to let archivists save documents in their original format as well as their PDF representation. Certainly saving the originals is a good archiving practice, but it should be done in an archival package, not in a PDF format designed for archiving.

Mission creep afflicts projects of all kinds, and this is a case in point.

A field guide to “plain text”

In some ways, plain text is the best preservation format. It’s simple and easily identified. It’s resilient when damaged; if a file is half corrupted, the other half is still readable. There’s just one little problem: What exactly is plain text?

ASCII is OK for English, if you don’t have any accented words, typographic quotes, or fancy punctuation. It doesn’t work very well for any other language. It even has problems outside the US, such as the lack of a pound sterling symbol; there’s a reason some people prefer the name US-ASCII. You’ll often find that supposed “ASCII” text has characters outside the 7-bit range, just enough of them to throw you off. Once this happens, it can be very hard to tell what encoding you’ve got.

Even if text looks like ASCII and doesn’t have any high bits set, it could be one of the other encodings of the ISO 646 family. These haven’t been used much since ISO 8859 came out in the late eighties, but you can still run into old text documents that use them. Since all the members of the family are seven-bit codes and differ from ASCII in just a few characters, it’s easy to mistake, say, a French ISO-646 file for ASCII and turn all the accented e’s into curly braces. (I won’t get into prehistoric codes like EBCDIC, which at least can’t be mistaken for anything else.)
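To make the curly-brace problem concrete, here is a minimal Python sketch. Python has no built-in codec for the French ISO 646 variant, so the mapping table below is my own assumption, reconstructed from the NF Z 62-010 code chart; treat it as illustrative rather than authoritative. The function name decode_french_646 is likewise made up for this example.

```python
# Assumed table: positions where the French ISO 646 variant (NF Z 62-010)
# differs from US-ASCII. Reconstructed for illustration, not from a codec.
FRENCH_646_TO_UNICODE = {
    0x23: "£", 0x40: "à", 0x5B: "°", 0x5C: "ç", 0x5D: "§",
    0x60: "µ", 0x7B: "é", 0x7C: "ù", 0x7D: "è", 0x7E: "¨",
}


def decode_french_646(data: bytes) -> str:
    """Decode 7-bit bytes as French ISO 646 using the assumed table above."""
    if any(b >= 0x80 for b in data):
        raise ValueError("not a 7-bit encoding: high bit set")
    return "".join(FRENCH_646_TO_UNICODE.get(b, chr(b)) for b in data)


raw = b"caf{ cr}me"

# Nothing in the bytes says which reading is right; both decodes succeed.
as_ascii = raw.decode("ascii")
as_french = decode_french_646(raw)

print(as_ascii)
print(as_french)
```

Both decodes succeed without error, which is the whole problem: only knowledge of where the file came from, or a human looking at the result, can tell you which one is right.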

The ISO 8859 encodings have the same problem, pushed to the 8-bit level. If you’re in the US or western Europe and come upon 8-bit text which doesn’t work as UTF-8, you’re likely to assume it’s ISO 8859-1, aka Latin-1. There are, however, over a dozen variants of 8859. Some are very different in codes above 127, but some have only a few differences. ISO 8859-9 (Latin-5 or “Turkish Latin-1”) and ISO 8859-15 (Latin-9) are both very similar to Latin-1. Microsoft added to the confusion with the Windows 1252 encoding, which turns some control codes in Latin-1 into printing characters. It used to be common to claim 1252 was an ANSI standard, even though it never was.
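A quick way to see how close these encodings are is to decode every possible byte value with two codecs and list the ones that come out differently. This is a small sketch using Python’s standard codecs; the helper name differing_bytes is hypothetical. Bytes that a codec leaves undefined are decoded with errors="replace" so the comparison doesn’t stop on them.

```python
def differing_bytes(codec_a: str, codec_b: str) -> list:
    """Return the byte values (0-255) that the two codecs decode differently."""
    diffs = []
    for value in range(256):
        one = bytes([value])
        # errors="replace" keeps going when a codec has no mapping for a byte
        if one.decode(codec_a, errors="replace") != one.decode(codec_b, errors="replace"):
            diffs.append(value)
    return diffs


for other in ("iso-8859-9", "iso-8859-15", "cp1252"):
    diffs = differing_bytes("iso-8859-1", other)
    print(other, [hex(v) for v in diffs])
```

A file that never happens to use one of the differing byte values decodes identically under both encodings, so inspecting the bytes alone can’t settle which one was intended.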

UTF-8, even without a byte order mark (BOM), has a good chance of being recognized without a lot of false positives; if a text file has characters with the high bit set and an attempt to decode it as UTF-8 doesn’t result in errors, it most likely is UTF-8. (I’m not discussing UTF-16 and UTF-32 here because they don’t look at all ASCII-like.) Even so, some ISO 8859 files can look like good UTF-8 and vice versa.
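The heuristic described above can be sketched in a few lines of Python. The function name and the three labels it returns are invented for this example; a real identification tool would do considerably more.

```python
def guess_text_encoding(data: bytes) -> str:
    """Rough guess following the heuristic in the text; labels are made up."""
    if all(b < 0x80 for b in data):
        # Pure 7-bit data: ASCII, or possibly another ISO 646 variant
        return "7-bit"
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        # High bits set but not valid UTF-8: some 8-bit encoding
        return "unknown-8bit"
    return "probably-utf-8"


utf8_bytes = "café".encode("utf-8")
latin1_bytes = "café".encode("iso-8859-1")
# A false positive: these two Latin-1 characters happen to form valid UTF-8
tricky_latin1 = "Ã©".encode("iso-8859-1")

for sample in (b"plain", utf8_bytes, latin1_bytes, tricky_latin1):
    print(sample, guess_text_encoding(sample))
```

The last sample is the awkward case: it was written as Latin-1, but its bytes are also a valid UTF-8 sequence, so the heuristic calls it UTF-8.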

So plain text is really simple — or maybe not.

Unicode

Words: Gary McGath, Copyright 2003
Music: Shel Silverstein, “The Unicorn”

A long time ago, on the old machines,
There were more kinds of characters than you’ve ever seen.
Nobody could tell just which set they had to load,
They wished that somehow they could have one kind of code.

   There was US-ASCII, simplified Chinese,
   Arabic and Hebrew and Vietnamese,
   And Latin-1 and Latin-2, but don’t feel snowed;
   We’ll put them all together into Unicode.

The users saw this Babel and it made them blue,
So a big consortium said, “This is what we’ll do:
We will take this pile of sets and give each one its place,
Using sixteen bits or thirty-two, we’ve lots of space

   For the US-ASCII, simplified Chinese,
   Arabic and Hebrew and Vietnamese,
   And Latin-1 and Latin-2, we’ll let them load
   In a big set of characters called Unicode.”

The Klingons arrived when they heard the call,
And they saw the sets of characters, both big and small.
They said to the consortium, “Here’s what we want:
Just a little bit of space for the Klingon font.”

   “You’ve got US-ASCII, simplified Chinese,
   Arabic and Hebrew and Vietnamese,
   And Latin-1 and Latin-2, but we’ll explode
   You if you don’t put Klingon characters in Unicode.”

The Unicode Consortium just shook their heads,
Though the looks that they were getting caused a sense of dread.
“The set that we’ve assembled is for use on Earth,
And a foreign planet is the Klingons’ place of birth.”

   We’ve got US-ASCII, simplified Chinese,
   Arabic and Hebrew and Vietnamese,
   And Latin-1 and Latin-2, but you can’t goad
   Us into putting Klingon characters in Unicode.

The Klingons grew as angry as a minotaur;
They went back to their spaceship and declared a war.
Three hundred years ago this happened, but they say
That’s why the Klingons still despise the Earth today.

   We’ve got US-ASCII, simplified Chinese,
   Tellarite and Vulcan and Vietnamese,
   And Latin-1 and Latin-2, but we’ll be blowed
   If we’ll put the Klingon language into Unicode.

JHOVE format notes

New on my business website: JHOVE format notes.

Preservation in the geek mainstream

Digital preservation issues are gaining notice in the geek mainstream, the large body of people who are computer-savvy but don’t live in the library-archive niche. Today we have an article in The Register, “British library tracks rise and fall of file formats.” It cites the British Library’s Andy Jackson, supporting the view that file formats remain usable for many years, even if they’re no longer the latest thing.

The Register article is short but nicely done. It naturally skips over issues which Andy’s original article deals with, like just how you reliably determine the formats of files. What’s significant is that it shows that concern about the long-term usability of files isn’t just a concern of a few specialists.

What happened to ID3.org?

I’ve been doing some work today on extraction of ID3 metadata from audio files, and I noticed that id3.org is currently a squatter site. Search engines still point at it for ID3-related queries, so I assume this is a relatively recent event. Does anyone know what happened?

The whois info says it’s registered by “Domain Privacy Group,” an operation in Burlington, Mass., with an invalid HTTPS certificate and a secretive website. The last change to the domain registration was pretty recent, on October 2, 2012.

The URI namespace problem

Tying XML schemas to URIs was the worst mistake in the history of XML. Once you publish a schema URI and people start using it, you can’t change it without major disruption.

URIs aren’t permanent. Domains can disappear or change hands. Even subdomains can vanish with organizational changes. When I was at Harvard, I offered repeated reminders that hul.harvard.edu can’t go away with the deprecation of the name “Harvard University Library/Libraries,” since it houses schemas for JHOVE and other applications. Time will tell whether it will stay.

Strictly speaking, a URI is a Uniform Resource Identifier and has no obligation to correspond to a web page; W3C says a URI as a schema identifier is only a name. In practice, treating it as a URL may be the only way to locate the XSD. When a URI uses the http scheme, it’s an invitation to use it as a URL.

Even if a domain doesn’t go away, it can be burdened with schema requests beyond its hosting capacity. The Harvard Library has been trying to get people to upgrade to the current version of JHOVE, which uses an entity resolver, but its server was, the last I checked, still heavily hit by three sites that hadn’t upgraded. They don’t pay anything, so there’s no money to put into more server capacity.

The best solution available is for software to resolve schema names to local copies (e.g. with Java’s EntityResolver). This solution often doesn’t occur to people until there’s a problem, though, and by then there may be lots of copies of the old software out in the field.
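As a rough illustration of the idea, here is a sketch in Python rather than Java, using the analogous EntityResolver interface from Python’s xml.sax module. The class name, the example.org URI, and the local path are all hypothetical, and this is not how JHOVE itself does it.

```python
from xml.sax.handler import EntityResolver


class LocalSchemaResolver(EntityResolver):
    """Map known remote identifiers to local copies (hypothetical sketch)."""

    def __init__(self, local_copies):
        # Keys are the published URIs, values are paths to bundled copies
        self.local_copies = dict(local_copies)

    def resolveEntity(self, publicId, systemId):
        # Return the local path if we have one; otherwise fall back to the
        # original identifier, which means a network fetch.
        return self.local_copies.get(systemId, systemId)


# Hypothetical URI and path, for illustration only
resolver = LocalSchemaResolver({
    "http://example.org/schemas/demo.xsd": "schemas/demo.xsd",
})

known = resolver.resolveEntity(None, "http://example.org/schemas/demo.xsd")
unknown = resolver.resolveEntity(None, "http://example.org/schemas/other.xsd")
print(known)
print(unknown)
```

An xml.sax parser can be told to use such an object with its setEntityResolver method. The fallback to the original identifier is a design choice; an archive that wants to guarantee no network access might prefer to raise an error for anything it has no local copy of.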

For archival storage, keeping a copy of any needed schema files should be a requirement. Resources inevitably disappear from the Web, including schemas. My impression is that a lot of digital archives don’t have such a rule and blithely assume that the resources will be available on the Web forever. This is a risk which could be eliminated at virtually zero cost, and it should be.

It’s legitimate to stop making a URI usable as a URL, though it may be rude. W3C’s Namespaces in XML 1.0 says: “The namespace name, to serve its intended purpose, SHOULD have the characteristics of uniqueness and persistence. It is not a goal that it be directly usable for retrieval of a schema (if any exists).” (Emphasis added) That implies that any correct application really should do its own URI resolution.

One thing that isn’t legitimate, but I’ve occasionally seen, is replacing a schema with a new and incompatible version under the same URI. That can cause serious trouble for files that use the old schema. A new version of a schema needs to have a new URI.

The schema situation creates problems for hosting sites, applications, and archives. It’s vital to remember that you can’t count on the URI’s being a valid URL in the long term.

If you’ve got one of those old versions of JHOVE (1.5 and older, I think), please upgrade. The new versions are a lot less buggy anyway.

HTML5 schedule

The HTML Working Group Chairs and the Protocols and Formats WG Chair have proposed a plan for making HTML5 a Recommendation by the end of 2014. Features would be postponed to subsequent releases as necessary.

Accomplishing this, of course, requires that the proposal be accepted by the end of 2014.

Spruce Awards: signal boost and self-promotion

Applications for SPRUCE Awards are now open.

SPRUCE will make awards of up to £5k available for further developing the practical digital preservation outcomes and/or development of digital preservation business cases, that were begun in SPRUCE events. Applications from others may also be considered, but in this case, please discuss your proposal with SPRUCE before submission. A total fund of £60k is available for making these awards, which will be allocated in a series of funding calls throughout the life of the SPRUCE Project.

The current (open) call is primarily for attendees of the SPRUCE Mashup London.

Applications must be submitted by 5 PM (GMT, I suppose) on October 10, 2012.

The self-promotion part: Awards are made to teams affiliated with institutions, but they are permitted to use outside help, since in-house developers may already be fully committed. As an independent developer with expertise in file formats and digital preservation, I’d like it known that I’m available to contract for carrying out a SPRUCE project. My business home page describes my background and skills. Paul Wheatley has told me this is a possibility, so I’m not just coming out of the blue with this offer.

My schedule may change, of course, but if you contact me on a project I’ll keep you updated on my status, and I’ll follow through in full on any commitment I make.

Format registry browser updated

I’ve posted an updated version of my file format registry browser (Zip file). It’s still very experimental, but this one makes several steps on the long path to being a useful tool.

The biggest news is that thanks to David Underdown’s input, it now talks to PRONOM. Preserv2 is probably a lost cause, since it appears no current work is being done on it and its useful results were folded back into PRONOM. This version tries to prettify results containing URIs and “@en” tags. If you don’t like that you can turn it off with a check box. The search fields have changed, and all of them now do something with all registries. The logging level can now be controlled from the config file (src/com/mcgath/regbrowser/config.properties). In some future version I’ll use a less buried config file.

Here’s my post on the first release.