Embracing the chaos of formats

We often think of formats in terms of specifications and standards, and this can be useful. If you want to know exactly what the Latin-1 encoding is, you can look at the ISO-8859-1 standard and it will tell you. However, this isn’t always a reliable guide to what’s out there. Someone noticed that ISO-8859-1 reserves a block of rarely used control codes (0x80 through 0x9F) and put additional printing characters there. This got codified as well, as Windows-1252 (which Microsoft falsely claims as an ANSI standard), but there are many ad hoc or obscure encodings for which references are hard or impossible to find.
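As a minimal illustration (using only Python’s standard codecs), the same bytes decode to invisible C1 control characters under ISO-8859-1 but to printable curly quotes under Windows-1252:

    # 0x93 and 0x94 sit in the 0x80-0x9F range that ISO-8859-1 reserves
    # for C1 control codes and Windows-1252 reuses for printable characters.
    raw = b"\x93smart quotes\x94"

    print(repr(raw.decode("iso-8859-1")))  # '\x93smart quotes\x94' -- control codes
    print(repr(raw.decode("cp1252")))      # '\u201csmart quotes\u201d' -- curly quotation marks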

Earth’s official authorities refused to grant the Klingons a place in Unicode for their characters; nonetheless, there is an unofficial registry that uses part of the Unicode Private Use Area for Klingon and other constructed scripts. Is it official Unicode? No. If you use code points U+F8D0 through U+F8FF, will others recognize them as Klingon characters? Sometimes.
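Unicode itself can’t help you here; to the standard, that range is just anonymous private-use space, and the Klingon assignment exists only in the unofficial ConScript registry. A quick sketch of what a program actually sees:

    import unicodedata

    # U+F8D0 through U+F8FF: Klingon pIqaD according to the unofficial
    # ConScript Unicode Registry; to Unicode itself, just private-use space.
    KLINGON_PUA = range(0xF8D0, 0xF900)

    ch = "\uf8d0"
    print(unicodedata.category(ch))   # 'Co' -- private use, no character name
    print(ord(ch) in KLINGON_PUA)     # True, but only by local convention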

I’ve written about the TIFF situation before. The TIFF 6.0 spec is an insufficient guide to today’s real-life TIFF. You have to go through scattered tech notes to understand how it’s really used.
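A small example of how little the spec’s own header tells you: the first four bytes give the byte order and a version word, and a version of 43 signals BigTIFF, an extension the TIFF 6.0 spec never mentions. This sketch only peeks at the header (the file name is hypothetical):

    import struct

    def sniff_tiff(path):
        """Peek at a TIFF header: byte order mark plus version word.

        Classic TIFF uses version 42; BigTIFF, an extension the TIFF 6.0
        spec never mentions, uses 43.
        """
        with open(path, "rb") as f:
            header = f.read(4)
        if header[:2] == b"II":
            order = "<"   # little-endian ("Intel")
        elif header[:2] == b"MM":
            order = ">"   # big-endian ("Motorola")
        else:
            return "not a TIFF"
        (version,) = struct.unpack(order + "H", header[2:4])
        return {42: "classic TIFF", 43: "BigTIFF"}.get(version, "unknown variant")

    # print(sniff_tiff("scan.tif"))   # hypothetical file name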

Understanding situations like these requires understanding that formats don’t flow unchanged from the minds of their designers to their implementation in the world’s computers. People change things to meet their needs. This makes them more useful for some purposes; at the same time, it makes them more confusing. The only alternative would be to create a format police force with the power to arrest and punish innovators.

The situation is analogous to natural language. You can insist that anything that disagrees with the grammar books is wrong, but if everybody talks that way, there ain’t no stoppin’ it. At the same time, the grammar books put a brake on unnecessary change, keeping the language from breaking down into a thousand mutually unintelligible dialects.

Digital preservationists have to look at the actual usage of formats, not just their official specifications. This doesn’t mean that they should accept every deviation, but they need to acknowledge changes that have become de facto standards. Context matters; an archive of nineteenth-century literature doesn’t have to be concerned with Klingon characters, but an archive of science fiction fan literature had better take them into account. Even an occasional scholarly paper might have a word or two in the pIqaD script.

This proliferation of variants is a big part of why centralized registries of format information don’t work. Not only is there too much information, it keeps changing. The best we can hope for is a coordinated way of finding our way through a chaotic body of information.

One response to “Embracing the chaos of formats”

  1. Hi Gary,

    Great post. I completely agree that
    “Digital preservationists have to look at the actual usage of formats, not just their official specifications.”
    and that
    “they need to acknowledge changes that have become de facto standards”.
    Just the other day I had a discussion with the group on the #justsolve IRC channel about what defines a format. I was suggesting that potentially every piece of software will write out every format slightly differently if it uses different code.
    We agreed that the best approach to dealing with this is to document the cases where we know there is a change that warrants “de-facto standard” status, e.g. ODS files created by Excel 2007 (http://justsolve.archiveteam.org/index.php/ODS), and for now not assume that every piece of software will write them out differently. I tend to think we should assume they do until we know they don’t, as that is less risky in the longer term; however, doing that would also involve the creation of many, many, many more de-facto standards, so we went with the option outlined.
    I suspect that in many cases it will be the code for the software that writes out the files that will provide the best documentation of the formats.
    Regarding the potential issue of stymying innovation, I referred to that (briefly) in my answer here as well. It has led me to believe that one consequence of the complexity among possible format implementations is that migration strategies will turn out to be too expensive to implement and verify; an emulation-based strategy will likely be more feasible.

    Thanks again for the interesting post,

    Euan