Embracing the chaos of formats
We often think of formats in terms of specifications and standards, and that can be useful. If you want to know exactly what the Latin-1 encoding is, you can consult the ISO-8859-1 standard and it will tell you. However, a specification isn’t always a reliable guide to what’s actually out there. Someone noticed that ISO-8859 reserves a block of rarely used control codes (the C1 range, 0x80–0x9F) and put additional printable characters there. That variant got codified in turn as Windows-1252 (which Microsoft misleadingly calls “ANSI,” though it was never an ANSI standard), but many other ad hoc or obscure encodings are hard or impossible to find references for.
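As a quick illustration, the same bytes decode very differently under the two encodings. Here’s a minimal Python sketch (my own example, using only the standard codecs): bytes 0x93 and 0x94 are unused control codes in Latin-1, but Windows-1252 repurposes them as curly quotation marks.

```python
# Decode the same byte string as ISO-8859-1 (Latin-1) and as Windows-1252.
# 0x93/0x94 are reserved C1 control codes in Latin-1; Windows-1252 reuses
# those slots for printable characters.

raw = b"\x93smart quotes\x94"

as_latin1 = raw.decode("latin-1")   # invisible control chars U+0093 / U+0094
as_cp1252 = raw.decode("cp1252")    # curly quotes U+201C / U+201D

print(repr(as_latin1))  # '\x93smart quotes\x94'
print(repr(as_cp1252))  # '“smart quotes”'
```

This is exactly the kind of divergence that bites you when a file claims to be Latin-1 but was actually written by Windows software.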
Earth’s official authorities refused to grant the Klingons a place in Unicode for their characters; nonetheless, there is an unofficial registry that uses part of the Unicode Private Use Area for Klingon and other constructed scripts. Is it official Unicode? No. If you use code points F8D0-F8FF, will others recognize them as Klingon characters? Sometimes.
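If you did want to recognize that convention, it amounts to a simple range check. Here’s a sketch; the range comes from the unofficial ConScript Unicode Registry, and Unicode itself guarantees nothing about these code points.

```python
# The ConScript Unicode Registry (unofficial) places Klingon pIqaD at
# U+F8D0..U+F8FF in the Private Use Area. Unicode assigns no meaning to
# these code points, so this check reflects a convention, not a standard.

KLINGON_START, KLINGON_END = 0xF8D0, 0xF8FF

def may_be_klingon(text: str) -> bool:
    """True if any code point falls in the CSUR pIqaD range."""
    return any(KLINGON_START <= ord(ch) <= KLINGON_END for ch in text)

print(may_be_klingon("\uf8d0\uf8d1"))  # True -- under the CSUR convention
print(may_be_klingon("Qapla'"))        # False -- romanized Klingon is plain ASCII
```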
I’ve written about the TIFF situation before. The TIFF 6.0 spec is an insufficient guide to today’s real-life TIFF. You have to go through scattered tech notes to understand how it’s really used.
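One small example of the gap: a header check written straight from the TIFF 6.0 spec expects the magic number 42, but files in the wild may be BigTIFF, which uses 43 and appears nowhere in the 6.0 spec. A sketch (the function name and file name are mine, for illustration):

```python
# Sniff a TIFF header. TIFF 6.0 defines a 2-byte byte-order mark
# ("II" little-endian or "MM" big-endian) followed by the 16-bit
# number 42. The later BigTIFF extension uses 43 instead.
import struct

def sniff_tiff(path: str) -> str:
    with open(path, "rb") as f:
        header = f.read(4)
    if header[:2] == b"II":
        endian = "<"            # little-endian
    elif header[:2] == b"MM":
        endian = ">"            # big-endian
    else:
        return "not a TIFF"
    (version,) = struct.unpack(endian + "H", header[2:4])
    if version == 42:
        return "classic TIFF"
    if version == 43:
        return "BigTIFF (a post-6.0 extension)"
    return "unknown TIFF-like variant"

print(sniff_tiff("scan.tif"))  # hypothetical file name
```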
Understanding situations like these requires understanding that formats don’t flow unchanged from the minds of their designers to their implementation in the world’s computers. People change things to meet their needs. This makes them more useful for some purposes; at the same time, it makes them more confusing. The only alternative would be to create a format police force with the power to arrest and punish innovators.
The situation is analogous to natural language. You can insist that anything that disagrees with the grammar books is wrong, but if everybody talks that way, there ain’t no stoppin’ it. At the same time, the grammar books put a brake on unnecessary change, keeping the language from breaking down into a thousand mutually unintelligible dialects.
Digital preservationists have to look at the actual usage of formats, not just their official specifications. This doesn’t mean they should accept every deviation, but they need to acknowledge changes that have become de facto standards. Context matters: an archive of nineteenth-century literature doesn’t have to be concerned with Klingon characters, but an archive of science fiction fan literature had better take them into account. Even an occasional scholarly paper might have a word or two in the pIqaD script.
This proliferation of variants is a big part of why centralized registries of format information don’t work. Not only is there too much information; it also keeps changing. The best we can hope for is a coordinated way to navigate a chaotic body of information.