A field guide to “plain text”

In some ways, plain text is the best preservation format. It’s simple and easily identified. It’s resilient when damaged; if a file is half corrupted, the other half is still readable. There’s just one little problem: What exactly is plain text?

ASCII is OK for English, if you don’t have any accented words, typographic quotes, or fancy punctuation. It doesn’t work very well for any other language. It even has problems outside the US, such as the lack of a pound sterling symbol; there’s a reason some people prefer the name US-ASCII. You’ll often find that supposed “ASCII” text has characters outside the 7-bit range, just enough of them to throw you off. Once this happens, it can be very hard to tell what encoding you’ve got.
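As a rough illustration of how to spot those stray characters, the sketch below lists the offsets of every byte with the high bit set. The function name and the sample bytes are made up for this example; only the test itself (a byte value of 0x80 or above) comes from the definition of 7-bit text.

```python
def non_ascii_offsets(data: bytes) -> list[int]:
    """Return the offsets of all bytes outside the 7-bit range (0x80-0xFF)."""
    return [i for i, b in enumerate(data) if b >= 0x80]


# Hypothetical sample: "ASCII" text with two stray 8-bit bytes around a word.
sample = b"plain text \x93quoted\x94"
print(non_ascii_offsets(sample))  # [11, 18]
```

Knowing where the offending bytes are doesn't say which encoding produced them, but it does show how few of them there are and what surrounds them, which is usually the first clue.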

Even if text looks like ASCII and doesn’t have any high bits set, it could be in one of the other encodings of the ISO 646 family. These haven’t been used much since ISO 8859 came out in the late eighties, but you can still run into old text documents that use them. Since all the members of the family are seven-bit codes and differ from ASCII in just a few characters, it’s easy to mistake, say, a French ISO-646 file for ASCII and turn all the accented e’s into curly braces. (I won’t get into prehistoric codes like EBCDIC, which at least can’t be mistaken for anything else.)
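The curly-brace mix-up can be sketched in a few lines. Python has no built-in codec for the French ISO 646 variant, so the table below is a hand-written, deliberately partial mapping of a few positions where that variant is understood to differ from ASCII. Treat both the table and the function name as assumptions to check against the standard before relying on them.

```python
# Partial, hand-written table (an assumption, not a complete codec):
# code positions where the French ISO 646 variant differs from ASCII.
FRENCH_646 = {
    0x40: "à",  # ASCII @
    0x5C: "ç",  # ASCII backslash
    0x7B: "é",  # ASCII {
    0x7C: "ù",  # ASCII |
    0x7D: "è",  # ASCII }
}


def decode_french_646(data: bytes) -> str:
    """Decode 7-bit bytes as ASCII, then swap in the French characters."""
    return data.decode("ascii").translate(FRENCH_646)


raw = b"{t{ et cr}me"
print(raw.decode("ascii"))     # what an ASCII reader sees
print(decode_french_646(raw))  # what the French variant means
```

Both decodings succeed without error, which is exactly the problem: nothing in the bytes themselves says which reading is right.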

The ISO 8859 encodings have the same problem, pushed to the 8-bit level. If you’re in the US or western Europe and come upon 8-bit text which doesn’t work as UTF-8, you’re likely to assume it’s ISO 8859-1, aka Latin-1. There are, however, over a dozen variants of 8859. Some are very different in codes above 127, but some have only a few differences. ISO 8859-9 (Latin-5 or “Turkish Latin-1”) and ISO 8859-15 (Latin-9) each differ from Latin-1 in only a handful of characters. Microsoft added to the confusion with the Windows 1252 encoding, which turns some of the control codes in Latin-1 (the 0x80 to 0x9F range) into printing characters. It used to be common to claim 1252 was an ANSI standard, even though it never was.
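To see how close these encodings are, it helps to decode the same bytes under each one. The sketch below uses Python's standard codec names; the helper name and the three sample bytes are chosen for illustration, picked because they are among the positions where these encodings disagree.

```python
# Python's standard codec names for the encodings discussed above.
CANDIDATES = ("latin-1", "cp1252", "iso8859-9", "iso8859-15")


def decode_all(data: bytes) -> dict[str, str]:
    """Decode the same bytes under each candidate encoding."""
    results = {}
    for name in CANDIDATES:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            # cp1252 leaves a few byte values undefined.
            results[name] = "<undecodable>"
    return results


# 0xA4 is the currency sign in Latin-1 but the euro sign in Latin-9;
# 0xFD is y-acute in Latin-1 but dotless i in Latin-5;
# 0x80 is a control code in the 8859 family but the euro sign in 1252.
for name, text in decode_all(b"\xa4 \xfd \x80").items():
    print(f"{name:11} {text!r}")
```

Every candidate decodes these bytes without complaint, so picking among them takes knowledge of the text's language and origin, not just a decoder.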

UTF-8, even without a byte order mark (BOM), has a good chance of being recognized without a lot of false positives; if a text file has characters with the high bit set and an attempt to decode it as UTF-8 doesn’t result in errors, it most likely is UTF-8. (I’m not discussing UTF-16 and UTF-32 here because they don’t look at all ASCII-like.) Even so, some ISO 8859 files can look like good UTF-8 and vice versa.
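The trial-decoding test described above is easy to sketch. The function name and the labels it returns are invented for this example, and it is a heuristic, not a real detector: it only separates pure 7-bit text, valid UTF-8, and "some other 8-bit encoding".

```python
def guess_encoding(data: bytes) -> str:
    """Rough heuristic: 7-bit only, valid UTF-8, or some other 8-bit text."""
    if all(b < 0x80 for b in data):
        return "ascii"  # or another ISO 646 variant; the bytes can't tell us
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "unknown-8bit"
    return "utf-8"


print(guess_encoding("café".encode("utf-8")))    # utf-8
print(guess_encoding("café".encode("latin-1")))  # unknown-8bit
# A false positive: the two Latin-1 characters "Ã©" are the bytes C3 A9,
# which is also the valid UTF-8 sequence for "é".
print(guess_encoding("Ã©".encode("latin-1")))    # utf-8
```

The last case is contrived, but it shows why a clean UTF-8 decode is strong evidence and still not proof.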

So plain text is really simple — or maybe not.

Unicode

Words: Gary McGath, Copyright 2003
Music: Shel Silverstein, “The Unicorn”

A long time ago, on the old machines,
There were more kinds of characters than you’ve ever seen.
Nobody could tell just which set they had to load,
They wished that somehow they could have one kind of code.

   There was US-ASCII, simplified Chinese,
   Arabic and Hebrew and Vietnamese,
   And Latin-1 and Latin-2, but don’t feel snowed;
   We’ll put them all together into Unicode.

The users saw this Babel and it made them blue,
So a big consortium said, “This is what we’ll do:
We will take this pile of sets and give each one its place,
Using sixteen bits or thirty-two, we’ve lots of space

   For the US-ASCII, simplified Chinese,
   Arabic and Hebrew and Vietnamese,
   And Latin-1 and Latin-2, we’ll let them load
   In a big set of characters called Unicode.”

The Klingons arrived when they heard the call,
And they saw the sets of characters, both big and small.
They said to the consortium, “Here’s what we want:
Just a little bit of space for the Klingon font.”

   “You’ve got US-ASCII, simplified Chinese,
   Arabic and Hebrew and Vietnamese,
   And Latin-1 and Latin-2, but we’ll explode
   You if you don’t put Klingon characters in Unicode.”

The Unicode Consortium just shook their heads,
Though the looks that they were getting caused a sense of dread.
“The set that we’ve assembled is for use on Earth,
And a foreign planet is the Klingons’ place of birth.”

   We’ve got US-ASCII, simplified Chinese,
   Arabic and Hebrew and Vietnamese,
   And Latin-1 and Latin-2, but you can’t goad
   Us into putting Klingon characters in Unicode.

The Klingons grew as angry as a minotaur;
They went back to their spaceship and declared a war.
Three hundred years ago this happened, but they say
That’s why the Klingons still despise the Earth today.

   We’ve got US-ASCII, simplified Chinese,
   Tellarite and Vulcan and Vietnamese,
   And Latin-1 and Latin-2, but we’ll be blowed
   If we’ll put the Klingon language into Unicode.
