Category Archives: commentary

Figuring out the PDF version is harder than you think

In a GitHub comment, Johan van der Knijff noted how messy it is to determine the version of a PDF file. He looked at a file with the header characters “%PDF-1.8”. DROID says this isn’t a PDF file at all.

By a strict reading of the PDF specification, it isn’t. The version number has to be in the range 1.0 through 1.7. Being this strict seems like a bad idea, since format recognition software would then fail to recognize any future version of the format. (JHOVE doesn’t care what character comes after the period.)
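To make the trade-off concrete, here is a minimal Python sketch of a lenient header check. The function name `sniff_pdf_version` and the choice to accept any single digit on either side of the period are assumptions made for illustration. This is not how DROID or JHOVE actually implement their checks.

```python
import re

# Lenient pattern: "%PDF-", one digit, a period, one digit.
# The strict range from the spec (1.0 through 1.7) is checked separately,
# so an unknown version is reported rather than rejected outright.
HEADER = re.compile(rb"%PDF-(\d)\.(\d)")


def sniff_pdf_version(data: bytes):
    """Return (major, minor, strictly_valid), or None if there is no PDF header."""
    match = HEADER.match(data)
    if match is None:
        return None
    major, minor = int(match.group(1)), int(match.group(2))
    strictly_valid = major == 1 and 0 <= minor <= 7
    return major, minor, strictly_valid
```

With this approach a file starting with “%PDF-1.8” is still recognized as a PDF, and the caller can decide what to do about the out-of-range version.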

Klingon vs. Emoji in Unicode

In 2001, the Unicode Consortium rejected a proposal to encode the Klingon script. The reasons it gave were:

Lack of evidence of usage in published literature, lack of organized community interest in its standardization, no resolution of potential trademark and copyright issues, question about its status as a cipher rather than a script, and so on.

Fair enough, but don’t most of these objections apply equally to emoji?

How big is BigTIFF?

TIFF is a very popular image format, but it can’t handle really huge files. “Really huge” means files bigger than 4 gigabytes, or more precisely, files in which some data offset can’t be represented in 32 bits. That’s not a limitation that comes up often, but some applications, such as medical scans, need enough detail to push the limit.

A dozen years ago, members of the TIFF community at AWare Systems came up with a simple idea: Create a variant of TIFF with 64-bit offsets instead of 32 bits. The result was BigTIFF.
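As a rough illustration, the two variants can be told apart from the first four bytes of a file. The sketch below assumes the usual header layout, which isn't spelled out in this post: a byte-order mark of “II” or “MM”, followed by a 16-bit version number that is 42 for classic TIFF and 43 for BigTIFF. The names `tiff_flavor` and `MAX_CLASSIC_OFFSET` are made up for the example.

```python
import struct

# Largest offset a classic TIFF can express with its 32-bit fields.
MAX_CLASSIC_OFFSET = 2**32 - 1


def tiff_flavor(header: bytes):
    """Classify the start of a file as 'TIFF', 'BigTIFF', or None."""
    if len(header) < 4:
        return None
    order = header[:2]
    if order == b"II":      # little-endian
        endian = "<"
    elif order == b"MM":    # big-endian
        endian = ">"
    else:
        return None
    (version,) = struct.unpack(endian + "H", header[2:4])
    if version == 42:
        return "TIFF"
    if version == 43:
        return "BigTIFF"
    return None
```

Any offset larger than `MAX_CLASSIC_OFFSET` (4,294,967,295) is what pushes a file out of classic TIFF territory.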

The strange state of “open” format documentation

You can legally download many specs from the ISO site, including the Open Document Format (ODF) specs. ISO lets you print out a copy. However, if you photocopy or scan it, or if you make it available on your organization’s LAN, the Copyright Police will haul you away.

I’ve seen similar restrictions elsewhere. They’re variations on the idea that you can download a document for free, but you can’t share it after you download it. It’s bizarre.

Maybe they’re trying to keep people from going into competition by selling copies of their standards. Since ISO also sells what it publishes, the goal would make sense. In fact, there’s a specific and emphatic prohibition on sales. But why they should care whether copies are printed or photocopied is beyond me.

Usually the answer to questions like these is “lawyers who are disconnected from reality.” If there’s a better answer, I’d love to hear it.

The Bitcoin blockchain format

The Bitcoin cryptocurrency depends on security and confidence. If a flaw in the design broke its trust or usability, the whole system would collapse.

It’s strange, then, that Bitcoin doesn’t have a specification. This is considered a feature, not a bug:
Continue reading

When is an algorithm not an algorithm?

The only time the news media use the term “algorithm,” it seems, is for computational methods that aren’t.

Merriam-Webster defines it as “a procedure for solving a mathematical problem (as of finding the greatest common divisor) in a finite number of steps that frequently involves repetition of an operation.” Let’s forget about repetition; almost every computational procedure uses loops. The key word is “mathematical.”

An algorithm produces results that can be mathematically verified. An algorithm for calculating pi will produce the known value to the needed level of precision, or it’s wrong. A search algorithm is an algorithm when its results correspond to precise matching criteria.
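The dictionary’s own example, finding the greatest common divisor, shows what “mathematically verified” means in practice. Here is a small Python sketch of Euclid’s method. The helper `is_gcd` is a hypothetical checker added for illustration: it confirms the answer against the definition instead of trusting the procedure.

```python
def gcd(a: int, b: int) -> int:
    """Euclid's algorithm: replace (a, b) with (b, a mod b) until b is zero."""
    while b:
        a, b = b, a % b
    return abs(a)


def is_gcd(g: int, a: int, b: int) -> bool:
    """Check g against the definition for positive a and b:
    g divides both, and no larger number does."""
    if g <= 0 or a % g or b % g:
        return False
    return all(a % d or b % d for d in range(g + 1, min(a, b) + 1))
```

Either the result passes a check like `is_gcd(gcd(48, 18), 48, 18)` or the procedure is wrong. There’s no room for “close enough.”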

The decline and fall of Adobe Flash

It’s been a year since I last posted about Adobe Flash’s impending demise. Like everything else on the Internet, it won’t ever vanish completely, but its decline is accelerating.

Olympic file format capriciousness

This blog doesn’t generally deal with cronyist bullying operations like the International Olympic Committee (IOC). But when the IOC gets silly about the file formats it tells people they can’t use, that’s a subject worth mentioning here.

The IOC has decreed that “the use of Olympic Material transformed into graphic animated formats such as animated GIFs (i.e. GIFV), GFY, WebM, or short video formats such as Vines and others, is expressly prohibited.”

Newspeak, emoji style

In Orwell’s 1984, the Newspeak language followed the principle that if you can abolish certain words, you can abolish the thoughts that go with them.

It was intended that when Newspeak had been adopted once and for all and Oldspeak forgotten, a heretical thought — that is, a thought diverging from the principles of Ingsoc — should be literally unthinkable, at least so far as thought is dependent on words. … This was done partly by the invention of new words, but chiefly by eliminating undesirable words and by stripping such words as remained of unorthodox meanings, and so far as possible of all secondary meanings whatever.

Apple is doing something like this with Unicode codepoint U+1F52B (🔫), which the code chart defines as PISTOL, with the explanatory text of “handgun, revolver.” There’s nothing that suggests it’s supposed to represent a water gun or any other kind of toy. However, Apple has elected to represent this character as a water pistol in iOS 10.
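You can check the character’s official name yourself. This short Python sketch uses the standard library’s `unicodedata` module. The helper `describe` is just a name I made up for the example.

```python
import unicodedata

PISTOL = "\U0001F52B"


def describe(ch: str) -> str:
    """Format a character as 'U+XXXX NAME' using the Unicode character database."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


print(describe(PISTOL))  # prints "U+1F52B PISTOL"
```

The name comes from the Unicode character database and is the same on every platform. What the glyph looks like is up to whoever draws the font.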

The persistence of old formats

Technologies develop to a point where they’re good enough for widespread use. Once a lot of people have adopted them, it’s hard to move on from there to a still better one, since people have invested so much in a technology that works for them. We see this with cell phone communication, which is pretty good but would undoubtedly be much better if it could be reinvented from scratch today. We see it with the DVD format, which Blu-ray hasn’t managed to push aside in spite of huge marketing efforts. And we see it in file formats.

Most of today’s highly popular formats have been around since the nineties. For images, we still have TIFF, JPEG, PNG, and even the primitive GIF format, which goes back to the eighties. In audio, MP3 still dominates, even though there are now much better alternatives.

This is a good thing in many ways. If new, improved formats displaced old ones every five years, we’d be constantly investing in new software, and anyone who didn’t upgrade would be unable to read a lot of new files. Digital preservation would be a big headache, as archivists would need to migrate files repeatedly to avoid obsolescence.

It does mean, though, that we’re working with formats whose deficiencies have often grown in importance. JPEG compression isn’t nearly as good as what modern techniques can manage. MP3 is encumbered with patents and offers sound quality that’s inferior to other lossy audio formats. HTML has improved through major revisions, but it’s still a mess to validate. For that matter, we have formats like “English,” which lacks any spec and is a pile of kludges that have accumulated over centuries. Try finding support for supposed improvements such as Esperanto anywhere.

It’s a situation we just have to live with. The good enough hangs on, and the better has a hard time getting acceptance. Considering how unstable the world of data would be if this weren’t the case, it’s a good thing on the whole.