When is an algorithm not an algorithm?

The only time the news media use the term “algorithm,” it seems, is for computational methods that aren’t.

Merriam-Webster defines it as “a procedure for solving a mathematical problem (as of finding the greatest common divisor) in a finite number of steps that frequently involves repetition of an operation.” Let’s forget about repetition; almost every computational procedure uses loops. The key word is “mathematical.”

An algorithm produces results that can be mathematically verified. An algorithm for calculating pi will produce the known value to the needed level of precision, or it's wrong. A search algorithm is truly an algorithm only when its results correspond to precise matching criteria.
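Since the dictionary's own example is finding the greatest common divisor, here's a minimal sketch of Euclid's algorithm in Python (my illustration, not Merriam-Webster's): one operation repeated a finite number of times, with a result you can verify mathematically.

```python
def gcd(a: int, b: int) -> int:
    """Euclid's algorithm: repeat one operation until the remainder is zero."""
    while b:
        a, b = b, a % b
    return abs(a)

assert gcd(48, 36) == 12   # verifiable: 12 divides both, and no larger number does
assert gcd(17, 5) == 1     # coprime inputs
```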

The decline and fall of Adobe Flash

It’s been a year since I last posted about Adobe Flash’s impending demise. Like everything else on the Internet, it won’t ever vanish completely, but its decline is accelerating.

Olympic file format capriciousness

This blog doesn’t generally deal with cronyist bullying operations like the International Olympic Committee (IOC). But when the IOC gets silly about the file formats it tells people they can’t use, that’s a subject worth mentioning here.

The IOC has decreed that “the use of Olympic Material transformed into graphic animated formats such as animated GIFs (i.e. GIFV), GFY, WebM, or short video formats such as Vines and others, is expressly prohibited.”

Newspeak, emoji style

In Orwell’s 1984, the Newspeak language followed the principle that if you can abolish certain words, you can abolish the thoughts that go with them.

It was intended that when Newspeak had been adopted once and for all and Oldspeak forgotten, a heretical thought — that is, a thought diverging from the principles of Ingsoc — should be literally unthinkable, at least so far as thought is dependent on words. … This was done partly by the invention of new words, but chiefly by eliminating undesirable words and by stripping such words as remained of unorthodox meanings, and so far as possible of all secondary meanings whatever.

Apple is doing something like this with Unicode codepoint U+1F52B (🔫), which the code chart defines as PISTOL, with the explanatory text of “handgun, revolver.” There’s nothing that suggests it’s supposed to represent a water gun or any other kind of toy. However, Apple has elected to represent this character as a water pistol in iOS 10.
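You can check the code chart’s naming for yourself from any copy of the Unicode character database; for instance, with Python’s unicodedata module (just an illustration of the official character name, not of how any vendor chooses to draw it):

```python
import unicodedata

ch = '\U0001F52B'
print(hex(ord(ch)))          # 0x1f52b
print(unicodedata.name(ch))  # PISTOL -- the formal character name; nothing about toys
```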

Work on TI/A quietly continues

The work on the TI/A project, to define an archive-friendly version of TIFF analogous to PDF/A, is still going on, even though hardly any of it is publicly visible. Marisa Pfister’s departure from the project, and from her position at the University of Basel, was unfortunate, but others are continuing a detailed analysis of TIFF files used at various archives. This will help them learn which features and tags are actually in use.

The target of March 1, 2016, for a submission to ISO has been crossed out, and nothing has replaced it, but we can still hope it will happen.

The persistence of old formats

Technologies develop to a point where they’re good enough for widespread use. Once a lot of people have adopted them, it’s hard to move on to something still better, since people have invested so much in a technology that works for them. We see this with cell phone communication, which is pretty good but would undoubtedly be much better if it could be reinvented from scratch today. We see it with the DVD format, which Blu-ray hasn’t managed to push aside in spite of huge marketing efforts. And we see it in file formats.

Most of today’s highly popular formats have been around since the nineties. For images, we still have TIFF, JPEG, PNG, and even the primitive GIF format, which goes back to the eighties. In audio, MP3 still dominates, even though there are now much better alternatives.

This is a good thing in many ways. If new, improved formats displaced old ones every five years, we’d be constantly investing in new software, and anyone who didn’t upgrade would be unable to read a lot of new files. Digital preservation would be a big headache, as archivists would need to migrate files repeatedly to avoid obsolescence.

It does mean, though, that we’re working with formats whose deficiencies have often grown in importance. JPEG compression isn’t nearly as good as what modern techniques can manage. MP3 is encumbered with patents and offers sound quality that’s inferior to other lossy audio formats. HTML has improved through major revisions, but it’s still a mess to validate. For that matter, we have formats like “English,” which lacks any spec and is a pile of kludges that have accumulated over centuries. Try finding support anywhere for supposed improvements such as Esperanto.

It’s a situation we just have to live with. The good enough hangs on, and the better has a hard time getting acceptance. Considering how unstable the world of data would be if this weren’t the case, it’s a good thing on the whole.

The steep road to supporting the PDF format

A lot of applications claim they can display PDF files, but not all of them fully support the format. They won’t necessarily display all valid files correctly. The PDF Association has an article discussing this problem, with the main focus on the Microsoft Edge browser.

Edge offers only partial support for the JBIG2Decode and JPXDecode filters, which means some objects might not display. It doesn’t support certain types of shadings, so other objects could render incorrectly.
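As a rough way to flag files that might trip up such viewers, you can scan a PDF for those filter names. The sketch below is my own crude heuristic, not anything from the PDF Association’s article; it searches the raw bytes, so it will miss filter entries hidden inside compressed object streams.

```python
import sys

SUSPECT_FILTERS = [b'/JBIG2Decode', b'/JPXDecode']  # only partially supported in some viewers

def check_pdf(path):
    with open(path, 'rb') as f:
        data = f.read()
    found = [name.decode() for name in SUSPECT_FILTERS if name in data]
    if found:
        print(f'{path}: uses {", ".join(found)} -- may not render everywhere')
    else:
        print(f'{path}: no suspect filters found (raw byte scan only)')

if __name__ == '__main__':
    for p in sys.argv[1:]:
        check_pdf(p)
```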

The strength of PDF is supposed to be that it will render the same way everywhere. You can blame Microsoft for not putting enough work into it, or Adobe for making the format too complex. I have enough experience with it to know it’s a seriously difficult format just to analyze, to say nothing of rendering. Is a format that presents such difficulties really suitable as the universal document rendering format people will count on far into the future?

Update: It gets worse. Take a look at this discussion of what’s in PDF.

Unicode 9.0

The Unicode Consortium has announced the release of Unicode 9.0. It adds character sets for some little-known languages, including Osage, Nepal Bhasa, Fulani, the Bravanese dialect of Swahili, the Warsh orthography for Arabic, and Tangut. It updates the collation specification and security recommendations.

Most Unicode implementations will require just font upgrades, but full support of some of the more unusual scripts will require attention to the migration notes.

“Asymmetric case mapping” sounds interesting. I believe this means that the conversion between upper case and lower case isn’t one-to-one and reversible. The notes give the example of “the asymmetric case mapping of Greek final sigma to capital sigma.” Lowercase sigma has two forms; it’s σ except at the end of a word, where it’s ς. Both turn into Σ in uppercase.
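A quick way to see the asymmetry is to round-trip the characters; for example, in Python 3 (my illustration, not an example from the Unicode notes):

```python
# Both lowercase sigmas map to the same capital, so uppercasing loses the distinction.
assert 'σ'.upper() == 'Σ'
assert 'ς'.upper() == 'Σ'

# Coming back down, lower() has to pick a form -- the round trip isn't one-to-one.
print('Σ'.lower())           # 'σ': a lone capital sigma becomes the medial form
print('ΟΔΥΣΣΕΥΣ'.lower())    # 'οδυσσευς': in context, the word-final form is chosen
```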

What really has people excited about Unicode 9, if a Startpage search is any indication, isn’t any of these things, but that about 1% of the new characters are emoji and that Apple and Microsoft lobbied against one candidate emoji. I wonder if the Unicode Consortium regrets having gotten involved in that mess in the first place. There are no possible criteria except whims for what the set should include. There’s no limit on how many could be added. OK, having a universal set of encodings promotes information interchange, but the tail is wagging the 🐕.

By the way, what’s the plural of “emoji”? I use “emoji” as both singular and plural, but I’m seeing “emojis” with increasing frequency. It just looks wrong to me. Does anyone say “kanjis” or “romajis” for the other Japanese character sets? I had to argue with the editor to keep the title of my article “The War on Emoji” that way.

Don’t hide those file extensions!

Lately I’ve ghostwritten several pieces on Internet security and how to protect yourself against malicious files. One point comes up over and over: Don’t hide file extensions! If you get a file called Evilware.pdf.exe, then Microsoft thinks you should see it as Evilware.pdf. The default setting on Windows conceals file extensions from you; you have to change a setting to view files by their actual names.
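If you’d rather flip that setting programmatically than dig through Explorer’s options, it lives in the registry as the well-known HideFileExt value. Here’s a minimal sketch in Python using the standard winreg module (Windows only; my illustration, and Explorer typically needs to be restarted before the change shows up):

```python
import winreg  # standard library, Windows only

KEY_PATH = r"Software\Microsoft\Windows\CurrentVersion\Explorer\Advanced"

def show_file_extensions():
    """Set Explorer's HideFileExt value to 0 so files appear under their real names."""
    with winreg.OpenKey(winreg.HKEY_CURRENT_USER, KEY_PATH, 0,
                        winreg.KEY_READ | winreg.KEY_SET_VALUE) as key:
        current, _ = winreg.QueryValueEx(key, "HideFileExt")
        print("HideFileExt was", current)  # 1 = extensions hidden (the default), 0 = shown
        winreg.SetValueEx(key, "HideFileExt", 0, winreg.REG_DWORD, 0)
        print("Extensions will be visible after Explorer restarts")

if __name__ == "__main__":
    show_file_extensions()
```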

What’s this supposed to accomplish, besides making you think executable files are just documents? I keep seeing vague statements that this somehow “simplifies” things for users. If they see a file called “Document.pdf,” Microsoft’s marketing department thinks people will say, “What’s that .pdf at the end of the name? This is too bewildering and technical for me! I give up on this computer!”

They also seem to think that when people run a .exe file, not knowing it is one because the extension is hidden, and it turns out to be ransomware that encrypts all the files on the computer, that’s a reasonable price to pay for making file names look simpler. It’s always marketing departments that are to blame for this kind of stupidity; I’m sure the engineers know better.

APFS, Apple’s replacement for HFS+

Apple is introducing a new file system to replace the twentieth-century HFS+. The new one is called APFS, which simply stands for “Apple File System.” When Apple released HFS+, disk sizes were measured in megabytes, not terabytes.

New features include 64-bit inode numbers, nanosecond timestamp granularity, and native support for encryption. Ars Technica offers a discussion of the system, which is still in an experimental state.