Category Archives: commentary

The police body camera data problem

The Washington Post reports that some police departments are dropping body camera programs because of the expense. I’ll admit that my first gut reaction on seeing the story was that it’s just an excuse. In some cases it probably is. But it’s a fact that while the cameras are cheap, storing and managing large amounts of video data isn’t. The question needs objective examination.

Canvas fingerprinting in Web pages

The array of sneaky tricks to get past Internet users’ veil of privacy is astonishing. At least it would be, if we weren’t all past the capacity for astonishment. One which has been around for years is Canvas fingerprinting. It lets servers narrow your profile down to a small number of clients. Combined with other measures, it can uniquely identify you.

How Canvas works

Canvas wasn’t designed to spy on you. It’s a way to draw graphics very efficiently in a browser. It supports animation and interaction. In order to get fast performance, it allows hardware acceleration and doesn’t mandate the exact set of pixels to be drawn, so the same drawing commands can produce slightly different pixels on different machines. A script on the page can then read those pixels back using getImageData() or toDataURL() in the Canvas API and report the result to the server.
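Real fingerprinting code runs as JavaScript in the browser. The Python sketch below models only the last step, reducing the pixels read back to a short identifier. The two pixel buffers are made-up values standing in for the same drawing rendered on two machines, and the `fingerprint` function is a hypothetical name, not part of any API.

```python
import hashlib

def fingerprint(pixels: bytes) -> str:
    """Reduce a buffer of RGBA pixel bytes to a short, repeatable identifier."""
    return hashlib.sha256(pixels).hexdigest()[:16]

# Hypothetical 2x1 RGBA renderings of the same drawing commands on two machines.
# The second machine antialiases one pixel slightly differently.
machine_a = bytes([255, 0, 0, 255, 128, 128, 128, 255])
machine_b = bytes([255, 0, 0, 255, 129, 128, 128, 255])

fp_a = fingerprint(machine_a)
fp_b = fingerprint(machine_b)

# The same machine always yields the same value; a one-byte difference changes it.
print(fp_a == fingerprint(machine_a), fp_a != fp_b)
```

The point of the sketch is that the identifier is stable for one machine and differs between machines whose rendering differs at all, which is what makes the rendered pixels useful for profiling.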

FUIF: Yet another image format?

A tweet led me to a new file format called FUIF, which stands for “Free Universal Image Format.” Jon Sneyers describes it in a series of articles which so far includes Part 1 and Part 2.

It’s “responsive by design”; a single image file can be truncated at various offsets to produce different resolutions. Sneyers says FUIF meets JPEG’s criteria for a new format that provides “efficient coding of images with text and graphics” and “very low file size image coding.”

The great GIF pronunciation debate

Of all the issues in file formats, the pronunciation of “GIF” is surely close to the bottom in importance. When an issue is that minor, you can be sure everyone has strong opinions on it and will defend them on the barricades. It’s like the way political movements work: the closer together they are in their beliefs, the more ferociously they’ll vilify each other over little differences.

Personally, I always pronounce it in my mind with a hard “G,” as in “give” rather than “giraffe.” I’m glad to see some support for this view in “A Linguist’s Guide to Pronouncing ‘GIF’.” One of its arguments matches the main reason in my mind: the “G” stands for “graphics,” which is pronounced with a hard “G.”

Case closed. Now can we agree that “PNG” is pronounced “Pee-Enn-Gee,” and not “Ping”?

Why does one PDF display and another one download?

Sometimes when you click on a link to a PDF, it comes up in the browser. Other times, the browser downloads the file. Everyone must wonder why, but few have wondered enough to find out. Here’s a quick explanation.

It has nothing to do with the PDF version, the content of the file, or the link. It’s the HTTP headers that make the difference. Specifically, a header called “Content-Disposition” is the determining factor. If it’s absent, the browser will normally display the file itself, provided it is able to render PDFs. If it’s present, the value it specifies determines how you get the file.
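Here is a minimal Python sketch of that decision. The function name `pdf_handling` and the sample headers are made up for illustration, and a real browser also consults user settings and its own PDF support, so treat this as a model of the header logic only.

```python
from email.message import Message

def pdf_handling(headers: dict) -> str:
    """Return 'display' or 'download' based on the Content-Disposition header."""
    msg = Message()
    for name, value in headers.items():
        msg[name] = value
    # Lowercased disposition type without parameters, or None if the header is absent.
    disposition = msg.get_content_disposition()
    if disposition is None or disposition == "inline":
        return "display"
    # "attachment" (and, in this sketch, anything unrecognized) means download.
    return "download"

print(pdf_handling({"Content-Type": "application/pdf"}))
print(pdf_handling({"Content-Disposition": 'attachment; filename="report.pdf"'}))
```

The first call has no Content-Disposition header and yields "display"; the second specifies an attachment and yields "download".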

The digital preservation song challenge!

Should there be songs about digital preservation? This is just a special case of the question, “Should there be songs about X?” For nearly all X, the answer is “Yes, and there probably are!” (Even — perhaps especially — if there shouldn’t be, there are.)

Someone in the Australasian preservation community asked if AusPreserves needed a theme song. The first responses were existing popular songs, but then people started getting more creative. This led to the Digital Preservation Song Challenge!

One response was the Beyoncé parody, “All the Corrupt Files” (“Put a checksum on it”). I think it’s the first song ever to mention JHOVE!

Naturally, I already have my own song on digital preservation, called Files that Last. I wrote it to promote my book of the same title, but it stands (or falls) by itself.

If it’s worth doing, it’s worth singing about, and that certainly applies to digital preservation!

Fact-checking the GIF format

The Politifact article on the White House’s video “evidence” against reporter Jim Acosta looked plausible enough to me, until I got to the explanation of GIF files. It got significant points wrong, following common misunderstandings.

The regular readers of this blog mostly know what GIF really is, but this article may be a useful reference if you need to explain to anyone. The Politifact article says:
Continue reading

How to approach the file format validation problem

For years I wrote most of the code for JHOVE. With each format, I wrote tests for whether a file is “well-formed” and “valid.” With most formats, I never knew exactly what these terms meant. They come from XML, where they have clear meanings. A well-formed XML file has correct syntax. Angle brackets and quote marks match. Closing tags match opening tags. A valid file is well-formed and follows its schema. A file can be well-formed but not valid, but it can’t be valid without being well-formed.
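The XML distinction can be sketched in a few lines of Python. The standard library parser checks well-formedness but has no schema validator, so the “schema” here is a stand-in: a hypothetical rule that the root element must contain certain child elements. The function name `check` and the sample documents are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

def check(xml_text: str, required_children: set) -> str:
    """Classify a document as not well-formed, well-formed only, or valid."""
    try:
        root = ET.fromstring(xml_text)  # raises ParseError on bad syntax
    except ET.ParseError:
        return "not well-formed"
    present = {child.tag for child in root}
    # Stand-in for schema validation: every required child must be present.
    if required_children <= present:
        return "well-formed and valid"
    return "well-formed but not valid"

rules = {"title", "date"}
print(check("<book><title>T</book>", rules))           # closing tag mismatch
print(check("<book><title>T</title></book>", rules))   # syntax fine, date missing
print(check("<book><title>T</title><date>2018-11-20</date></book>", rules))
```

The validity check only runs once parsing succeeds, which mirrors the rule that a file can’t be valid without being well-formed.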

With most other formats, there’s no definition of these terms. JHOVE applies them anyway. (I wrote the code, but I didn’t design JHOVE’s architecture. Not my fault.) I approached them by treating “well-formed” as meaning syntactically correct, and “valid” as meaning semantically correct. Drawing the line wasn’t always easy. If a required date field is missing, is the file not well-formed or just not valid? What if the date is supposed to be in ISO 8601 format but isn’t? How much does it matter?
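To make the date question concrete, here is one arbitrary way of drawing the line, sketched in Python. It is not how JHOVE decides; the policy (a missing required field is a structural error, a malformed value is a semantic one) and the function name are assumptions chosen for illustration.

```python
from datetime import date
from typing import Optional

def classify_date_field(value: Optional[str]) -> str:
    """Apply one possible well-formed/valid policy to a required date field."""
    if value is None:
        # Assumed policy: a missing required field breaks the file's structure.
        return "not well-formed"
    try:
        date.fromisoformat(value)  # accepts ISO 8601 dates such as 2018-11-20
    except ValueError:
        # Assumed policy: the field is there, but its content is wrong.
        return "well-formed but not valid"
    return "valid"

print(classify_date_field(None))
print(classify_date_field("11/20/2018"))
print(classify_date_field("2018-11-20"))
```

Swapping the two return values in the failure branches gives an equally defensible policy, which is the difficulty in a nutshell.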

Emoji interoperability (or its lack)

Unicode characters ought to have a specific denotation, even if their exact appearance depends on the font. A letter, a punctuation mark, or a Chinese ideograph should have the same meaning to everyone who reads it. There are problems, of course. There’s no systematic difference in appearance between A, the first letter of the Roman alphabet, and Α, Alpha, the first letter of the Greek alphabet. (However, when I had my computer read this article aloud to me for proofreading, it pronounced the latter as “Greek capital letter alpha”! Nice! It also pronounced the names of the emoji in this article, except the new ones in Unicode 11.0.) In some fonts, you can’t even tell the lower case letter l from the number 1 without looking carefully. This problem allows homograph attacks and “typosquatting.”
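A short Python sketch shows how software can tell the two letters apart even when the eye can’t. The helper names and the mixed-script test are assumptions for illustration; taking the first word of a character’s Unicode name is a rough heuristic, not a real script detector.

```python
import unicodedata

def char_names(text: str) -> list:
    """Return the Unicode name of each character in the text."""
    return [unicodedata.name(ch, "UNKNOWN") for ch in text]

def looks_mixed_script(text: str) -> bool:
    """Rough heuristic: do the letters' Unicode names start with different words?"""
    prefixes = {unicodedata.name(ch, "UNKNOWN").split()[0]
                for ch in text if ch.isalpha()}
    return len(prefixes) > 1

latin_a = "A"            # U+0041
greek_alpha = "\u0391"   # U+0391, looks the same in most fonts

print(latin_a == greek_alpha)        # False: different code points
print(char_names(latin_a + greek_alpha))
print(looks_mixed_script("P" + greek_alpha + "YPAL"))
```

The names come back as LATIN CAPITAL LETTER A and GREEK CAPITAL LETTER ALPHA, and the lookalike string is flagged as mixing scripts.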

But the worst problem is with the Unicode Consortium’s great headache, emoji. These picture characters have just brief verbal descriptions in the Unicode standard, and font designers for different companies produce renderings that have vastly different connotations. Motherboard offers a sampling of the varied renderings. Here’s the “grimacing face” from Apple, Google, Samsung, and LG respectively.

Data Transfer Project: New models for interoperability

In spite of improved file standardization, interoperability of data is often a challenge. Say you’ve got a collection of pictures on Photobucket and you want to move them to a different site. You’ve got a lot of manual work ahead. It would be great if there were a tool to do it all for you. The Data Transfer Project aims at making that possible. Some big names are behind it: Facebook, Google, Microsoft, and Twitter. The basic approach is straightforward:

The DTP is powered by an ecosystem of adapters (Adapters) that convert a range of proprietary formats into a small number of canonical formats (Data Models) useful for transferring data. This allows data transfer between any two providers using the provider’s existing authorization mechanism, and allows each provider to maintain control over the security of their service.
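The adapter idea in that description can be sketched in Python. Nothing here comes from the actual Data Transfer Project code: the `Photo` model, both services, and their field names are hypothetical, chosen only to show proprietary records passing through a canonical format.

```python
from dataclasses import dataclass

@dataclass
class Photo:
    """Hypothetical canonical data model shared by all adapters."""
    title: str
    url: str

def export_from_service_a(records: list) -> list:
    """Adapter: service A's made-up proprietary records -> canonical Photos."""
    return [Photo(title=r["caption"], url=r["src"]) for r in records]

def import_to_service_b(photos: list) -> list:
    """Adapter: canonical Photos -> service B's made-up upload payloads."""
    return [{"name": p.title, "location": p.url} for p in photos]

service_a_records = [{"caption": "Sunset", "src": "https://example.com/sunset.jpg"}]
payloads = import_to_service_b(export_from_service_a(service_a_records))
print(payloads)
```

With this shape, each provider only needs adapters to and from the canonical model, rather than a separate converter for every other provider.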
