Category Archives: commentary

The persistence of old formats

Technologies develop to a point where they’re good enough for widespread use. Once a lot of people have adopted a technology, it’s hard to move them to a still better one; they’ve invested too much in something that already works for them. We see this with cell phone communication, which is pretty good but would undoubtedly be much better if it could be invented all over again today. We see it with the DVD format, which Blu-ray hasn’t managed to push aside in spite of huge marketing efforts. And we see it in file formats.

Most of today’s highly popular formats have been around since the nineties. For images, we still have TIFF, JPEG, PNG, and even the primitive GIF format, which goes back to the eighties. In audio, MP3 still dominates, even though there are now much better alternatives.

This is a good thing in many ways. If new, improved formats displaced old ones every five years, we’d be constantly investing in new software, and anyone who didn’t upgrade would be unable to read a lot of new files. Digital preservation would be a big headache, as archivists would need to migrate files repeatedly to avoid obsolescence.

It does mean, though, that we’re working with formats whose deficiencies have often grown more serious over time. JPEG compression isn’t nearly as good as what modern techniques can manage. MP3 is encumbered with patents and offers sound quality inferior to other lossy audio formats. HTML has improved through major revisions, but it’s still a mess to validate. For that matter, we have formats like “English,” which lacks any spec and is a pile of kludges that have accumulated over centuries. Try finding support for supposed improvements such as Esperanto anywhere.

It’s a situation we just have to live with. The good enough hangs on, and the better has a hard time getting acceptance. Considering how unstable the world of data would be if this weren’t the case, it’s a good thing on the whole.

Don’t hide those file extensions!

Lately I’ve ghostwritten several pieces on Internet security and how to protect yourself against malicious files. One point comes up over and over: Don’t hide file extensions! If you get a file called Evilware.pdf.exe, Microsoft thinks you should see it as Evilware.pdf. Windows conceals file extensions by default; you have to change a setting to see files by their actual names.
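
Here’s a small Python sketch of why this matters. The filenames are hypothetical, and the “display” function just imitates the idea of hiding the final extension, not Explorer’s actual behavior:

```python
from pathlib import Path

RISKY_EXTENSIONS = {".exe", ".scr", ".com", ".bat", ".js"}

def displayed_name(filename: str) -> str:
    """Hide the final extension, the way the default setting does."""
    return Path(filename).stem  # "Evilware.pdf.exe" displays as "Evilware.pdf"

def looks_suspicious(filename: str) -> bool:
    """Flag an executable extension hiding behind a document-like name."""
    path = Path(filename)
    return path.suffix.lower() in RISKY_EXTENSIONS and Path(path.stem).suffix != ""

for name in ("Evilware.pdf.exe", "Report.pdf"):
    print(f"{name!r} shows as {displayed_name(name)!r}, suspicious: {looks_suspicious(name)}")
```

With extensions hidden, the first file is indistinguishable on screen from the second; that’s the whole problem.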

What’s this supposed to accomplish, besides making you think executable files are just documents? I keep seeing vague statements that it somehow “simplifies” things for users. Microsoft’s marketing department apparently thinks that people who see a file called “Document.pdf” will say, “What’s that .pdf at the end of the name? This is too bewildering and technical for me! I give up on this computer!”

They also seem to think that when someone runs a .exe file without realizing what it is, because the extension is hidden, and it turns out to be ransomware that encrypts every file on the computer, that’s a reasonable price to pay for simpler-looking file names. It’s always marketing departments that are to blame for this kind of stupidity; I’m sure the engineers know better.

PDF/A and forms

The PDF Association reminds us that we can use PDF forms for electronic submissions. It’s a useful feature, and I’ve filled out PDF forms now and then. However, one point seems wrong to me:

PDF/A, the archival subset of PDF technology, provides a means of ensuring the quality and usability of conforming PDF pages (including PDF forms) without any external dependencies. PDF/A offers implementers the confidence of knowing that conforming documents and forms will be readable 10, 20 or 200 years from now.

The problem is that PDF/A doesn’t allow form actions. ISO 19005-1 says, “Interactive form fields shall not perform actions of any type.” You can have a form and you can print it, but without being able to perform the submit-form action, it isn’t useful for digital submissions.
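
As a rough illustration, here’s how you might look for the field actions that would disqualify a form from PDF/A-1. This is a sketch, not a validator: it assumes the pikepdf library, checks only top-level AcroForm fields, and the file name is made up.

```python
import pikepdf  # assumption: pikepdf is installed (pip install pikepdf)

def fields_with_actions(path: str) -> list[str]:
    """List AcroForm fields carrying /A or /AA action entries,
    which ISO 19005-1 forbids for interactive form fields."""
    offenders = []
    with pikepdf.open(path) as pdf:
        acroform = pdf.Root.get("/AcroForm")
        if acroform is None:
            return offenders  # no form at all
        for field in acroform.get("/Fields", []):
            if "/A" in field or "/AA" in field:
                offenders.append(str(field.get("/T", "(unnamed field)")))
    return offenders

print(fields_with_actions("submission-form.pdf"))  # hypothetical file
```

A submit button’s submit-form action would show up here, and it’s exactly what a PDF/A-1 conformance checker has to reject.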

You could have an archival version of the form and a way to convert it to an interactive version, but this seems clumsy. Please let me know if I’ve missed something.

Update: There’s some irony in the fact that on the same day I posted this, I received a print-only PDF form which I’ll now have to take to Staples and fax to the originator.

XKCD on digital preservation

Today’s XKCD comic comments on digital preservation in Randall Munroe’s usual style.
[XKCD cartoon: Digital Data]

Are uncompressed files better for preservation?

How big a concern is physical degradation of files, aka “bit rot,” for digital preservation? Should archives eschew data compression in order to minimize the effect of lost bits? In my experience, few people have raised it as a major concern, but some contributors to the TI/A initiative consider it important enough to affect their recommendations.

Tim Berners-Lee on “trackable” ebooks

Ebooks of the future, says Tim Berners-Lee, should be permanent, seamless, linked, and trackable. That’s three good ideas and one very bad one.

Speaking at BookExpo America, he offered these as the four attributes of the ebooks of the future. They’ll achieve permanence through encoding in HTML5, which is what EPUB basically is. Any ebook that’s available only in a proprietary format with DRM is doomed to extinction. Pinning hopes on Amazon’s eternal existence and support of its present formats is foolish. Seamlessness, the ability to transition through different platforms and content types, follows from using HTML5. This is reasonable and not very controversial.

Spintronics for data storage

DNA data storage sounds like the stuff of science fiction, yet other technologies look even farther out. Spintronics data storage offers greater storage density and stability than magnetic storage, if engineers can get it to work. It depends on a quantum property of the electron called “spin,” which is a measure of angular momentum but doesn’t mean the electron is literally rotating the way a planet does. Analogies of quantum properties to the macroscopic world don’t work very well.

It turns out there are more kinds of magnetism than the ferromagnetism we’re familiar with. Spintronics uses antiferromagnetism. With ferromagnetic materials, ions all line up their individual magnetic fields in the same direction, so that the material overall has a noticeable magnetic field. In antiferromagnetic materials, they line up in “antiparallel” formation, head to head and tail to tail, so that the fields cancel out and there’s no magnetic field on a large scale. With materials of this kind, it’s feasible (for cutting-edge values of “feasible”) to manipulate the spin of the electrons of individual atoms (or perhaps pairs of atoms is more exact), flipping them magnetically.

Making indestructible archives with IPFS

Redundancy is central to digital preservation. When only one copy exists, it’s easy to destroy. Backups and mirrors help, and the more copies there are, the safer the content is. The InterPlanetary File System (IPFS) is a recent technology that could be tremendously valuable in creating distributed archives. I haven’t seen much discussion of it in digital preservation circles, though Roy Tennant discusses it briefly in a Library Journal article.

IPFS is based on a radical vision. Its supporters say that HTTP is broken and needs a replacement. What they mean is that location-based addressing by URLs makes the Web fragile. If a server loses a file, you get a 404 error. If the serving domain goes away, you don’t get any HTTP response. IPFS ensures the persistent availability of files by allowing multiple copies on nodes of a peer network. The trick is that files are addressed by content, not name. An IPFS identifier uses a hash of the content. This protects against file tampering and degradation at the same time; it also means that objects are immutable.
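
The content-addressing idea is easy to demonstrate. Here’s a minimal sketch using a plain SHA-256 hash; real IPFS identifiers (CIDs) wrap the hash in multihash/CID encoding, which is omitted here.

```python
import hashlib

def content_address(data: bytes) -> str:
    """Derive an identifier from the bytes themselves (a simplified
    stand-in for an IPFS CID, which also records the hash function used)."""
    return hashlib.sha256(data).hexdigest()

original = b"Archived report, version of record."
tampered = b"Archived report, version of record!"

print(content_address(original))  # the address peers use to request the object
print(content_address(tampered))  # any change to the content changes the address
```

Because the address is derived from the content, a node that retrieves an object can verify it simply by rehashing it, and an object, once published, can never be silently altered in place.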

IPFS hashes are long strings that no one’s going to remember, so a naming layer called IPNS (the InterPlanetary Name System) can sit on top of them. IPNS seems to be in a primitive state right now; in the future, it may support versioning and viewing the history of named objects. IPFS itself supports a tree of versioned objects.

The enthusiasts of IPFS don’t talk much about dynamic content; the whole concept is antithetical to dynamic, interactive delivery. I can’t imagine any way IPFS could support a login or an online purchase. This means that it can never completely replace HTTP, to say nothing of HTTPS.

What’s especially exciting about IPFS for readers of this blog is its potential for creating distributed archives. An IPFS network can be either public or private. A private network could consist of any number of geographically dispersed nodes. They aren’t mirror images of each other; each node can contain some or all of the archive’s content. Nodes publish objects by adding them to a Distributed Hash Table (DHT); if an object isn’t there, no one knows how to request it. Each node can decide which objects listed in the DHT it’s going to copy. I don’t know whether there’s any way to tell how many nodes have a copy of a file, or any balancing mechanism to guarantee that each file exists in a safe number of copies; a robust archive would need these features. A node that’s going to drop out of the network in an orderly manner needs to make sure at least one other node has every file it wants to persist. Short of having these features, a distributed archive could adopt a policy of putting every file on every node, or it could create a partitioning scheme. For instance, it could compute a three-bit hash of each object, and each node would be responsible for grabbing files in an overlapping subset of the eight possible hash values.
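
Here’s a minimal sketch of that “for instance.” The node names and the bucket assignment are made up for illustration; the point is just that every one of the eight buckets lands on at least two nodes.

```python
import hashlib

BUCKETS = 8  # the eight possible values of a three-bit hash

def bucket(content: bytes) -> int:
    """Three-bit bucket derived from the object's content hash."""
    return hashlib.sha256(content).digest()[0] % BUCKETS

# Hypothetical assignment: four nodes, each responsible for an overlapping
# half of the buckets, so every bucket is replicated on two nodes.
NODE_BUCKETS = {
    "node-a": {0, 1, 2, 3},
    "node-b": {2, 3, 4, 5},
    "node-c": {4, 5, 6, 7},
    "node-d": {6, 7, 0, 1},
}

def responsible_nodes(content: bytes) -> list[str]:
    """Which nodes should grab a copy of this object."""
    b = bucket(content)
    return [node for node, owned in NODE_BUCKETS.items() if b in owned]

print(responsible_nodes(b"some archived object"))
```

Losing any single node still leaves a complete copy of the archive spread across the survivors, which is the minimum a partitioning scheme has to guarantee.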

Some of you must already be thinking about LOCKSS and how that compares with IPFS. The comparison isn’t one-to-one; LOCKSS includes higher-level protocols, such as OAIS ingest and format migration. It isn’t about making distributed copies; participants independently subscribe to content and compare copies with one another, copying to fix damaged files. An IPFS network assumes that all participants have access to all shared content. For a large public domain archive, this could be ideal.

With a public IPFS network, removing material is virtually impossible. This is intentional; redaction and censorship are computationally indistinguishable. An IPFS network under a single authority, however, can delete all copies of a file if it turns out to violate copyright or contain private information. Or, unfortunately, if an authoritarian government orders removal of information that it doesn’t want known.

IPFS could develop into a hot technology for archives. Developers should look into it.

A closer look at DNA storage

A week ago, in my article “Data Storage Meets Biotech,” I wrote about work on DNA as a data storage medium. People on the Internet are getting wildly optimistic about it, talking about storing the entire Internet in a device that’s the size of a sugar cube and will last for centuries. Finding serious analysis is difficult.

For most people, DNA is some kind of magic. The Fantastic Four gained their powers when space radiation altered their DNA. Barack Obama, in one of the most inappropriate metaphors in presidential history, said racism is “part of our DNA that’s passed on.” People want mandatory warning labels on food containing DNA. Finding knowledgeable discussion amid all the noise is difficult. I’m certainly no chemist; I started out majoring in chemistry, but fled soon after my first encounter with college-level lab work.

Data storage meets biotech

With Microsoft’s entry into the field, the use of DNA for data storage is an increasingly serious area of research. DNA is effectively a base-4 data medium, it’s extremely compact, and it contains its own copying mechanism.
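
The “base-4” point is easy to see in code: each of the four bases can carry two bits, so one byte maps to four nucleotides. This is a minimal sketch of one possible mapping; real DNA-storage encodings add constraints (such as avoiding long runs of the same base) and error correction.

```python
BASES = "ACGT"  # one base per 2-bit value

def encode(data: bytes) -> str:
    """Map each byte to four nucleotides, two bits per base."""
    return "".join(
        BASES[(byte >> shift) & 0b11]
        for byte in data
        for shift in (6, 4, 2, 0)
    )

def decode(strand: str) -> bytes:
    """Reverse the mapping: four bases back into one byte."""
    out = bytearray()
    for i in range(0, len(strand), 4):
        byte = 0
        for base in strand[i:i + 4]:
            byte = (byte << 2) | BASES.index(base)
        out.append(byte)
    return bytes(out)

strand = encode(b"Hi")
print(strand)          # CAGACGGC
print(decode(strand))  # b'Hi'
```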

DNA has actually been used to store data; in 2012 researchers at Harvard wrote a book into a DNA molecule and read it back. It’s still much more expensive than competing technologies, though; a recent estimate says it costs $12,000 to write a megabyte and $200 to read it back. The article didn’t specify the scale; surely the cost per megabyte would go down rapidly with the amount of data stored in one molecule.

Don’t expect a disk drive in a molecule. DNA isn’t a random-access medium. Rather, it would be used to archive a huge amount of information and later read it back in bulk. A wild idea would be to store information in a human ovum so it would be passed through generations, making it literal ancestral memory. Now there’s real Mad File Format Science for you!