Monthly Archives: May 2016

PDF/A and forms

The PDF Association reminds us that we can use PDF forms for electronic submissions. It’s a useful feature, and I’ve filled out PDF forms now and then. However, one point seems wrong to me:

PDF/A, the archival subset of PDF technology, provides a means of ensuring the quality and usability of conforming PDF pages (including PDF forms) without any external dependencies. PDF/A offers implementers the confidence of knowing that conforming documents and forms will be readable 10, 20 or 200 years from now.

The problem is that PDF/A doesn’t allow form actions. ISO 19005-1 says, “Interactive form fields shall not perform actions of any type.” You can have a form and you can print it, but without being able to perform the submit-form action, it isn’t useful for digital submissions.

You could have an archival version of the form and a way to convert it to an interactive version, but this seems clumsy. Please let me know if I’ve missed something.

Update: There’s some irony in the fact that, on the same day I posted this, I received a print-only PDF form which I’ll now have to take to Staples to fax back to the originator.

Floppies aren’t dead

Today’s exciting news on Twitter is that one or more of the Department of Defense systems used to coordinate ICBMs and nuclear bombers still use 8-inch floppy disks. A spokesperson for the DoD explained, “It still works.” The computer is an IBM Series/1 that dates from the seventies.
Continue reading

XKCD on digital preservation

Today’s XKCD comic comments on digital preservation in Randall Munroe’s usual style.
[XKCD cartoon: “Digital Data”]
Continue reading

Are uncompressed files better for preservation?

How big a concern is physical degradation of files, aka “bit rot,” to digital preservation? Should archives eschew data compression in order to minimize the effect of lost bits? In most of my experience, no one’s raised that as a major concern, but some contributors to the TI/A initiative consider it important enough to affect their recommendations.
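To see why some archivists worry, here’s a small Python sketch (my own illustration, not anything from the TI/A discussion) comparing how a single flipped bit damages an uncompressed copy versus a gzip-compressed one:

```python
import gzip
import zlib

data = b"The quick brown fox jumps over the lazy dog. " * 100
raw = bytearray(data)
packed = bytearray(gzip.compress(data))

# Simulate bit rot: flip one bit in the middle of each copy.
raw[len(raw) // 2] ^= 0x01
packed[len(packed) // 2] ^= 0x01

# Uncompressed copy: the damage is confined to a single byte.
raw_damage = sum(a != b for a, b in zip(data, raw))

# Compressed copy: the same single-bit flip typically garbles the
# stream from that point on, or makes it undecodable entirely.
try:
    unpacked = gzip.decompress(bytes(packed))
    packed_damage = sum(a != b for a, b in zip(data, unpacked)) \
        + abs(len(data) - len(unpacked))
except (OSError, EOFError, zlib.error):
    packed_damage = len(data)  # stream unreadable: total loss

print(raw_damage, packed_damage)
```

The asymmetry is the whole argument: a flipped bit in a raw file costs one byte, while a flipped bit in a compressed file can take the rest of the stream with it. Whether that risk outweighs the benefits of compression is the question at issue.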
Continue reading

Tim Berners-Lee on “trackable” ebooks

Ebooks of the future, says Tim Berners-Lee, should be permanent, seamless, linked, and trackable. That’s three good ideas and one very bad one.

Speaking at BookExpo America, he offered these as the four attributes of the ebooks of the future. They’ll achieve permanence through encoding in HTML5, which is basically what EPUB is. Any ebook that’s available only in a proprietary format with DRM is doomed to extinction; pinning hopes on Amazon’s eternal existence and continued support of its present formats is foolish. Seamlessness, the ability to move across different platforms and content types, likewise follows from using HTML5. All this is reasonable and not very controversial.
Continue reading

Spintronics for data storage

DNA data storage sounds like the stuff of science fiction, yet other technologies look even farther out. Spintronics data storage offers greater storage density and stability than magnetic storage, if engineers can get it to work. It depends on a quantum property of the electron called “spin,” which is a measure of angular momentum but doesn’t mean the electron is literally spinning like a planet on its axis. Analogies between quantum properties and the macroscopic world don’t work very well.

It turns out there are more kinds of magnetism than the ferromagnetism we’re familiar with. Spintronics uses antiferromagnetism. With ferromagnetic materials, ions all line up their individual magnetic fields in the same direction, so that the material overall has a noticeable magnetic field. In antiferromagnetic materials, they line up in “antiparallel” formation, head to head and tail to tail, so that the fields cancel out and there’s no magnetic field on a large scale. With materials of this kind, it’s feasible (for cutting-edge values of “feasible”) to manipulate the spin of the electrons of individual atoms (or perhaps pairs of atoms is more exact), flipping them magnetically.
Continue reading

JHOVE 1.14

The Open Preservation Foundation has just announced JHOVE 1.14. The numbering is a bit odd. Version 1.12 never made it to release, and they seem to have skipped 1.13 entirely.

This release includes three new modules: the PNG module, which I wrote on a weekend whim, and GZIP and WARC modules adapted from JHOVE2. The UTF-8 module now supports Unicode 7.0.

The release isn’t showing up yet on the OPF website, but I expect that will happen momentarily.

It’s nice to see that the code which I started working on over a decade ago is still alive and useful. Congratulations and thanks to Carl Wilson, who’s now its principal maintainer!

Making indestructible archives with IPFS

Redundancy is central to digital preservation. When only one copy exists, it’s easy to destroy it. Backups and mirrors help, and the more copies there are, the safer the content is. The InterPlanetary File System (IPFS) is a recent technology that could be tremendously valuable in creating distributed archives. I haven’t seen much discussion of it in digital preservation circles, though Roy Tennant has a brief article on it in Library Journal.

IPFS is based on a radical vision. Its supporters say that HTTP is broken and needs a replacement. What they mean is that location-based addressing by URLs makes the Web fragile. If a server loses a file, you get a 404 error. If the serving domain goes away, you don’t get any HTTP response. IPFS ensures the persistent availability of files by allowing multiple copies on nodes of a peer network. The trick is that they’re addressed by content, not name. An IPFS identifier uses a hash of the content. This protects against file tampering and degradation at the same time; it also means that objects are immutable.
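As a rough sketch of the idea, content addressing just means deriving the identifier from the bytes themselves. (This is not IPFS’s actual multihash format, which also encodes the hash algorithm and digest length and then base58-encodes the result; a plain SHA-256 hex digest stands in for it here.)

```python
import hashlib

def content_address(data: bytes) -> str:
    # Simplified stand-in for an IPFS multihash: the address is
    # computed from the content, so identical bytes always get
    # the same address, and any change produces a different one.
    return hashlib.sha256(data).hexdigest()

a = content_address(b"hello archive")
b = content_address(b"hello archive")
c = content_address(b"hello archive!")

print(a == b)  # same content, same address
print(a == c)  # changed content, different address
```

This is also why objects are immutable: “changing” a file necessarily produces a new address, and anyone holding the old address can verify that the bytes they receive really hash to it.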

IPFS hashes are long strings that no one’s going to remember, so a naming layer called IPNS (InterPlanetary Name System) can sit on top of it. IPNS seems to be in a primitive state right now; in the future, it may support versioning and viewing the history of named objects. IPFS itself supports a tree of versioned objects.

The enthusiasts of IPFS don’t talk much about dynamic content; the whole concept is antithetical to dynamic, interactive delivery. I can’t imagine any way IPFS could support a login or an online purchase. This means that it can never completely replace HTTP, to say nothing of HTTPS.

What’s especially exciting about IPFS for readers of this blog is its potential for creating distributed archives. An IPFS network can be either public or private. A private network could consist of any number of geographically dispersed nodes. They aren’t mirror images of each other; each node can hold some or all of the archive’s content. Nodes publish objects by adding them to a Distributed Hash Table (DHT); if an object isn’t listed there, no one knows how to request it. Each node can decide which objects listed in the DHT it’s going to copy.

I don’t know whether there’s any way to tell how many nodes hold a copy of a file, or any balancing mechanism to guarantee that each file exists in a safe number of copies; a robust archive would need both features. A node that’s going to drop out of the network in an orderly manner needs to make sure at least one other node has every file it wants to persist. Short of having these features, a distributed archive could adopt a policy of putting every file on every node, or it could create a partitioning scheme. For instance, it could compute a three-bit hash of every object, and each node would take responsibility for an overlapping subset of the eight possible hash values.
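That partitioning scheme could look something like the following sketch. The four-node layout and node names are my own invention for illustration; the point is that overlapping bucket assignments give every object a known minimum number of holders.

```python
import hashlib

def bucket(data: bytes) -> int:
    # Three-bit hash: the top three bits of a SHA-256 digest,
    # yielding one of eight bucket values (0-7).
    return hashlib.sha256(data).digest()[0] >> 5

# Hypothetical four-node archive: each node claims four of the
# eight buckets, arranged so every bucket lands on exactly two nodes.
NODE_BUCKETS = {
    "node-a": {0, 1, 2, 3},
    "node-b": {2, 3, 4, 5},
    "node-c": {4, 5, 6, 7},
    "node-d": {6, 7, 0, 1},
}

def holders(data: bytes) -> list:
    # Which nodes are responsible for storing this object?
    b = bucket(data)
    return sorted(n for n, bs in NODE_BUCKETS.items() if b in bs)

print(holders(b"some archived file"))  # always exactly two nodes
```

With this layout, any single node can fail without losing content, and the replication level can be tuned by widening or narrowing the overlap.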

Some of you must already be thinking about LOCKSS and how that compares with IPFS. The comparison isn’t one-to-one; LOCKSS includes higher-level protocols, such as OAIS ingest and format migration. It isn’t about making distributed copies; participants independently subscribe to content and compare copies with one another, copying to fix damaged files. An IPFS network assumes that all participants have access to all shared content. For a large public domain archive, this could be ideal.

With a public IPFS network, removing material is virtually impossible. This is intentional; redaction and censorship are computationally indistinguishable. An IPFS network under a single authority, however, can delete all copies of a file if it turns out to violate copyright or contain private information. Or, unfortunately, if an authoritarian government orders removal of information that it doesn’t want known.

IPFS could develop into a hot technology for archives. Developers should look into it.

A closer look at DNA storage

A week ago, in my article “Data Storage Meets Biotech,” I wrote about work on DNA as a data storage medium. People on the Internet are getting wildly optimistic about it, talking about storing the entire Internet in a device that’s the size of a sugar cube and will last for centuries. Finding serious analysis is difficult.

For most people, DNA is some kind of magic. The Fantastic Four gained their powers when space radiation altered their DNA. Barack Obama, in one of the most inappropriate metaphors in presidential history, said racism is “part of our DNA that’s passed on.” People want mandatory warning labels on food containing DNA. Finding knowledgeable discussion amid all the noise is difficult. I’m certainly no chemist; I started out majoring in chemistry, but fled soon after my first encounter with college-level lab work.
Continue reading

Aside

The principal URL for this blog is now https://madfileformatscience.garymcgath.com. The old URL still works and nothing else has changed; the change is just to tie this blog more closely to my website. Also please note the new “Donate” button. It’s all … Continue reading