Redundancy is central to digital preservation. When only one copy exists, it's easily destroyed. Backups and mirrors help, and the more copies there are, the safer the content is. The InterPlanetary File System (IPFS) is a recent technology that could be tremendously valuable in creating distributed archives. I haven't seen much discussion of it in digital preservation circles, though Roy Tennant has briefly discussed it in a Library Journal article.
IPFS is based on a radical vision. Its supporters say that HTTP is broken and needs a replacement. What they mean is that location-based addressing by URL makes the Web fragile. If a server loses a file, you get a 404 error; if the serving domain goes away, you get no HTTP response at all. IPFS aims to keep files persistently available by allowing multiple copies to live on the nodes of a peer-to-peer network. The trick is that files are addressed by content, not by name or location: an IPFS identifier is derived from a hash of the content. This protects against tampering and degradation at the same time; it also means that objects are immutable.
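To make the idea of content addressing concrete, here is a minimal sketch in Python: the identifier is derived from a hash of the bytes, so the same bytes always yield the same address and any change yields a different one. Real IPFS identifiers are multihashes encoded as CIDs, not bare SHA-256 hex digests; this is only an illustration of the principle.

```python
import hashlib

def content_address(data: bytes) -> str:
    """Derive an identifier from the content itself (a simplified stand-in for an IPFS CID)."""
    return hashlib.sha256(data).hexdigest()

original = b"A small file worth preserving."
tampered = b"A small file worth preserving!"

# The same bytes always yield the same address...
assert content_address(original) == content_address(original)
# ...and any change, however small, yields a different one.
assert content_address(original) != content_address(tampered)
```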
IPFS hashes are long strings that no one is going to remember, so a naming layer called IPNS (the InterPlanetary Name System) can sit on top of IPFS. IPNS seems to be in a primitive state right now; in the future it may support versioning and viewing the history of named objects. IPFS itself supports a tree of versioned objects.
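The division of labor between IPNS and IPFS can be pictured as a mutable pointer over immutable objects. The toy registry below is purely hypothetical and is not the IPNS protocol (which involves signed records published to the network); it only illustrates a stable name resolving to whatever content hash is current.

```python
import hashlib

# Hypothetical toy registry: a stable name points at the hash of the current version.
# The objects themselves never change; only the pointer does.
name_registry: dict[str, str] = {}

def publish(name: str, data: bytes) -> str:
    cid = hashlib.sha256(data).hexdigest()   # simplified stand-in for a content hash
    name_registry[name] = cid                # repoint the name at the new version
    return cid

def resolve(name: str) -> str:
    return name_registry[name]

publish("archive-catalog", b"version 1 of the catalog")
publish("archive-catalog", b"version 2 of the catalog")
print(resolve("archive-catalog"))            # the hash of version 2
```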
IPFS enthusiasts don't talk much about dynamic content; the whole concept is antithetical to dynamic, interactive delivery. I can't imagine any way IPFS could support a login or an online purchase. This means that it can never completely replace HTTP, to say nothing of HTTPS.
What’s especially exciting about IPFS for readers of this blog is its potential for creating distributed archives. An IPFS network can be either public or private. A private network could consist of any number of geographically dispersed nodes. They aren’t mirror images of each other; each node can hold some or all of the archive’s content. Nodes publish objects by adding them to a Distributed Hash Table (DHT); if an object isn’t listed there, no one knows how to request it. Each node can decide which objects listed in the DHT it’s going to copy. I don’t know whether there’s any way to tell how many nodes hold a copy of a file, or any balancing mechanism to guarantee that each file exists in a safe number of copies; a robust archive would need both. A node that is going to drop out of the network in an orderly manner needs to make sure at least one other node has every file it wants to persist. Short of having these features, a distributed archive could adopt a policy of putting every file on every node, or it could create a partitioning scheme. For instance, it could compute a three-bit hash of every object, and each node would take responsibility for an overlapping subset of the eight possible hash values; a rough sketch of this idea follows.
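Here is what that partitioning scheme might look like in Python. Everything in it is hypothetical (the node names, the choice of hash, the bucket assignments); the point is only that a three-bit hash splits the collection into eight buckets, and overlapping bucket assignments give every file more than one home.

```python
import hashlib

# Hypothetical assignment: four nodes, each taking four of the eight possible
# 3-bit bucket values, arranged so that every bucket is held by two nodes.
NODE_BUCKETS = {
    "node-a": {0, 1, 2, 3},
    "node-b": {2, 3, 4, 5},
    "node-c": {4, 5, 6, 7},
    "node-d": {6, 7, 0, 1},
}

def bucket(content_id: str) -> int:
    """Map a content identifier to one of eight buckets via a three-bit hash."""
    digest = hashlib.sha256(content_id.encode()).digest()
    return digest[0] & 0b111          # keep the low three bits: a value from 0 to 7

def nodes_responsible(content_id: str) -> list[str]:
    b = bucket(content_id)
    return [node for node, buckets in NODE_BUCKETS.items() if b in buckets]

# Every file lands in a bucket that two different nodes have agreed to copy.
print(nodes_responsible("QmExampleHashOfAnArchivedFile"))
```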
Some of you must already be thinking about LOCKSS and how it compares with IPFS. The comparison isn’t one-to-one; LOCKSS includes higher-level protocols, such as OAIS ingest and format migration. LOCKSS isn’t about making distributed copies; participants independently subscribe to content and compare their copies with one another, repairing damaged files from good ones. An IPFS network, by contrast, assumes that all participants have access to all shared content. For a large public domain archive, this could be ideal.
With a public IPFS network, removing material is virtually impossible. This is intentional; redaction and censorship are computationally indistinguishable. An IPFS network under a single authority, however, can delete all copies of a file if it turns out to violate copyright or contain private information. Or, unfortunately, if an authoritarian government orders the removal of information it doesn’t want known.
IPFS could develop into a hot technology for archives. Developers should look into it.
Identifying files by programming language
Most of today’s programming languages look vaguely similar. They’re derived from C syntax, with similar ways of expressing assignments, arithmetic, conditionals, nested expressions, and groups of statements. If the files have their original extension and it’s accurate, format identification software should be able to classify them correctly; a minimal sketch of that approach follows.
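A first pass at extension-based classification might look like the sketch below. The mapping is just a hypothetical sample, not a complete registry.

```python
from pathlib import Path

# Hypothetical sample mapping; a real tool would draw on a much fuller registry.
EXTENSION_TO_LANGUAGE = {
    ".c": "C",
    ".cpp": "C++",
    ".java": "Java",
    ".js": "JavaScript",
    ".py": "Python",
    ".rs": "Rust",
}

def language_from_extension(path: str) -> str | None:
    """Guess the programming language from the file extension alone."""
    return EXTENSION_TO_LANGUAGE.get(Path(path).suffix.lower())

print(language_from_extension("parser.java"))   # "Java"
print(language_from_extension("notes.txt"))     # None: not a recognized source extension
```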
The software should do some basic checks to make sure it wasn’t handed a binary file with a false extension, which could be dangerous. A code file should be a text file, regardless of the language. (This isn’t strictly true, but non-text languages like Piet and Velato are just obscure for the sake of obscurity.) The UK National Archives recognizes XML and JSON (which is a subset of JavaScript) but doesn’t talk about programming languages as file formats. ExifTool identifies lots of formats but makes no attempt to discern programming languages.
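The basic sanity check described above, confirming that the bytes really are text before trusting the extension, could be sketched like this. The heuristics (a null-byte test plus a UTF-8 decode) are assumptions of mine, not a standard; real identification tools use more elaborate checks.

```python
def looks_like_text(path: str, sample_size: int = 8192) -> bool:
    """Heuristic check that a file really is text, not a binary with a false extension."""
    with open(path, "rb") as f:
        sample = f.read(sample_size)
    if b"\x00" in sample:              # null bytes almost never appear in source code
        return False
    try:
        # Source files should decode as UTF-8 (which includes plain ASCII).
        # A sample cut mid-character can give a false negative; good enough for a sketch.
        sample.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

# A file named "script.py" that fails this check deserves suspicion.
```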