Tag Archives: archiving

Identifying files by programming language

Most of today’s programming languages look vaguely similar. They derive from C-style syntax, with similar ways of expressing assignments, arithmetic, conditionals, nested expressions, and groups of statements. If the files have their original extension and it’s accurate, format identification software should be able to classify them correctly.

The software should do some basic checks to make sure it wasn’t handed a binary file with a false extension, which could be dangerous. A code file should be a text file, regardless of the language. (This isn’t strictly true, but non-text languages like Piet and Velato are just obscure for the sake of obscurity.) The UK National Archives recognizes XML and JSON (which is a subset of JavaScript) but doesn’t talk about programming languages as file formats. ExifTool identifies lots of formats but makes no attempt to discern programming languages.
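As a rough illustration (my own sketch, not taken from any of the tools mentioned above), a checker might combine an extension lookup with a cheap is-this-text test; the extension map and the 8 KB sample size are arbitrary choices for the example:

```python
from pathlib import Path

# Illustrative only: a real identifier would use a much larger map, plus
# signature-based checks, rather than trusting the extension alone.
EXTENSION_MAP = {".c": "C", ".py": "Python", ".js": "JavaScript", ".json": "JSON"}

def looks_like_text(sample: bytes) -> bool:
    """Cheap sanity check: no NUL bytes and the sample decodes as UTF-8.
    (A multi-byte character cut off at the sample boundary would be a
    false negative; good enough for a sketch.)"""
    if b"\x00" in sample:
        return False
    try:
        sample.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

def identify_source_file(path: str) -> str:
    p = Path(path)
    sample = p.read_bytes()[:8192]        # examine only the first 8 KB
    if not looks_like_text(sample):
        return "binary data (the extension may be lying)"
    return EXTENSION_MAP.get(p.suffix.lower(), "unidentified text")
```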

Web archiving and languages

Web archiving is difficult. Few sites consist entirely of static, self-contained content. Most use JavaScript, often from external sites. Responsive pages are designed to look different in different environments. An archive needs to make a snapshot that reflects its appearance at a given point in time, but what exactly does that mean? Should an archive pick an appearance for one reasonable set of parameters, or should it try to keep the page’s dynamic nature? Will the fact that it’s an archive rather than an interactive browser affect what the server gives it?

How to approach the file format validation problem

For years I wrote most of the code for JHOVE. With each format, I wrote tests for whether a file is “well-formed” and “valid.” With most formats, I never knew exactly what these terms meant. They come from XML, where they have clear meanings. A well-formed XML file has correct syntax. Angle brackets and quote marks match. Closing tags match opening tags. A valid file is well-formed and follows its schema. A file can be well-formed but not valid, but it can’t be valid without being well-formed.
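To make the distinction concrete, here is a small sketch using the third-party lxml library; the schema and the sample records are invented for the example, and this is not how JHOVE or any particular validator is implemented:

```python
from lxml import etree  # third-party: pip install lxml

# A toy schema: a <record> must contain exactly one <date> element.
schema = etree.XMLSchema(etree.fromstring(b"""\
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="record">
    <xs:complexType>
      <xs:sequence><xs:element name="date" type="xs:date"/></xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>"""))

def check(data: bytes) -> str:
    try:
        doc = etree.fromstring(data)       # syntax: is it well-formed?
    except etree.XMLSyntaxError:
        return "not well-formed"
    return "valid" if schema.validate(doc) else "well-formed but not valid"

print(check(b"<record><date>2020-01-01</record>"))          # not well-formed
print(check(b"<record><title>x</title></record>"))          # well-formed but not valid
print(check(b"<record><date>2020-01-01</date></record>"))   # valid
```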

With most other formats, there’s no definition of these terms. JHOVE applies them anyway. (I wrote the code, but I didn’t design JHOVE’s architecture. Not my fault.) I approached them by treating “well-formed” as meaning syntactically correct, and “valid” as meaning semantically correct. Drawing the line wasn’t always easy. If a required date field is missing, is the file not well-formed or just not valid? What if the date is supposed to be in ISO 8601 format but isn’t? How much does it matter?
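For what it’s worth, here is one way such a line could be drawn for a single date field, assuming the field has already been extracted; the classification is my framing, not anything JHOVE’s specs prescribe:

```python
from datetime import date

def classify_date_field(value):
    """A hypothetical ruling: absence is a structural (well-formedness)
    problem, a wrong format is a semantic (validity) problem."""
    if value is None:
        return "missing required field: treat as not well-formed"
    try:
        date.fromisoformat(value)          # e.g. "2018-04-25"
        return "well-formed and valid"
    except ValueError:
        return "present but not ISO 8601: well-formed, not valid"

print(classify_date_field(None))
print(classify_date_field("04/25/2018"))
print(classify_date_field("2018-04-25"))
```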

DNA as data storage

What’s the oldest data format in the world? It’s not any of the ones that computer engineers developed in the 20th century, or even ones that telegraph engineers created in the 19th. Far older than those — by billions of years — is the DNA nucleotide sequence. We can think of it as a simple base-4 encoding of arbitrary length.

According to the usual, somewhat simplified, description, a DNA molecule is a double helix, with its backbone made of phosphates and sugars, and four types of nucleotides forming the sequence. They are adenine, guanine, thymine, and cytosine, or A, G, T, and C for short. They’re always found in pairs connecting the two strands of the helix. Adenine and thymine connect together, as do guanine and cytosine.

DNA for data encoding

That’s as deep as I care to go, since biochemistry is far from my areas of expertise. What DNA does is fantastically complicated; change a few bits and you can get a human or a chimpanzee. But as a data model, it’s fantastically simple.
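As a toy illustration of that data model, here’s a base-4 round trip between bytes and nucleotide letters. Real DNA storage schemes add error correction and avoid long runs of the same base, which this sketch ignores:

```python
# Two bits per nucleotide, four nucleotides per byte.
NUCLEOTIDES = "ACGT"

def encode(data: bytes) -> str:
    return "".join(
        NUCLEOTIDES[(byte >> shift) & 0b11]
        for byte in data
        for shift in (6, 4, 2, 0)
    )

def decode(strand: str) -> bytes:
    out = bytearray()
    for i in range(0, len(strand), 4):
        byte = 0
        for base in strand[i:i + 4]:
            byte = (byte << 2) | NUCLEOTIDES.index(base)
        out.append(byte)
    return bytes(out)

print(encode(b"Hi"))                     # CAGACGGC
print(decode(encode(b"Hi")) == b"Hi")    # True
```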

Flash in the Library of Congress’s online archives

Everybody recognizes that Adobe Flash is on the way out. It takes effort to convert existing websites, though, and some sites aren’t maintained, so it won’t disappear from the Web in the next few decades.

With minor or abandoned sites, that doesn’t matter much, but even the Library of Congress has this problem. Its National Jukebox currently requires a browser with Flash enabled to be useful. Turning on Flash for reliable sites such as the Library of Congress should be safe, at least as long as those sites don’t include third-party ads from dubious sources. Not everyone has that option, though. If you’re using iOS, you’re stuck.

I came across the National Jukebox while doing research for my book project Yesterday’s Songs Transformed, and it’s frustrating that I can’t currently use it without taking steps which I’d rather avoid. The good news is that this is a temporary situation and work is already underway to eliminate the Flash dependency. David Sager of the National Jukebox Team replied to my email inquiry:

The Joy Reid case and the fragility of archives

The exposure of old, embarrassing posts by MSNBC columnist Joy Ann Reid has provoked a lot of heated discussion. It’s also revealed the difficulty of retaining reliable information about old material on the Web.

When these old posts came to public attention through Twitter, she asserted that there had been one or more unauthorized break-ins that altered her articles to add offensive content:

In December I learned that an unknown, external party accessed and manipulated material from my now-defunct blog, The Reid Report, to include offensive and hateful references that are fabricated and run counter to my personal beliefs and ideology.

I began working with a cyber-security expert who first identified the unauthorized activity, and we notified federal law enforcement officials of the breach. The manipulated material seems to be part of an effort to taint my character with false information by distorting a blog that ended a decade ago.

Now that the site has been compromised I can state unequivocally that it does not represent the original entries.

The “altered” material, however, was also found, with the same content, in the Internet Archive’s Wayback Machine. If Reid’s statement is true, either the alterations took place shortly after publication and went unnoticed, or the Internet Archive must also have been compromised.


Bit-rot tolerance doesn’t work

My brief post yesterday on the TI/A initiative provoked a lively discussion on Twitter, mostly on whether archival formats should allow compression. The argument against compression rests on the premise that archives should be able to deal with files that have a few bit errors in them. This is a badly mistaken idea.
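For context on why compression comes up at all in this debate, here is a quick illustration (mine, not from the Twitter discussion) of the underlying fact: flip a single bit in a zlib-compressed stream and decompression will, in almost every case, fail outright rather than yield a file with one small flaw:

```python
import zlib

original = b"An archival master file, repeated to give the compressor real work. " * 50
compressed = bytearray(zlib.compress(original))

# Flip one bit in the middle of the compressed stream.
compressed[len(compressed) // 2] ^= 0b0000_0001

try:
    zlib.decompress(bytes(compressed))
    print("decompressed anyway (rare)")
except zlib.error as err:
    print("decompression failed:", err)
```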

Making indestructible archives with IPFS

Redundancy is central to digital preservation. When only one copy exists, it’s easy to destroy it. Backups and mirrors help, and the more copies there are, the safer the content is. The InterPlanetary File System (IPFS) is a recent technology that could be tremendously valuable in creating distributed archives. I haven’t seen much discussion of it in digital preservation circles; Roy Tennant has an article in Library Journal briefly discussing it.

IPFS is based on a radical vision. Its supporters say that HTTP is broken and needs a replacement. What they mean is that location-based addressing by URLs makes the Web fragile. If a server loses a file, you get a 404 error. If the serving domain goes away, you don’t get any HTTP response. IPFS ensures the persistent availability of files by allowing multiple copies on nodes of a peer network. The trick is that they’re addressed by content, not name. An IPFS identifier uses a hash of the content. This protects against file tampering and degradation at the same time; it also means that objects are immutable.
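Here is a much-simplified picture of content addressing; real IPFS identifiers are multihash-based CIDs, not the bare SHA-256 hex digests used in this sketch:

```python
import hashlib

store = {}   # stands in for the peer network

def publish(content: bytes) -> str:
    """The address is derived from the content itself."""
    address = hashlib.sha256(content).hexdigest()
    store[address] = content
    return address

def fetch(address: str) -> bytes:
    content = store[address]
    # Any node can verify it received the right bytes, so tampering
    # and silent degradation are detectable.
    assert hashlib.sha256(content).hexdigest() == address
    return content

addr = publish(b"an archived document")
print(addr)
print(fetch(addr))
```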

IPFS hashes are long strings that no one’s going to remember, so a naming layer called IPNS (InterPlanetary Name System) can sit on top of them. IPNS seems to be in a primitive state right now; in the future, it may support versioning and viewing the history of named objects. IPFS itself supports a tree of versioned objects.

The enthusiasts of IPFS don’t talk much about dynamic content; the whole concept is antithetical to dynamic, interactive delivery. I can’t imagine any way IPFS could support a login or an online purchase. This means that it can never completely replace HTTP, to say nothing of HTTPS.

What’s especially exciting about IPFS for readers of this blog is its potential for creating distributed archives. An IPFS network can be either public or private. A private net could consist of any number of geographically dispersed nodes. They aren’t mirror images of each other; each node can contain some or all of the archive’s content. Nodes publish objects by adding them to a Distributed Hash Table (DHT); if an object isn’t listed there, no one knows how to request it. Nodes can decide which objects listed in the DHT they’re going to copy. I don’t know whether there’s any way to tell how many nodes have a copy of a file, or any balancing mechanism to guarantee that each file has a safe number of copies; a robust archive would need both features. If a node is going to drop out of the network in an orderly manner, it needs to make sure at least one other node has every file it wants to persist. Short of having these features, a distributed archive could set up a policy of putting every file on every node, or it could create a partitioning scheme. For instance, it could compute a three-bit hash of each object, and each node would be responsible for grabbing files with overlapping subsets of the eight possible hash values, as in the sketch below.
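Here’s what that partitioning idea might look like; the node count and the overlap factor are arbitrary choices for the sketch, not anything IPFS prescribes:

```python
import hashlib

NODE_COUNT = 8   # assumed size of the private network
OVERLAP = 3      # how many nodes hold each of the eight buckets

def bucket(content: bytes) -> int:
    """A three-bit hash: eight possible values."""
    return hashlib.sha256(content).digest()[0] & 0b111

def buckets_for_node(node: int) -> set:
    """Each node claims an overlapping run of three buckets."""
    return {(node + i) % 8 for i in range(OVERLAP)}

def nodes_holding(content: bytes) -> list:
    b = bucket(content)
    return [n for n in range(NODE_COUNT) if b in buckets_for_node(n)]

print(nodes_holding(b"some archived object"))   # always three node ids
```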

Some of you must already be thinking about LOCKSS and how that compares with IPFS. The comparison isn’t one-to-one; LOCKSS includes higher-level protocols, such as OAIS ingest and format migration. It isn’t about making distributed copies; participants independently subscribe to content and compare copies with one another, copying to fix damaged files. An IPFS network assumes that all participants have access to all shared content. For a large public domain archive, this could be ideal.

With a public IPFS network, removing material is virtually impossible. This is intentional; redaction and censorship are computationally indistinguishable. An IPFS network under a single authority, however, can delete all copies of a file if it turns out to violate copyright or contain private information. Or, unfortunately, if an authoritarian government orders the removal of information it doesn’t want known.

IPFS could develop into a hot technology for archives. Developers should look into it.

Crowdsourcing song identification

Some friends of mine are pulling together a project for crowdsourcing identification of a large collection of music clips. At least a couple of us are professional software developers, but I’m the one with the most free time right now, and it fits with my library background, so I’ve become lead developer. In talking about it, we’ve realized it can be useful to librarians, archivists, and researchers, so we’re looking into making it a crowdfunded open source project.

A little background: “Filk music” is songs created and sung by science fiction and fantasy fans, mostly at conventions and in homes. I’ve offered a definition of filk on my website. There are some shoestring filk publishers; technically they’re in business, but it’s a labor of love rather than a source of income. Some of them have a large backlog of recordings from past conventions. Just identifying the songs and who’s singing them is a big task.

This project is, initially, for one of these filk publishers, who has the biggest backlog of anyone. The approach we’re looking at is making short clips available to registered crowdsource contributors, and letting them identify as much as they can of the song, the author, the performer(s), the original tune (many of these songs are parodies), etc. Reports would be delivered to editors for evaluation. There could be multiple reports on the same clip; editors would use their judgment on how to combine them. I’ve started on a prototype, using PHP and MySQL.
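To give a sense of the shape of the data, here is a hypothetical sketch of the tables involved, written against SQLite so it runs standalone; the actual prototype uses PHP and MySQL, and none of these table or column names come from it:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE clips (
    id        INTEGER PRIMARY KEY,
    filename  TEXT NOT NULL,
    source    TEXT               -- e.g. which convention recording it came from
);
CREATE TABLE reports (
    id            INTEGER PRIMARY KEY,
    clip_id       INTEGER NOT NULL REFERENCES clips(id),
    contributor   TEXT NOT NULL,
    song_title    TEXT,
    author        TEXT,
    performers    TEXT,
    original_tune TEXT,                    -- many filk songs are parodies
    status        TEXT DEFAULT 'pending'   -- editors combine and accept reports
);
""")
```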

There’s a huge amount of enthusiasm among the people already involved, which makes me confident that at least the niche project will happen. The question is whether there may be broader interest. I can see this as a very useful tool for professionals dealing with archives of unidentified recordings: folk music, old jazz, transcribed wax cylinder collections, whatever. There’s very little in the current design that’s specific to one corner of the musical world.

The first question: Has anyone already done it? Please let me know if something like this already exists.

If not, how interesting does it sound? Would you like it to happen? What features would you like to see in it?

Update: On the Code4lib mailing list, Jodi Schneider pointed out that nichesourcing is a more precise word for what this project is about.

Patent application strikes at digital archiving

Someone called Henry Gladney has filed a US patent application which could be used to troll digital archiving operations in an attempt to force them to pay money for what they’ve been doing all along. The application is more readable than many patents I’ve seen, and it’s simply a composite of existing standard practices such as schema-based XML, digital authentication, public-key authentication, and globally unique identifiers. The application openly states that its PIP (Preservation Information Package) “is also an Archival Information Package as described within the forthcoming ISO OAIS standard.”

I won’t say this is unpatentable; all kinds of absurd software patents have been granted. As far as I’m concerned, software patents are inherently absurd; every piece of software is a new invention, each one builds on techniques used in previously written software, and the pace at which this happens makes a patent’s lifetime of fourteen to twenty years an eternity. If the first person to use any software technique were consistently deemed to own it and others were required to get permission to reuse it, we’d never have ventured outside the caves of assembly language. That’s not the view Congress takes, though.

Patent law does say, though, that you can’t patent something that’s already been done; the term is “prior art.” I can’t see anything in the application that’s new beyond the specific implementation. If it’s only that implementation which is patented, then archivists can and will simply use a different structure and not have to pay patent fees. If the application is granted and is used to get money out of anyone who creates archiving packages, there will be some nasty legal battles ahead, further demonstrating how counterproductive the software patent system is.

Update: There’s discussion on LinkedIn. Registration is required to comment, but not to just read.