Category Archives: commentary

Spintronics for data storage

DNA data storage sounds like the stuff of science fiction, yet other technologies look even farther out. Spintronic data storage offers greater storage density and stability than magnetic storage, if engineers can get it to work. It depends on a quantum property of the electron called “spin,” which is a measure of angular momentum but doesn’t mean the electron is literally spinning like an orbiting planet. Analogies between quantum properties and the macroscopic world don’t work very well.

It turns out there are more kinds of magnetism than the ferromagnetism we’re familiar with. Spintronics uses antiferromagnetism. In ferromagnetic materials, the ions all align their individual magnetic moments in the same direction, so that the material as a whole has a noticeable magnetic field. In antiferromagnetic materials, neighboring moments line up in “antiparallel” formation, head to head and tail to tail, so that the fields cancel out and there’s no magnetic field on a large scale. With materials of this kind, it’s feasible (for cutting-edge values of “feasible”) to manipulate the spin of the electrons of individual atoms (or, more exactly, pairs of atoms), flipping them magnetically.
Continue reading

Making indestructible archives with IPFS

Redundancy is central to digital preservation. When only one copy exists, it’s easy to destroy it. Backups and mirrors help, and the more copies there are, the safer the content is. The InterPlanetary File System (IPFS) is a recent technology that could be tremendously valuable in creating distributed archives. I haven’t seen much discussion of it in digital preservation circles; Roy Tennant has an article in Library Journal briefly discussing it.

IPFS is based on a radical vision. Its supporters say that HTTP is broken and needs a replacement. What they mean is that location-based addressing by URL makes the Web fragile: if a server loses a file, you get a 404 error; if the serving domain goes away, you get no HTTP response at all. IPFS ensures the persistent availability of files by allowing multiple copies on the nodes of a peer network. The trick is that files are addressed by content, not by name: an IPFS identifier is derived from a hash of the content. This protects against file tampering and degradation at the same time; it also means that objects are immutable.
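The content-addressing idea can be sketched in a few lines. This is a simplified illustration only; real IPFS identifiers use a multihash encoding rather than a bare hex digest.

```python
import hashlib

def content_id(data: bytes) -> str:
    # A stand-in for an IPFS identifier: hash the content itself.
    # (Real IPFS wraps the digest in a multihash; this toy version
    # just uses a hex SHA-256 string.)
    return hashlib.sha256(data).hexdigest()

original = b"Redundancy is central to digital preservation."
cid = content_id(original)

# Any node holding the bytes can serve them. The requester re-hashes
# whatever it receives, so tampering or bit rot is detectable,
# and the same content always has the same address.
assert content_id(original) == cid
assert content_id(b"tampered content") != cid
```

This is also why objects are immutable: changing a single byte changes the address.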

IPFS hashes are long strings that no one is going to remember, so a naming layer called IPNS (InterPlanetary Name System) can sit on top of IPFS. IPNS seems to be in a primitive state right now; in the future, it may support versioning and viewing the history of named objects. IPFS itself supports a tree of versioned objects.

The enthusiasts of IPFS don’t talk much about dynamic content; the whole concept is antithetical to dynamic, interactive delivery. I can’t imagine any way IPFS could support a login or an online purchase. This means that it can never completely replace HTTP, to say nothing of HTTPS.

What’s especially exciting about IPFS for readers of this blog is its potential for creating distributed archives. An IPFS network can be either public or private. A private network could consist of any number of geographically dispersed nodes. They aren’t mirror images of each other; each node can hold some or all of the archive’s content. Nodes publish objects by adding them to a Distributed Hash Table (DHT); if an object isn’t listed there, no one knows how to request it. Other nodes can then decide which objects listed in the DHT they’re going to copy. I don’t know whether there’s any way to tell how many nodes hold a copy of a file, or any balancing mechanism to guarantee that each file exists in a safe number of copies; a robust archive would need both features. A node that is going to drop out of the network in an orderly manner needs to make sure at least one other node has every file it wants to persist. Short of having these features, a distributed archive could adopt a policy of putting every file on every node, or it could create a partitioning scheme. For instance, it could compute a three-bit hash of every object, and each node would be responsible for grabbing files in overlapping subsets of the eight possible hash values.
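That partitioning scheme might look something like this sketch. The node names and bucket assignments are hypothetical; the point is that overlapping subsets give every object more than one holder.

```python
import hashlib

NUM_BUCKETS = 8  # 2**3 possible values of a three-bit hash

def bucket(object_id: str) -> int:
    # Three-bit hash: keep only the low three bits of a digest byte.
    digest = hashlib.sha256(object_id.encode()).digest()
    return digest[0] & 0b111

# Hypothetical assignment: each node claims four of the eight buckets,
# overlapping so that every bucket is held by exactly two nodes.
node_buckets = {
    "node-a": {0, 1, 2, 3},
    "node-b": {2, 3, 4, 5},
    "node-c": {4, 5, 6, 7},
    "node-d": {6, 7, 0, 1},
}

def holders(object_id: str) -> list[str]:
    # Which nodes are responsible for grabbing this object?
    b = bucket(object_id)
    return [name for name, buckets in node_buckets.items() if b in buckets]

# Every bucket is covered by two nodes, so losing any single node
# never loses content.
for i in range(NUM_BUCKETS):
    assert sum(i in buckets for buckets in node_buckets.values()) == 2
```

A real archive would still want the monitoring and rebalancing features mentioned above; a static assignment like this only guarantees a fixed replication factor.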

Some of you must already be thinking about LOCKSS and how it compares with IPFS. The comparison isn’t one-to-one; LOCKSS includes higher-level protocols, such as OAIS ingest and format migration. It isn’t about making distributed copies; participants independently subscribe to content and compare their copies with one another, copying to fix damaged files. An IPFS network, by contrast, assumes that all participants have access to all shared content. For a large public domain archive, this could be ideal.

With a public IPFS network, removing material is virtually impossible. This is intentional; redaction and censorship are computationally indistinguishable. An IPFS network under a single authority, however, can delete all copies of a file if it turns out to violate copyright or contain private information. Or, unfortunately, if an authoritarian government orders removal of information that it doesn’t want known.

IPFS could develop into a hot technology for archives. Developers should look into it.

A closer look at DNA storage

A week ago, in my article “Data Storage Meets Biotech,” I wrote about work on DNA as a data storage medium. People on the Internet are getting wildly optimistic about it, talking about storing the entire Internet in a device that’s the size of a sugar cube and will last for centuries. Finding serious analysis is difficult.

For most people, DNA is some kind of magic. The Fantastic Four gained their powers when space radiation altered their DNA. Barack Obama, in one of the most inappropriate metaphors in presidential history, said racism is “part of our DNA that’s passed on.” People want mandatory warning labels on food containing DNA. Finding knowledgeable discussion amid all the noise is difficult. I’m certainly no chemist; I started out majoring in chemistry, but fled soon after my first encounter with college-level lab work.
Continue reading

Data storage meets biotech

With Microsoft’s entry into the field, the use of DNA for data storage is an increasingly serious area of research. DNA is effectively a base-4 data medium, it’s extremely compact, and it contains its own copying mechanism.
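The base-4 idea is easy to sketch: each of the four nucleotides can stand for two bits. The mapping below is a toy one; real DNA storage codecs add error correction and avoid long runs of the same base.

```python
# Toy base-4 codec: two bits per nucleotide, so four nucleotides
# encode one byte. This only illustrates the arithmetic, not a
# practical DNA encoding.
BASES = "ACGT"

def encode(data: bytes) -> str:
    out = []
    for byte in data:
        for shift in (6, 4, 2, 0):          # four 2-bit groups per byte
            out.append(BASES[(byte >> shift) & 0b11])
    return "".join(out)

def decode(strand: str) -> bytes:
    out = bytearray()
    for i in range(0, len(strand), 4):
        byte = 0
        for base in strand[i:i + 4]:
            byte = (byte << 2) | BASES.index(base)
        out.append(byte)
    return bytes(out)

strand = encode(b"Hi")
assert decode(strand) == b"Hi"
assert len(strand) == 8   # four nucleotides per byte
```

The compactness claim follows directly: a byte costs four bases, and a base pair occupies a fraction of a cubic nanometer.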

DNA has actually been used to store data; in 2012 researchers at Harvard wrote a book into a DNA molecule and read it back. It’s still much more expensive than competing technologies, though; a recent estimate says it costs $12,000 to write a megabyte and $200 to read it back. The article didn’t specify the scale; surely the cost per megabyte would go down rapidly with the amount of data stored in one molecule.

Don’t expect a disk drive in a molecule. DNA isn’t a random-access medium. Rather, it would be used to archive a huge amount of information and later read it back in bulk. A wild idea would be to store information in a human ovum so it would be passed through generations, making it literal ancestral memory. Now there’s real Mad File Format Science for you!

Designing to the demo is a mistake

A lot of software design clearly aims not at providing the best experience for the user, but at making the most impressive demo. Apple does this all the time; at least, that’s the only explanation I can think of for some of its design decisions. Getting people to applaud in amazement doesn’t win loyal customers if the product is terrible in everyday use, though.

My current Garmin car GPS device is a good example. To enter an address, you enter the street number first, then the street, and finally the locality and state together. This sounds very natural, much better than my old device, where you started with the state and worked down to the street number. The trouble is that when you actually use the new device, you find that auto-completion is useless.
Continue reading

Want FLAC on your Mac? Try Vox

iTunes is horrible and keeps getting worse. The current version has come down with dyslexia; it can’t even play my files in order. On top of that, it supports a poor range of file formats, knowing nothing about popular open formats like FLAC and Ogg Vorbis. QuickTime Player has a saner user interface but the same format limitations. If you want to play music in those formats, you need to look for other software. I’ve just grabbed Vox for OS X, and it handles those files nicely.

It’s not an iTunes replacement, even if all you want to do is play music that’s stored on your computer. You can import your iTunes library, but you can’t view the contents of your playlists (which it calls “collections”) or select items from them. What it does let you do, though, is play FLAC, AAC, ALAC (Apple Lossless), Ogg, MP3, and APE files.
Continue reading

Is JHOVE dead in the water again?

See this post for important updates.

In December, JHOVE 1.12 was very close to a release. Since then, next to nothing has happened. The installer for the beta version expired, and there’s been an update to fix that. A couple of pull requests have been merged. Otherwise — nothing.

I think what’s happened is that the Open Preservation Foundation’s very limited resources were pulled onto veraPDF. That’s certainly a worthwhile endeavor, but it irks me that I handed support of JHOVE over to OPF only to see the ball dropped. I did some work on a PNG module a month ago and submitted a pull request; nothing’s happened since then.

I wouldn’t mind picking JHOVE up again, but I’m going to be blunt about this: I’m done with working on it for free. If institutions that want JHOVE to be maintained really care about it, they should put up some money, whether it’s to OPF, to me, or to someone else. Open source software isn’t something that magically happens because people love to work without pay.

When do the MP3 patents expire?

Why exactly is MP3 still popular? It’s not as efficient as more recent compression methods, and it’s encumbered by patents. People keep using what’s familiar. In a few years, it may become patent-free.

A Tunequest piece from 2007 lists several expiration dates that are still in the future:
Continue reading

The Java file format API graveyard

If you look for Java libraries to support specific file formats, you’ll soon come upon the gloomy graveyard of Java APIs. Sun and Oracle have a history of devising nice packages for reading and writing different kinds of files, only to abandon their maintenance. You can still find pages for them, and it takes a close look to figure out that they aren’t supported any more.

Java Advanced Imaging (JAI) was nice in its time. It still has a page on Oracle’s website, but the latest “what’s new” item is dated 2007. The page brags about customer success stories as if it were still usable code. I’ve tried working with it. It’s out of sync with the current com.sun classes, and I got only limited use out of it. In its time it was a good way to read and write image files.

Java Media Framework (JMF) runs on a 166 MHz Pentium or 160 MHz PowerPC. The downloaded jars are dated May 1, 2003. It had a nice list of supported formats.

If you’re working with audio files, javax.sound looks more encouraging. Its API is listed with Java 8. The class javax.sound.sampled.AudioSystem supports reading and writing of audio files. I can’t find a list of the supported formats.

Java does reliably support some formats. Its handling of text encodings is versatile, and java.util.zip handles ZIP and GZIP.

Third-party code can come to the rescue. For reading and writing PDF, Apache PDFBox looks like the best bet. You can use Apache Tika with lots of formats, if you just need to extract metadata. Another alternative is to use ImageMagick, but it runs natively rather than under the JVM, so you have to invoke it with exec calls. im4java and JMagick can save some of the tedium. There are open source Java libraries for reading and writing specific file formats. Some may be good, some not.

If you need to deal with the guts of file formats in Java, you’ll usually have to find some good third-party code or start writing your own.

Closed captioning formats

An online discussion led to my learning about Udemy’s support for closed captioning and to the formats available for it. Since I hadn’t heard about these formats before, I’m guessing a lot of other people haven’t. They can be useful not only for accessibility but for preservation, since they provide a textual version of spoken words in a video. These are just some notes on what I’ve found in a cursory investigation. In general, sites that support closed captioning expect a text file in one of several formats, which has to have at least the text of the caption, its starting time, and its duration or ending time.
Continue reading
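As a sketch of that minimal structure, here is how one widely used caption format, SubRip (.srt), lays out a single cue: an index, a start and end timestamp, and the caption text. The helper below is an illustration of the format, not any particular site’s API.

```python
# Build one SubRip (.srt) cue. SRT timestamps use the form
# hh:mm:ss,mmm with a comma before the milliseconds.

def srt_time(seconds: float) -> str:
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def srt_cue(index: int, start: float, duration: float, text: str) -> str:
    # SRT stores an ending time, so a duration is converted to one.
    end = start + duration
    return f"{index}\n{srt_time(start)} --> {srt_time(end)}\n{text}\n"

cue = srt_cue(1, 3.5, 2.25, "Hello, world.")
assert cue == "1\n00:00:03,500 --> 00:00:05,750\nHello, world.\n"
```

Other caption formats (WebVTT, for example) carry the same three essentials with different timestamp punctuation, which is why converting between them is usually straightforward.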