A closer look at DNA storage

A week ago, in my article “Data Storage Meets Biotech,” I wrote about work on DNA as a data storage medium. People on the Internet are getting wildly optimistic about it, talking about storing the entire Internet in a device that’s the size of a sugar cube and will last for centuries. Finding serious analysis is difficult.

For most people, DNA is some kind of magic. The Fantastic Four gained their powers when space radiation altered their DNA. Barack Obama, in one of the most inappropriate metaphors in presidential history, said racism is “part of our DNA that’s passed on.” People want mandatory warning labels on food containing DNA. Finding knowledgeable discussion amid all the noise is difficult. I’m certainly no chemist; I started out majoring in chemistry, but fled soon after my first encounter with college-level lab work.

Fortunately, the paper “A DNA-Based Archival Storage System”, by researchers from the University of Washington and Microsoft Research, is available as an unrestricted download. This is the one key article to read. It’s tough going in places for me, but let’s look at some of its points.

It mentions that DNA has a half-life of over 500 years. Half-life is a number expressing the rate of exponential decay; an article from The Scientist explains that after that period, half of a sample’s nucleotide bonds are broken. This seems to mean that a DNA molecule holding data without redundancy has lost about half of its bits after 500 years. Data storage is useless when even 1% of its bits have gone bad. Using a convenient online half-life calculator, I find that losing 1% of the bonds takes only about seven years. Put that way, it doesn’t sound nearly as impressive, but highly redundant storage is practical, allowing error correction. Current sequencing techniques, according to the paper, introduce a 1% error rate to start with, so redundancy would be necessary in any case. The 500-year figure appears to assume a natural environment; the half-life should be better in climate-controlled storage. The article discusses possible redundant encodings, drawing explicit comparisons to RAID.
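The arithmetic is easy to check without the online calculator. Here’s a quick sketch: the 500-year half-life is the paper’s figure, but the function and its name are mine.

```python
import math

HALF_LIFE_YEARS = 500  # the nucleotide-bond half-life cited for DNA

def years_until_fraction_lost(fraction_lost):
    """Solve (1/2) ** (t / HALF_LIFE_YEARS) == 1 - fraction_lost for t."""
    return HALF_LIFE_YEARS * math.log(1.0 - fraction_lost) / math.log(0.5)

print(years_until_fraction_lost(0.01))  # about 7.25 years to lose 1% of bonds
print(years_until_fraction_lost(0.50))  # 500 years: the half-life itself
```

Exponential decay is front-loaded, which is why the first 1% goes so much faster than a naive 500/50 = 10 years would suggest.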

According to the article, “We envision DNA storage as the very last level of a deep storage hierarchy, providing very dense and durable archival storage with access times of many hours to days. DNA synthesis and sequencing can be made arbitrarily parallel, making the necessary read and write bandwidths attainable.” This confirms my initial reading that it’s not a replacement for a disk drive.

Each DNA strand in the proposed scheme would hold only about a hundred bits. I’d been thinking in terms of human DNA, which contains millions of nucleotides per chromosome. A storage device would have to contain billions of molecules, so parallelism and redundancy should be easy once you have an effective reading mechanism. Storage would use fluorescent nucleotides to allow optical reading of the DNA.

The details of the proposal are fascinating. It recommends encoding data in base 3 rather than the natural base 4, using a scheme that never emits the same nucleotide twice in a row, since repeated nucleotides are particularly error-prone. The scheme includes parity nucleotides for error detection.
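A rotating code like that is easy to sketch: each base-3 digit selects one of the three nucleotides that differ from the previous one, so a repeat is impossible by construction. The mapping below is my own illustration of the general idea, not the paper’s exact table.

```python
NUCLEOTIDES = "ACGT"

def encode_trits(trits, prev="A"):
    """Encode base-3 digits so no nucleotide ever repeats consecutively."""
    out = []
    for t in trits:
        # the three candidates are whichever nucleotides differ from the last one
        candidates = [n for n in NUCLEOTIDES if n != prev]
        prev = candidates[t]
        out.append(prev)
    return "".join(out)

strand = encode_trits([0, 2, 1, 1, 0])
print(strand)
# no two adjacent nucleotides are equal, by construction
assert all(a != b for a, b in zip(strand, strand[1:]))
```

The cost is one lost symbol per position (base 3 instead of base 4, about 0.21 bits per nucleotide), bought back in reliability.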

There’s something wonderful about the phrase “parity nucleotide.” It sounds like high-quality science fiction. Another wonderful detail is this: “Primers allow random access via a polymerase chain reaction (PCR), which produces many copies of a piece of DNA in a solution.” That means, if I’m reading it correctly, that DNA’s ability to self-replicate is an essential part of the reading mechanism!
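The error-detection side of a parity nucleotide can be illustrated at the base-3 level: append a check digit equal to the sum of the data digits mod 3, and any single corrupted digit changes the sum. This is my own hypothetical illustration of the idea; the paper’s actual parity placement and formula may differ.

```python
def add_parity(trits):
    """Append a base-3 check digit: the sum of the data trits mod 3."""
    return trits + [sum(trits) % 3]

def check_parity(trits_with_parity):
    """Return True if the stored check digit still matches the data."""
    *data, parity = trits_with_parity
    return sum(data) % 3 == parity

word = add_parity([0, 2, 1, 1])
assert check_parity(word)

word[1] = (word[1] + 1) % 3   # simulate a single-trit read error
assert not check_parity(word)
```

As with byte parity, this detects any single-symbol error but can’t locate or correct it; that’s what the heavier redundancy is for.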

The article suggests that redundancy matters more for some data than for others; for instance, errors in a file header will often make the whole file unreadable, while a single bad pixel isn’t critical. The combination of less reliable bits and extremely large capacities could make DNA storage a significant force for change in file format technology. You don’t need compression when your device holds zettabytes, and compressed files are vastly more vulnerable to single-bit errors than uncompressed ones. New formats designed for ultra-high-capacity archives may spring up, with built-in redundancy at critical points and high error tolerance elsewhere.
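The fragility of compressed data is easy to demonstrate with any off-the-shelf compressor; here I use Python’s zlib as a stand-in. One flipped bit in the middle of a DEFLATE stream corrupts everything decoded after it, while the same flip in the raw text would damage a single character.

```python
import zlib

data = b"the quick brown fox jumps over the lazy dog " * 100
packed = bytearray(zlib.compress(data))
packed[len(packed) // 2] ^= 0x01  # simulate a single-bit storage error

try:
    recovered = zlib.decompress(bytes(packed))
    ok = recovered == data
except zlib.error:
    ok = False

print("survived the bit flip" if ok else "one bit flip ruined the compressed file")
```

The uncompressed copy, by contrast, would lose one letter out of 4,400 — exactly the kind of trade an error-tolerant archive format could exploit.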

When I started working on this piece, I expected to be very skeptical. After reading the article, I actually have more confidence that DNA storage is a real possibility than I did from reading the accounts in popular media. It’ll be an interesting future.
