What’s the oldest data format in the world? It’s not any of the ones that computer engineers developed in the 20th century, or even ones that telegraph engineers created in the 19th. Far older than those — by billions of years — is the DNA nucleotide sequence. We can think of it as a simple base-4 encoding of arbitrary length.
According to the usual, somewhat simplified, description, a DNA molecule is a double helix, with its backbone made of phosphates and sugars, and four types of nucleotides forming the sequence. They are adenine, guanine, thymine, and cytosine, or A, G, T, and C for short. They’re always found in pairs connecting the two strands of the helix. Adenine and thymine connect together, as do guanine and cytosine.
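The pairing rule is simple enough to write down as a lookup table. Here's a minimal sketch; the names `PAIRS` and `complement` are my own, and it ignores details such as the direction in which each strand is read.

```python
# Pairing rule from the text: A pairs with T, G pairs with C.
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}

def complement(strand: str) -> str:
    """Return the bases found opposite `strand` on the other strand of the helix."""
    return "".join(PAIRS[base] for base in strand)

print(complement("GATTACA"))  # CTAATGT
```

From a data point of view, the second strand adds no new information. It's fully determined by the first, so it works more like a built-in backup copy.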
DNA for data encoding
That’s as deep as I care to go, since biochemistry is far away from my areas of expertise. What DNA does is fantastically complicated; change a few bits and you can get a human or a chimpanzee. But as a data model, it’s fantastically simple.
If you can create a DNA molecule with an arbitrary sequence, you can encode anything. It almost certainly wouldn’t produce any kind of living organism, which is probably a good thing. Imagine that any DNA encoding generated some kind of monster. You’d have the stuff of a Charles Stross horror novel, as GIF files went rampaging.
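To make the base-4 idea concrete, here's a rough sketch that turns bytes into a nucleotide string and back, two bits per base. The mapping (A=00, C=01, G=10, T=11) and the function names are arbitrary choices of mine, not any lab's actual scheme. Practical encodings reportedly add constraints, such as avoiding long runs of the same base, that this ignores.

```python
BASES = "ACGT"  # assumed mapping: A=00, C=01, G=10, T=11

def bytes_to_dna(data: bytes) -> str:
    """Encode each byte as four bases, most significant bit pair first."""
    out = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            out.append(BASES[(byte >> shift) & 0b11])
    return "".join(out)

def dna_to_bytes(seq: str) -> bytes:
    """Decode a base string produced by bytes_to_dna back into bytes."""
    if len(seq) % 4 != 0:
        raise ValueError("sequence length must be a multiple of 4")
    out = bytearray()
    for i in range(0, len(seq), 4):
        byte = 0
        for base in seq[i:i + 4]:
            byte = (byte << 2) | BASES.index(base)  # raises ValueError on a non-ACGT symbol
        out.append(byte)
    return bytes(out)

print(bytes_to_dna(b"GIF"))  # CACTCAGCCACG
```

The first three bytes of a GIF header come out as twelve harmless bases, with no monster in sight.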
Data DNA is just data storage, not really biology, but it’s an extremely dense way of storing data. It’s been used to encode a video. According to a Science article, one gram of DNA can hold 215 petabytes of data.
Microsoft says stored DNA has a half-life of over 500 years. This is less impressive than it sounds, though. That doesn’t mean a molecule has a 50% chance of being intact after 500 years. It means that after 500 years, a molecule will have lost half its data.
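Taking that reading at face value, and assuming plain exponential decay (my assumption; the only figure given is the 500-year half-life), the fraction of data left after any number of years is easy to estimate. The function name is made up for illustration.

```python
def fraction_remaining(years: float, half_life_years: float = 500.0) -> float:
    """Fraction of the data still intact after `years`, assuming exponential decay."""
    return 0.5 ** (years / half_life_years)

for years in (100, 500, 1000):
    print(years, round(fraction_remaining(years), 3))
```

Under this model, losses start right away: roughly 13% of the data would be gone after the first century. That's one more reason any long-term archive would need redundancy.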
How practical is DNA storage?
DNA storage has some advantages over other media. As a natural storage medium, DNA will exist as long as we do. This doesn’t mean, though, that people in the far future will understand the encoding methods used today.
The main barrier at present is the cost of writing and reading. The experiment with video storage cost $7,000 to encode 2 megabytes of data, which works out to $3,500 per megabyte. But treating a molecule as a long string of bits may not be the most economical way to use it. One company is looking into having a “library” of DNA molecules which can be combined to store data. This way, it isn’t necessary to create new types of molecules, just to duplicate existing ones.
I don’t know how DNA readers work, but there’s a compact USB reader. The trouble is that it’s less than 80% accurate. That doesn’t make it entirely useless, though; add enough error correction to an encoding, and it can produce accurate results by giving up some data density.
Sequencers have a significant error rate, so redundancy is necessary even if the readers are perfect. Claims of accuracy with the newest technology go as high as 99.8%, with somewhat lower accuracy being more common.
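The simplest way to trade density for accuracy is to read the same sequence several times and take a majority vote at each position. This is only a sketch of that idea, not how any real sequencing pipeline works. It handles substituted bases only, and it assumes every read has the same length, which real readers don't guarantee. The function name is hypothetical.

```python
from collections import Counter

def majority_vote(reads: list[str]) -> str:
    """Combine several noisy reads of one sequence by voting at each position."""
    if not reads:
        raise ValueError("need at least one read")
    if any(len(read) != len(reads[0]) for read in reads):
        raise ValueError("all reads must have the same length")
    # zip(*reads) yields one tuple of bases per position across all reads.
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*reads))

# Three reads of "ACGTAC", two of them with a single wrong base each.
print(majority_vote(["ACGTAC", "ACGTTC", "ACCTAC"]))  # ACGTAC
```

Three copies means a third of the density. Real error-correcting codes get better protection for less overhead, but the tradeoff is the same in kind.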
Malware in DNA?
Any form of data storage can be abused, and we’re seeing claims of malware stored in DNA molecules. What’s really going on is that some researchers created an intentionally buggy version of a compression algorithm for the FASTQ data format, which is widely used for encoding sequences.
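For anyone who hasn't seen FASTQ, it's a plain-text format with four lines per record: an identifier line starting with `@`, the bases, a separator line starting with `+`, and one quality character per base. Here's a minimal reader, assuming the common Phred+33 quality encoding; the function name and error messages are mine. The length and marker checks are the kind of validation that software handling untrusted sequence data has to get right.

```python
def parse_fastq(text: str) -> list[tuple[str, str, list[int]]]:
    """Parse FASTQ text into (identifier, sequence, quality scores) tuples."""
    lines = [line.strip() for line in text.strip().splitlines()]
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ records are four lines each")
    records = []
    for i in range(0, len(lines), 4):
        header, sequence, separator, quality = lines[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record at line {i + 1}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch at line {i + 1}")
        # Phred+33 (assumed): each quality character's code point minus 33.
        scores = [ord(char) - 33 for char in quality]
        records.append((header[1:], sequence, scores))
    return records

print(parse_fastq("@read1\nGATTACA\n+\nIIIIIII\n"))
```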
This isn’t exactly exciting news. You can make a file format vulnerable by putting vulnerabilities into the code that implements it. Not the stuff of a red alert. The only thing that makes it interesting news is that, in principle, you could engineer a DNA molecule that took advantage of the bugs you introduced into the software. If you’ve got that much control over the system, why not just walk up to the sysadmin with a USB stick and say, “This has ransomware on it. Could you please install it on your servers?” It would be less work and have about as much chance of succeeding.
But news outlets have to have their sensational articles. The real future with DNA as a storage medium, even after discounting the hype, is exciting enough.
Take a look at my 2016 article on DNA data storage if you’re interested in my thoughts from back then on the subject.