My brief post yesterday on the TI/A initiative provoked a lively discussion on Twitter, mostly on whether archival formats should allow compression. The argument against compression rests on the premise that archives should be able to deal with files that have a few bit errors in them. This is a badly mistaken idea.
Files are subject to “bit rot.” When they’re stored long enough, bits in them may become unreliably readable or change value. Depending on which ones change, this may ruin the file, degrade it just slightly, or have no effect at all.
Let’s consider a TIFF file. If a bit changes in the file header or a tag header, it’s likely to make the file useless, perhaps not even recognizable as TIFF. If it changes in a stream of compressed data, it will probably turn a significant chunk of the file into visual noise. A bit error in a text description or uncompressed image data will introduce just a little noise, and a change in an unused byte will have no effect.
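To make this concrete, here is a small Python sketch that flips a single bit in a tiny uncompressed TIFF, first in the header and then in the pixel data. It assumes Pillow is installed and relies on Pillow's writer putting the pixel strip at the end of the file; the byte offsets are illustrative, not a statement about TIFF files in general.

```python
from io import BytesIO
from PIL import Image

# Build a tiny uncompressed TIFF in memory.
buf = BytesIO()
Image.new("RGB", (8, 8), "white").save(buf, format="TIFF")
original = buf.getvalue()

def flip_bit(raw, byte_index, bit=0):
    damaged = bytearray(raw)
    damaged[byte_index] ^= 1 << bit
    return bytes(damaged)

# A flipped bit in the byte-order mark (byte 0) makes the file unrecognizable.
try:
    Image.open(BytesIO(flip_bit(original, 0, bit=1))).load()
except Exception as err:
    print("header damage:", err)

# A flipped bit deep in the uncompressed pixel data is barely noticeable:
# the file still opens, and one colour sample is slightly off.
# (Pillow writes the pixel strip at the end of the file, so this byte is image data.)
Image.open(BytesIO(flip_bit(original, len(original) - 10))).load()
print("pixel-data damage: file still opens")
```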
The argument is that a bit error in uncompressed image data will do far less damage than one in compressed image data, so banning compressed images from an archive will improve file viability.
The premise behind this idea is that archives should be bit-rot tolerant. That is, archives should consider files acceptable even if they don’t match their digest, and try to salvage what they can. This is just a bad idea. Here are a few reasons:
- You’re playing Russian roulette. If a file doesn’t pass its fixity check, anything could be damaged, and you don’t even know how many bits are affected. The file could be useless or seriously degraded. An archive should work with files that pass fixity checks and use a backup to recover failed ones (a minimal digest check is sketched after this list).
- Using only uncompressed files increases the probability of bit rot. I’m assuming bit rot is a Poisson process, with each bit on a drive equally likely to fail. If an uncompressed file is three times as long as a compressed file, then it’s roughly three times as likely to have an error, assuming a very low per-bit error rate (see the rough calculation after this list). Insisting on uncompressed files but accepting damaged files accepts an increased probability of file damage in exchange for the hope that the damage isn’t serious.
- Once a file starts failing, it’s likely to develop more errors over time. The storage medium is going bad.
- It’s better to know that a file is damaged than to hope it isn’t too badly damaged. The sooner a defect is caught, the better the chances are of finding a good copy to recover from.
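On the first point: a fixity check is nothing more than recomputing a file’s digest and comparing it with the stored value. Here is a minimal sketch, assuming SHA-256 digests; the function name and chunk size are my own choices, not the API of any particular repository system.

```python
import hashlib

def passes_fixity_check(path, expected_sha256, chunk_size=1 << 20):
    """Recompute a file's SHA-256 and compare it with the stored digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_sha256

# If the check fails, don't try to salvage the file; restore it from a
# backup copy and record the failure.
```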
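On the second point, the back-of-the-envelope model looks like this. The per-bit error probability and the file sizes below are invented purely to show the scaling; only the ratio matters.

```python
# Treat each bit as failing independently with a small probability p.
p = 1e-12                                  # assumed per-bit error probability
compressed_bits = 8 * 10_000_000           # a 10 MB compressed file
uncompressed_bits = 3 * compressed_bits    # the same image stored uncompressed

def prob_at_least_one_error(n_bits, p):
    # Exact form; approximately n_bits * p when p is tiny.
    return 1 - (1 - p) ** n_bits

print(prob_at_least_one_error(compressed_bits, p))    # ~8.0e-05
print(prob_at_least_one_error(uncompressed_bits, p))  # ~2.4e-04, about three times higher
```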
Archives should take the best possible measures to keep their files intact, and make provision for recovering them when any damage is detected. Keeping damaged files in the hope that they’re still useful should be only a last resort. Banning compression from archives in the hope of minimizing the damage from bit rot is a foolish preservation strategy.
I think it varies from format to format. We have video files in transport-stream format, for example. Transport streams, which in our case contain compressed video, are designed to be tolerant of bit errors – you don’t want your TV decoder to give up just because of some momentary signal interference.
WARC also has some built-in bit-error tolerance because each record includes a checksum of its payload. So as long as you can still split the file into individual records, you can identify which records are damaged.
That said, we are currently in the process of compressing our entire ARC/WARC archive, precisely because we’ve come to the conclusion that our “compression is always bad” rationale doesn’t really hold up under scrutiny. Apart from anything else, the hardest data to preserve is always the data you didn’t gather in the first place. So when storage costs start to become a bottleneck – compress, compress, compress!
Streaming and file storage are very different things. Streaming needs to be tolerant not only of bit errors but of dropped packets.