Are uncompressed files better for preservation?

How big a concern is physical degradation of files, aka “bit rot,” to digital preservation? Should archives eschew data compression in order to minimize the effect of lost bits? In most of my experience, no one’s raised that as a major concern, but some contributors to the TI/A initiative consider it important enough to affect their recommendations.

Damaged Image file, Atlas of Digital Damages, placed on Flickr by Paul Wheatley. (CC BY-NC-SA 2.0)

Files can go bad, sometimes just because a few bits flip. This can happen in the file system, the file header, the metadata, the structural elements, or the content data. Depending on where it happens, a single changed bit can make the file unrenderable, degrade the image, or have no effect at all. The usual defense against this risk is digests and backups. The archive computes a digest, such as MD5 or SHA-1, of the file and stores it. When someone retrieves the file, the software recomputes the digest. If it doesn’t match the stored value, the software warns the user that the file is damaged, and the file has to be restored from the backup copy. Not counting catastrophes that ruin whole files, the odds of file damage are low in a decent storage system, and the odds of the original and backup both being damaged are much lower.
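
For readers who want to see the digest check in concrete terms, here is a minimal sketch in Python. The function names (compute_digest, is_intact) are hypothetical, and the default of SHA-256 is my assumption for the example; passing "md5" or "sha1" gives the digests mentioned above. A real archive would do this inside its repository software rather than in a standalone script.

```python
import hashlib


def compute_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Return the hex digest of the file at `path`.

    The file is read in chunks so that large images do not have to
    fit in memory. `algorithm` is any name hashlib accepts, for
    example "md5", "sha1", or "sha256".
    """
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def is_intact(path, stored_digest, algorithm="sha256"):
    """Recompute the digest and compare it with the stored value."""
    return compute_digest(path, algorithm) == stored_digest
```

The archive would call compute_digest once at ingest and store the result; is_intact is the check made at retrieval time. A False result says only that the file has changed, not where, which is why the backup copy is still needed.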

Some people in the TI/A discussion argue against accepting compressed files as archival-quality TIFF, because of their greater susceptibility to bit rot. In an uncompressed file that isn’t tiny, most of the data will be pixels, and flipping a bit will most likely just change a single pixel. Flipping a bit in a compressed data stream can throw off the decompressor so that a large part of the image is damaged, or the application may crash. The argument is that a slightly damaged file is better than a seriously damaged one.
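
The difference is easy to demonstrate. The following sketch flips one bit in a block of raw bytes and one bit in the zlib-compressed form of the same bytes, then counts how many bytes of the result differ from the original. The helper names and the synthetic "pixel" data are made up for the illustration; it is not a TIFF reader. Note also that the zlib format carries its own checksum, so damage to the compressed stream usually shows up as a decoding error rather than as a garbled image.

```python
import zlib


def flip_bit(data, bit_index):
    """Return a copy of `data` with one bit inverted."""
    damaged = bytearray(data)
    damaged[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(damaged)


def count_differences(a, b):
    """Count differing byte positions, plus any difference in length."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


def damage_after_flip(pixels):
    """Flip the middle bit of the raw and the compressed data.

    Returns (raw_damage, compressed_damage), each the number of bytes
    that differ from the original. compressed_damage is None when the
    damaged stream can no longer be decompressed at all.
    """
    raw_damage = count_differences(pixels, flip_bit(pixels, len(pixels) * 4))

    packed = zlib.compress(pixels)
    damaged_packed = flip_bit(packed, len(packed) * 4)
    try:
        restored = zlib.decompress(damaged_packed)
        compressed_damage = count_differences(pixels, restored)
    except zlib.error:
        compressed_damage = None
    return raw_damage, compressed_damage


# Synthetic 64 x 64 single-channel "image", purely for illustration.
pixels = bytes((x * y) % 251 for y in range(64) for x in range(64))
raw_damage, compressed_damage = damage_after_flip(pixels)
print("raw bytes damaged:", raw_damage)
print("compressed bytes damaged:", compressed_damage)
```

In the raw case exactly one byte changes. In the compressed case the whole block is typically lost, which is the scenario the TI/A contributors are worried about.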

This theory looks like a bad one to me. First, it implies that the archive will trust damaged files to some extent. An uncompressed file with bit damage may just have a bad pixel, but the damage could be in the file header, the tags, or the ICC profile, seriously damaging the file or making it unusable. Second, the risk of bit damage to an uncompressed file is greater, simply because it’s bigger. At the same time, it takes up more storage space, so the archive can’t do as much backing up on a given budget. Lossless compression (LZW or ZIP) often reduces a file to less than half its original size, which means that an original file and a backup can be stored in the same amount of space as an uncompressed file.
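
To put rough numbers on this, here is a back-of-the-envelope sketch. Everything in it is an assumption made for illustration: the per-bit error rate is a made-up figure, not a measurement of any storage system, bit errors are treated as independent, and the compressed file is taken to be exactly half the size of the original. The function name is hypothetical.

```python
import math


def p_any_bit_damaged(size_bytes, per_bit_error_rate):
    """Probability that at least one bit in the file has flipped.

    Computes 1 - (1 - p) ** n for n bits, using log1p/expm1 so the
    result stays accurate when p is very small.
    """
    n_bits = size_bytes * 8
    return -math.expm1(n_bits * math.log1p(-per_bit_error_rate))


# Assumed figures, for illustration only.
PER_BIT_ERROR_RATE = 1e-12             # hypothetical, not measured
UNCOMPRESSED_SIZE = 100 * 1000 * 1000  # a 100 MB uncompressed TIFF
COMPRESSED_SIZE = UNCOMPRESSED_SIZE // 2

p_uncompressed = p_any_bit_damaged(UNCOMPRESSED_SIZE, PER_BIT_ERROR_RATE)
p_compressed = p_any_bit_damaged(COMPRESSED_SIZE, PER_BIT_ERROR_RATE)

# Two compressed copies occupy the same space as one uncompressed file.
# The image is lost only if both copies are damaged.
p_both_copies = p_compressed ** 2

print("one uncompressed copy damaged:", p_uncompressed)
print("one compressed copy damaged:  ", p_compressed)
print("both compressed copies damaged:", p_both_copies)
```

Under these assumptions, the chance of losing both compressed copies is several orders of magnitude smaller than the chance of any damage to the single uncompressed file, for the same storage footprint. The exact figures depend entirely on the assumed error rate; the comparison is the point.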

Not all compression is equal. Disallowing lossy compression in archival TIFF files may make sense for other reasons, and TIFF’s original JPEG compression scheme is deprecated. But insisting on uncompressed files to improve their ability to withstand bit rot strikes me as a foolish precaution.
