Recently I came across the term “fuzzing” for intentionally damaging files to test the software that reads them. Most of the material I’ve found doesn’t provide a useful introduction; it assumes that if you know the term, you already understand something about it. One good article is “Fuzzing — Mutation vs. Generation” on the Infosec website. According to that article, fuzzing denotes the response to file changes rather than the changes themselves, but I mostly see the term used in the latter sense.
Whether it’s called “fuzzing” or “fuzz testing,” it’s certainly a useful concept. Errors in a file can cause random misbehavior or crash an application. In the worst case, they create security holes that deliberately malformed files can exploit. The article describes two approaches: “mutation,” or random changes, and “generation,” or changes based on an understanding of the format. Generation is better at targeting weaknesses the test creator can anticipate, but mutation can catch errors no one has thought of.
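In its simplest form, mutation fuzzing is nothing more than flipping bits or bytes at random; no knowledge of the format is needed. As a rough sketch (GNU coreutils assumed, and the file names are just placeholders), overwriting a single random byte of a copy looks like this:

# Make a working copy, then clobber one byte at a random offset.
cp sample.jpg mutated.jpg
size=$(stat -c %s mutated.jpg)
offset=$(shuf -i 0-$((size - 1)) -n 1)
dd if=/dev/urandom of=mutated.jpg bs=1 count=1 seek=$offset conv=notrunc

A generation-based fuzzer, by contrast, would build or parse the file’s structure and tamper with specific fields, such as a segment length in a JPEG header.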
Fuzz testing is important for software that handles untrusted files, including repository ingest software and public Web applications. Files containing invalid pointers and out-of-bounds data lengths are good places to start. It’s especially important for software written in languages like C that don’t automatically do bounds checking. (I really don’t understand why people still write code that operates on untrusted data in those languages. As computers keep getting faster, can’t some speed be traded off for security?)
zzuf is a widely used fuzzer. On a Debian system, you can get it with “apt-get install zzuf”. It randomly munges files, and the -r parameter sets the proportion of bits that will be changed.
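If I’m reading the manual right, running zzuf with no command makes it act as a filter, reading the original from standard input and writing the munged data to standard output, so producing test files at different damage levels is just a pair of one-liners (the -s option seeds the random generator so a run can be repeated; the file names are placeholders):

zzuf -r 0.004 -s 1 < sample.jpg > sample-light.jpg
zzuf -r 0.05 -s 1 < sample.jpg > sample-heavy.jpg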
I tried fuzzing a JPEG file, first with the default ratio of 0.004, then with a ratio of 0.05 (munging 1 bit in 20). ExifTool and file were both happy with the first file, but JHOVE recognized it was broken (“Expected marker byte 255, got 253”). All three tools said the second file wasn’t a JPEG file at all.
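The checks amount to pointing each tool at a fuzzed copy; something along these lines should do it (the file name follows the sketch above, and JHOVE’s -m option selects the format module):

# Quick identification with ExifTool and file, then validation with JHOVE.
exiftool sample-light.jpg
file sample-light.jpg
jhove -m JPEG-hul sample-light.jpg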
Using zzuf or other fuzzing tools seems like an important part of testing file validation software, though I haven’t seen it discussed in a digital preservation context.
Susan Thomas mentions a tool written by Dr Manfred Thaller to simulate digital aging by randomly zeroing bytes in a target file (https://github.com/mcarden/shotgun); her account is at http://blogs.bodleian.ox.ac.uk/archivesandmanuscripts/2008/07/31/dpcs-preservation-planning-workshop/.