How to approach the file format validation problem

For years I wrote most of the code for JHOVE. With each format, I wrote tests for whether a file is “well-formed” and “valid.” With most formats, I never knew exactly what these terms meant. They come from XML, where they have clear meanings. A well-formed XML file has correct syntax. Angle brackets and quote marks match. Closing tags match opening tags. A valid file is well-formed and follows its schema. A file can be well-formed but not valid, but it can’t be valid without being well-formed.

With most other formats, there’s no definition of these terms. JHOVE applies them anyway. (I wrote the code, but I didn’t design JHOVE’s architecture. Not my fault.) I approached them by treating “well-formed” as meaning syntactically correct, and “valid” as meaning semantically correct. Drawing the line wasn’t always easy. If a required date field is missing, is the file not well-formed or just not valid? What if the date is supposed to be in ISO 8601 format but isn’t? How much does it matter?
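
Here's roughly how that line can be drawn, as a minimal Python sketch (not JHOVE's actual code, which is Java, and the field names here are invented): a parse failure counts against well-formedness, while a missing or malformed date counts against validity.

    import re

    # Calendar-date form of ISO 8601 only (YYYY-MM-DD); the full
    # standard allows many more shapes.
    ISO_8601_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")

    def check_date_field(fields):
        """Return (well_formed, valid, messages) for a hypothetical
        format whose spec requires a 'date' field in ISO 8601 form.
        Reaching this point means the file parsed, so it's treated
        as well-formed; date problems count only against validity."""
        date = fields.get("date")
        if date is None:
            return True, False, ["required date field is missing"]
        if not ISO_8601_DATE.match(date):
            return True, False, ["date %r is not ISO 8601" % date]
        return True, True, []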

Is validation the right criterion?

Paul Wheatley’s article, “A valediction for validation?” asks these and other questions about validation. It’s traditional in digital preservation to insist that files be validated before archiving, but is that always the right criterion? If a file is missing a date field or has it in the wrong format, how much does that detract from its preservation value? Conversely, a file which has a broken link to an important URL may be valid yet not very useful.

Software should be strict when creating files but lenient when reading them. This makes it harder to find a clear boundary between good and bad ones. Many files which don’t strictly comply with their format specifications cause no problems. Some formats don’t even have clear specifications. TIFF in practice is a combination of a 1992 specification, several technical notes, and widely accepted traditions.
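
To make the lenient-reading half concrete, here's a Python sketch built on a real TIFF rule: the specification says the entries in an IFD must be sorted in ascending tag order, a rule plenty of writers break and most readers quietly tolerate. (The handling is simplified; a real reader does far more.)

    import struct

    def read_ifd_tags(data, offset, big_endian):
        """Return the tag numbers in one TIFF IFD, plus any warnings.
        An IFD is a 2-byte entry count followed by 12-byte entries,
        each beginning with a 2-byte tag."""
        fmt = ">H" if big_endian else "<H"
        warnings = []
        (count,) = struct.unpack_from(fmt, data, offset)
        tags = [struct.unpack_from(fmt, data, offset + 2 + 12 * i)[0]
                for i in range(count)]
        if tags != sorted(tags):
            # A strict writer never produces this; a lenient reader
            # notes the violation and sorts instead of rejecting.
            warnings.append("IFD entries not in ascending tag order")
        return sorted(tags), warnings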

The assumption built into JHOVE is that people in the future will discover files in a given format and the specification of the format, and they’ll have to recover the content based on just that information. This isn’t necessarily a reasonable assumption. It’s more likely that there will be some continuity between today’s world and that future. Today’s software may be available, even if it requires special equipment. If there isn’t any continuity, even recovering and understanding the spec may be impossible.

A risk-based approach

Perhaps, instead of declaring files valid or invalid, it would make more sense to assign risk factors to files. This would allow archiving of files that have some problems. Archivists would make a judgment based on the value of the material.

Suppose you found a recording of Abraham Lincoln giving a speech. (It’s just barely possible.) Would you throw it away because the quality was terrible? What’s worth archiving is a contextual judgment.

Here’s a possible set of risk levels that could provide a framework for such judgments; a code sketch of the idea follows the list.

  • No problems: Any software that handles the format should be able to open the file and present its content without loss of information.
  • Minor defects: The file is missing some required information or holds it in a way that may present problems. Any applicable software should be able to open the file, but it may be missing some expected information. The central content is accessible.
  • Recoverably damaged: Some important information is missing or degraded. Software may need to perform error recovery on the file. It’s possible to extract a significant amount of central content.
  • Seriously damaged: Only a limited amount of central content is available. Software with special recovery capabilities may be necessary.
  • Irrecoverable: No significant information is available. Some metadata may be available, but there is no usable central content to give it a context. The file’s format may just have been misidentified.

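One possible encoding of those levels in Python, with a deliberately crude acceptance rule (the names and the rule are mine, not any standard's): the point is that the decision takes two inputs, risk and value, instead of a single valid/invalid bit.

    from enum import IntEnum

    class Risk(IntEnum):
        NO_PROBLEMS = 0
        MINOR_DEFECTS = 1
        RECOVERABLY_DAMAGED = 2
        SERIOUSLY_DAMAGED = 3
        IRRECOVERABLE = 4

    def worth_keeping(risk, material_value):
        """material_value is a judgment call on the same 0-4 scale:
        an unimportant file might rate 1 (so minor defects at most),
        while a Lincoln recording would rate high enough to keep
        even seriously damaged content."""
        return material_value >= risk
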
To make this concrete, consider a hypothetical set of successively more broken TIFF images; a triage sketch follows the list.

  • No problems: The file works with any reasonable software, and all required metadata fields are present.
  • Minor defects: Some fields have incorrect tag data types (the TIFF specification is vague on this point) or some characters can’t be rendered, but there’s nothing worse.
  • Recoverably damaged: The full-resolution image is good, but the thumbnail can’t be rendered.
  • Seriously damaged: Software can render half the image before breaking off into visual noise.
  • Irrecoverable: This “TIFF” file is really a PDF.

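A triage sketch along those lines, reusing the Risk levels from the earlier snippet: the signature check settles the misidentified-as-TIFF case cheaply, while grading the levels in between would need a real decoder, so that part is only a placeholder.

    def triage_tiff(data):
        """Classify a purported TIFF file by its header alone."""
        # TIFF starts with "II" (little-endian) or "MM" (big-endian)
        # followed by the number 42 in the matching byte order.
        if data[:4] not in (b"II*\x00", b"MM\x00*"):
            if data[:5] == b"%PDF-":
                return Risk.IRRECOVERABLE, "no TIFF signature; looks like PDF"
            return Risk.IRRECOVERABLE, "no TIFF signature"
        return Risk.NO_PROBLEMS, "header OK; deeper checks not implemented"
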
These categories don’t have rock-solid boundaries, of course. A file which seems irrecoverable may be recoverably damaged with better software. An “irrecoverable” file may be a perfectly good file in a different format. What they do is provide levels of risk for judging how to deal with files. Unimportant files may be acceptable only if they have minor defects at most. A recording of Lincoln would be worth keeping even if it’s seriously damaged.

Keep, repair, or discard?

Software can sometimes turn files with minor defects or recoverable damage into good files, but there’s a risk. Removing the damage may remove information that better software could have recovered. Paul points out the risk when he asks, “What do we actually want to achieve?” If a “defect” violates the letter of the spec but causes no problems in rendering the file, the risk from repairing it may be greater than the risk from leaving it alone. Of course, archiving the original along with the corrected version mitigates the risk. Archives should keep the original unless storage limitations make keeping both impractical.

Those format recovery people (or visiting aliens) in the far future will be smart enough to reconstruct software from a specification. They aren’t likely to be so rigid that they’ll give up when files don’t exactly follow the spec. Recovery will be a trial-and-error process, and if files have small problems such as not being aligned on even byte boundaries, they aren’t just going to throw up their hands or tentacles.

Adopting a risk-based approach to file validation will let archivists make better judgments, taking into account the varying needs of different cases. It would be an improvement over artificially precise categories.