Articles about JHOVE, such as Good GIF Hunting, grab my attention for obvious reasons. This article talks about false positive and negative results, and got me to thinking: What constitutes a “positive” result in file format validation? There are two ways to look at it:
- The default assumption is that the file is of a certain format, perhaps based on its extension, MIME type, or other metadata. The software sets out to see if it violates the format’s requirements. In that case, a positive result is that the file doesn’t conform to the requirements.
- The default assumption is that the file is just a collection of bytes. The software matches it against one or more sets of criteria. A positive result is that the file matches one of them.
Some software takes one approach, some takes the other. Both approaches have their uses. JHOVE follows the second approach; it runs several modules on a file, and if none of them say it’s well-formed and valid, then it’s a “bytestream.” Single-format validators, such as VeraPDF, initially assume that a file is an instance of the format and try to find deviations from the standard. In that case, it’s a “positive” result if the file isn’t a proper PDF/A.
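To make the contrast concrete, here is a minimal Python sketch of the two approaches. It is not JHOVE's or VeraPDF's code or API; the function names (`gif_problems`, `pdf_problems`, `identify`) and the `MODULES` table are hypothetical, and the checks are toy ones (signature, minimum length, trailer) that fall far short of real validation.

```python
from typing import Callable, Dict, List

GIF_SIGNATURES = (b"GIF87a", b"GIF89a")
GIF_TRAILER = b"\x3b"  # a GIF data stream ends with this byte


def gif_problems(data: bytes) -> List[str]:
    """First approach: assume the file is a GIF and list every deviation found.

    Toy checks only; a real validator parses every block.
    """
    problems = []
    if data[:6] not in GIF_SIGNATURES:
        problems.append("missing GIF87a/GIF89a signature")
    if len(data) < 13:  # 6-byte header + 7-byte logical screen descriptor
        problems.append("too short for header and logical screen descriptor")
    if not data.endswith(GIF_TRAILER):
        problems.append("missing trailer byte 0x3B")
    return problems


def pdf_problems(data: bytes) -> List[str]:
    """Same idea for PDF, again with deliberately superficial checks."""
    problems = []
    if not data.startswith(b"%PDF-"):
        problems.append("missing %PDF- header")
    if b"%%EOF" not in data[-1024:]:
        problems.append("missing %%EOF marker near end of file")
    return problems


# Hypothetical module table: format name -> function returning a problem list.
MODULES: Dict[str, Callable[[bytes], List[str]]] = {
    "GIF": gif_problems,
    "PDF": pdf_problems,
}


def identify(data: bytes) -> str:
    """Second approach: start from plain bytes and try each module in turn.

    The first module that finds no problems names the format; if none
    accepts the file, it is reported only as a bytestream.
    """
    for name, check in MODULES.items():
        if not check(data):
            return name
    return "bytestream"


if __name__ == "__main__":
    # A 13-byte GIF header and screen descriptor with the trailer cut off.
    truncated = b"GIF89a" + b"\x01\x00\x01\x00\x00\x00\x00"
    print(identify(truncated))
    print(gif_problems(truncated))
```

With the truncated sample, `identify` can only say “bytestream,” while `gif_problems` names the missing trailer. Both functions run the same checks; what differs is the starting assumption and how much of the result reaches the user.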
The disadvantage of JHOVE’s approach is that if a file almost matches a format but has defects, it will treat the result as a negative. It will call the file a bytestream, which really means it has no clue what it is. This is fine when you only want valid files and have to discard the rest. If you know that a file is supposed to be an instance of a format (GIF in the OPF article), then you want to know how it failed. From that perspective, a positive (interesting) result is a report that it isn’t a GIF. The article treats “not GIF” as a positive. The reversal reminds us that JHOVE is at least slightly out of its element when focusing on files that are assumed to have a known format.
The difference among validation tools isn’t just how many false matches or mismatches they report, but what they tell you about the mismatches. JHOVE hasn’t generally been great about telling you why a file isn’t well-formed or valid; its messages are often cryptic and technical (though they’ve been getting better since OPF took over maintenance from me). We’re talking about two different approaches to software design, with different purposes.
JHOVE’s main purpose is to answer the question: “Is this file, which we initially know nothing about, a good instance of a known format?” VeraPDF addresses the question: “Is this file, which purports to be a PDF/A, really a good PDF/A?” They start from different assumptions. Both kinds of software have their purpose, and the best choice depends heavily on what question you need to answer.