Figuring out the PDF version is harder than you think

In a GitHub comment, Johan van der Knijff noted how messy it is to determine the version of a PDF file. He looked at a file with the header characters “%PDF-1.8”. DROID says this isn’t a PDF file at all.

By a strict reading of the PDF specification, it isn’t. The version number has to be in the range 1.0 through 1.7. Being this strict seems like a bad idea, since format recognition software would then fail to recognize any future version of the format. (JHOVE doesn’t care what character comes after the period.)
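To make the difference between the strict and the lenient reading concrete, here is a minimal Python sketch. The names `read_header_version` and `is_known_version` are made up for illustration, and this is not how DROID or JHOVE is implemented. It accepts any single-digit version in the header and reports separately whether the version falls in the 1.0 through 1.7 range.

```python
import re

# Matches "%PDF-M.m" at the very start of the data; any single digits are accepted.
HEADER_RE = re.compile(rb"%PDF-(\d)\.(\d)")


def read_header_version(data: bytes):
    """Return (major, minor) from the header, or None if there is no PDF header."""
    m = HEADER_RE.match(data)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2))


def is_known_version(version) -> bool:
    """True only for 1.0 through 1.7, the range the strict reading allows."""
    return version is not None and version[0] == 1 and 0 <= version[1] <= 7


if __name__ == "__main__":
    v = read_header_version(b"%PDF-1.8\n")
    # The file is still recognized as PDF; the version is just flagged as unknown.
    print(v, is_known_version(v))
```

Keeping "is this a PDF?" apart from "is this a version I know?" lets a tool keep recognizing the format when a new version appears.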

But, he goes on to note, figuring out the actual version of a file is harder than that. The specification (section 7.5.2) says:

Beginning with PDF 1.4, the Version entry in the document’s catalog dictionary (located via the Root entry in the file’s trailer, as described in 7.5.5, “File Trailer”), if present, shall be used instead of the version specified in the Header.

Johan says this means the version number in the header is deprecated. That’s not correct; in the absence of a version number in the document catalog dictionary, the header specifies the version, and the header is always required to give a version. The catalog dictionary can override it.
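The rule in the quoted sentence is small enough to write down. This sketch assumes the two values have already been extracted as strings such as "1.4"; the function name `effective_version` is hypothetical, and the code only encodes the sentence quoted above, not any other provisions of the specification.

```python
from typing import Optional


def effective_version(header_version: str, catalog_version: Optional[str] = None) -> str:
    """Pick the version per the quoted rule: the catalog entry, if present, wins."""
    if catalog_version is not None:
        return catalog_version
    # No Version entry in the catalog: the header is authoritative.
    return header_version


if __name__ == "__main__":
    print(effective_version("1.4"))         # header only
    print(effective_version("1.4", "1.6"))  # catalog overrides the header
```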

What use case might have prompted this feature? Most likely it’s the common practice of editing a PDF file by appending to it. PDF is messy to change, and appending to a file is generally the safest approach. While doing this, an editor might add features that belong to a later version than the one the original file followed. By putting the new version number in the catalog dictionary, the editor avoids even so simple a change as overwriting the header bytes.

This means, though, that reliably identifying the version of a file requires parsing a dictionary, which is one of the harder things to implement in a PDF parser. This is a problem for software that relies on “magic numbers” to identify files.
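To show why there is no easy shortcut, here is a deliberately naive Python sketch that scans the raw bytes for a Version entry instead of parsing anything. The function name `guess_catalog_version` is hypothetical. The approach is a heuristic with known holes: it does not follow the trailer’s Root entry to the catalog, it cannot see a catalog stored in a compressed object stream, and it may match a Version key in some unrelated dictionary or in stream data.

```python
import re

# Looks for "/Version /M.m" anywhere in the file. This is NOT a dictionary parse.
VERSION_ENTRY_RE = re.compile(rb"/Version\s*/(\d\.\d)")


def guess_catalog_version(data: bytes):
    """Return the last "/Version /M.m" value found in the raw bytes, or None.

    The last match is used on the assumption that appended updates come later
    in the file; nothing here checks that the match is in the actual catalog.
    """
    matches = VERSION_ENTRY_RE.findall(data)
    if not matches:
        return None
    return matches[-1].decode("ascii")


if __name__ == "__main__":
    sample = (
        b"%PDF-1.4\n"
        b"1 0 obj\n<< /Type /Catalog /Version /1.6 /Pages 2 0 R >>\nendobj\n"
    )
    print(guess_catalog_version(sample))
```

A guess like this might be acceptable as a hint, but a reliable answer needs the real lookup through the trailer and the catalog dictionary.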

The JHOVE module for PDF has a comment, which I probably wrote, saying, “The implementation notes (though not the spec) allow an alternative signature of %!PS-Adobe-N.n PDF-M.m … However, this is not PDF/A compliant.” I can’t figure out what implementation notes this refers to, but JHOVE does allow a header in that format. At this point, I can’t say whether it should do that or not. JHOVE checks the catalog dictionary and will override the header’s version number if it finds a Version entry, per the spec.
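For illustration only, a header of the form quoted in that comment could be matched with a pattern like the one below. This is a Python sketch, not JHOVE’s code; the name `read_alt_header_version` is made up, and treating N.n and M.m as single digits is an assumption.

```python
import re

# Matches "%!PS-Adobe-N.n PDF-M.m" at the start of the data and captures M.m.
ALT_HEADER_RE = re.compile(rb"%!PS-Adobe-\d\.\d PDF-(\d\.\d)")


def read_alt_header_version(data: bytes):
    """Return the PDF version from the alternative signature, or None."""
    m = ALT_HEADER_RE.match(data)
    if m is None:
        return None
    return m.group(1).decode("ascii")


if __name__ == "__main__":
    print(read_alt_header_version(b"%!PS-Adobe-3.0 PDF-1.3\n"))
```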

We can’t really expect software such as DROID and the command-line file utility to parse PDF dictionaries, since they deal with huge numbers of different formats. This means they’ll sometimes report an incorrect version number. It seems unavoidable. How much of a problem this is depends on how common the use case I described is. I suspect it’s rare, but I don’t have any data.

One response to “Figuring out the PDF version is harder than you think”

  1. If you do want a full parse, check out Apache Tika and Apache PDFBox. Johan helped us get this right, too: