For years I wrote most of the code for JHOVE. With each format, I wrote tests for whether a file is “well-formed” and “valid.” With most formats, I never knew exactly what these terms meant. They come from XML, where they have clear meanings. A well-formed XML file has correct syntax. Angle brackets and quote marks match. Closing tags match opening tags. A valid file is well-formed and follows its schema. A file can be well-formed but not valid, but it can’t be valid without being well-formed.
With most other formats, there’s no definition of these terms. JHOVE applies them anyway. (I wrote the code, but I didn’t design JHOVE’s architecture. Not my fault.) I approached them by treating “well-formed” as meaning syntactically correct, and “valid” as meaning semantically correct. Drawing the line wasn’t always easy. If a required date field is missing, is the file not well-formed or just not valid? What if the date is supposed to be in ISO 8601 format but isn’t? How much does it matter?
Unicode characters ought to have a specific denotation, even if their exact appearance depends on the font. A letter, a punctuation mark, or a Chinese ideograph should have the same meaning to everyone who reads it. There are problems, of course. There’s no systematic difference in appearance between A, the first letter of the Roman alphabet, and Α, Alpha, the first letter of the Greek alphabet. (However, when I had my computer read this article aloud to me for proofreading, it pronounced the latter as “Greek capital letter alpha”! Nice! It also pronounced the names of the emoji in this article, except the new ones in Unicode 11.0.) In some fonts, you can’t even tell the lower case letter l from the number 1 without looking carefully. This problem allows homograph attacks and “typosquatting.”
But the worst problem is with the Unicode Consortium’s great headache, emoji. These picture characters have just brief verbal descriptions in the Unicode standard, and font designers for different companies produce renderings that have vastly different connotations. Motherboard offers a sampling of the varied renderings. Here’s the “grimacing face” from Apple, Google, Samsung, and LG respectively.
There are two ways to put 3D models into a PDF file. Neither of them is an extension of the two-dimensional PDF model. Rather, they’re technologies which were developed independently, which can be wrapped into a PDF, and which software such as Adobe Acrobat can work with.
PDF has become a container format as much as a representational format. It can hold anything, and some of the things it holds have more or less official status, but there are no common architectural principles. The two formats used with PDF are U3D and PRC. Both are actually independent file formats which a PDF can embed.
Is TIFF a legacy format?
The most recent version of the TIFF specification, 6.0, dates from 1992. Adobe updated it with three technical notes, the latest coming out in 2002. Since then there has been nothing.
The format is solid, but the past quarter-century has seen reasons to enhance it. BigTIFF is a variant of the format to accommodate larger files. It isn’t backward-compatible with TIFF, but the changes mostly concern data lengths and are easy to add to a TIFF interpreter. The format sits in a kind of limbo, since Adobe owns the spec but is no longer updating it. There have been new tags which have achieved consensus acceptance but don’t have official status. AWare Systems has a list of known tags but has no reliable way to say which ones are private and which are generally accepted. There’s no way to add a new compression or encryption algorithm, or any other new feature, and give it official status.
The ISO specification for PDF 2.0 is now out. It’s known as ISO 32000-2. As usual for ISO, it costs an insane 198 Swiss francs, which is roughly the same amount in dollars. In the past, Adobe has made PDF specifications available for free on its own site, but I can’t find it on adobe.com. Its PDF reference page still covers only PDF 1.7.
ISO has to pay its bills somehow, but it’s not good if the standard is priced so high that only specialists can afford it. I don’t intend to spend $200 to be able to update JHOVE without pay. With some digging, I’ve found it in an incomplete, eyes-only format. All I can view is the table of contents. There are links to all sections, but they don’t work. I’m not sure whether it’s broken on my browser or by intention. In any case, it’s a big step backward as an open standard. I hope Adobe will eventually put the spec on its website.
HTML 5.1 is now a W3C proposed recommendation, and the comment period has closed. If no major issues have turned up, it may become a recommendation soon, susperseding HTML 5.0.
Browsers already support a large part of what it includes, so a discussion of its “new” features will cover ones that people already thought were a part of HTML5. The implementations of HTML are usually ahead of the official documents, with heavy reliance on working drafts in spite of all the disclaimers. Things like the
picture element are already familiar, even though they aren’t in the 5.0 specification.
A project to define an archive-safe subset of TIFF has been going on for a long time. Originally it was called the TIFF/A initiative, but Adobe wouldn’t allow the use of the TIFF trademark, so it’s now called the TI/A initiative.
So far it’s been very closed in what it presents to the public. It’s easy enough to sign up and view the discussions; I’ve done that, and I have professional credentials but no inside connections. However, it bothers me that it’s gone so long presenting nothing more to the public than just a white paper and no progress reports.
I’m not going to make anything public which they don’t want to, but I’ll just say that I have some serious disagreements with the approach they’re taking. When they finally do go public, I’m afraid they won’t get much traction with the archival community. Some transparency would have helped to determine whether I’m wrong or they’re wrong.
In a GitHub comment, Johan van der Knijff noted how messy it is to determine the version of a PDF file. He looked at a file with the header characters “%PDF-1.8”. DROID says this isn’t a PDF file at all.
By a strict reading of the PDF specification, it isn’t. The version number has to be in the range 1.0 through 1.7. Being this strict seems like a bad idea, since it would mean format recognition software will fail to recognize any future versions of the format. (JHOVE doesn’t care what character comes after the period.)
In 2001, the Unicode Consortium rejected a proposal to include the Klingon encoding. The reasons it gave were:
Lack of evidence of usage in published literature, lack of organized community interest in its standardization, no resolution of potential trademark and copyright issues, question about its status as a cipher rather than a script, and so on.
Fair enough, but don’t most of these objections apply equally to emoji?
TIFF is a very popular image format, but it can’t handle really huge files. “Really huge” means files bigger than 4 gigabytes, or more precisely, files in which any data offset can’t be represented in 32 bits. That’s not a limitation that comes up often, but some applications, such as medical scans, need enough detail to push the limit.
A dozen years ago, members of the TIFF community at AWare Systems came up with a simple idea: Create a variant of TIFF with 64-bit offsets instead of 32 bits. The result was BigTIFF.