Should official online documents be PDF files? Many institutions say they obviously should, but the format has some clear disadvantages. An article on the UK’s Government Digital Service site argues that HTML, not PDF, is the right format for UK government documents. Its arguments, to the extent that they’re valid, apply to lots of other documents.
It makes a plausible case against PDF. The trouble is that the case against HTML is even stronger in some ways.
The limitations of PDF
The original idea of PDF was that it would define a document with a consistent appearance on any computer, whatever software displayed it. The emphasis is on appearance, not content. In the nineties, that was enough. Today, the Internet’s emphasis is on flexible content delivery. Information may appear on a small screen or a big one, it may be printed, or the user may hear it read aloud. Fonts, colors, and sizes can change to suit the user’s needs and preferences.
In the general case, PDF does a poor job of presenting a document’s content. The sequence of text in a file may have little to do with its logical structure. As the article notes, this creates difficulties for accessing and extracting information.
Using tagged PDF and PDF/A to impose order and exclude certain features greatly improves the usability of documents. The article doesn’t mention these profiles but refers to them by implication. Getting good results from them requires working from the original source document, such as an OpenOffice file; tagging a PDF after the fact is possible but usually won’t reflect the document’s structure very well.
The article does have a point. PDF files are inflexible and often don’t serve online needs well. How well PDF stacks up against HTML is another question.
Issues with HTML
HTML has its own problems for official documents. It doesn’t necessarily present a consistent view of a document. In fact, HTML5 is designed so that the same document can have a completely different appearance, even showing different content, in different environments.
Most HTML is broken. Files violate the language’s syntax rules, and browsers piece them back together. It’s not unusual for a file to fail to render at all on some browsers.
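To give a rough idea of the kind of breakage involved, here is a minimal sketch in Python that looks for mismatched tags using the standard library's `html.parser` module. The names `TagBalanceChecker` and `find_tag_problems` are made up for this illustration. The check is also stricter than HTML itself, which lets some end tags (such as `</p>` and `</li>`) be omitted, so treat it as a demonstration rather than a real validator.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag in HTML.
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img",
                 "input", "link", "meta", "source", "track", "wbr"}


class TagBalanceChecker(HTMLParser):
    """Tracks open elements and records tags that don't pair up."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # stack of currently open elements
        self.problems = []    # human-readable descriptions of mismatches

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_ELEMENTS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_ELEMENTS:
            return
        if tag not in self.open_tags:
            self.problems.append(f"stray </{tag}>")
            return
        # Pop back to the matching element; anything popped on the way
        # was opened but never closed.
        while True:
            top = self.open_tags.pop()
            if top == tag:
                break
            self.problems.append(f"unclosed <{top}>")


def find_tag_problems(markup):
    """Return a list of tag mismatches found in an HTML string."""
    checker = TagBalanceChecker()
    checker.feed(markup)
    checker.close()
    # Whatever is still on the stack at the end was never closed.
    return checker.problems + [f"unclosed <{t}>" for t in checker.open_tags]


print(find_tag_problems("<p><b>bold</p>"))
```

A browser will display all of these inputs without complaint, which is exactly why such errors go unnoticed.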
The exact appearance of a legally significant document is important. When the resolution of legal disputes depends on tiny details, differences in what the user sees can affect how courts rule.
Neither HTML nor PDF is entirely satisfactory for presenting official documents on the Internet. One is too free-wheeling, and the other emphasizes appearance over content. Perhaps what’s needed is a standard which defines a restricted HTML that’s comparable to PDF/A. It would guarantee consistent content.
It would be nice to have a standard like this. HTML that conformed to it would have a reliable appearance on any browser, and it wouldn’t be hard to make it backward compatible to many older browser releases. Creating the standard would be a major job, though, and I don’t think anyone with influence has been inclined to tackle it.
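As a thought experiment, checking a file against such a restricted profile could look something like the sketch below. No such standard exists, so the allowed elements and attributes here are purely hypothetical, as are the names `ProfileChecker` and `check_profile`. It uses Python's standard `html.parser` module and checks only which elements and attributes appear, not whether the document would render consistently.

```python
from html.parser import HTMLParser

# Hypothetical allowlist; a real standard would define its own.
ALLOWED_TAGS = {"html", "head", "title", "body", "h1", "h2", "h3", "p",
                "ul", "ol", "li", "table", "tr", "th", "td",
                "em", "strong", "a"}

# Hypothetical per-element attribute allowlist; everything else is rejected.
ALLOWED_ATTRS = {"a": {"href"}, "html": {"lang"}}


class ProfileChecker(HTMLParser):
    """Collects elements and attributes that fall outside the allowlist."""

    def __init__(self):
        super().__init__()
        self.violations = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED_TAGS:
            self.violations.append(f"element <{tag}> not allowed")
            return
        for name, _value in attrs:
            if name not in ALLOWED_ATTRS.get(tag, set()):
                self.violations.append(
                    f"attribute {name!r} not allowed on <{tag}>")


def check_profile(markup):
    """Return a list of profile violations found in an HTML string."""
    checker = ProfileChecker()
    checker.feed(markup)
    checker.close()
    return checker.violations


print(check_profile('<p style="color:red">Hi</p><script>x()</script>'))
```

Excluding scripts and inline styling is the same sort of move PDF/A makes when it excludes certain PDF features: less flexibility in exchange for more predictable content.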
For all its problems, PDF/A has some strong advantages over HTML for documents that need stable and unambiguous content. There’s no fully satisfactory solution.