PDF or HTML for public documents?

Should official online documents be PDF files? Many institutions say they obviously should, but the format has some clear disadvantages. An article on the UK’s Government Digital Service site argues that HTML, not PDF, is the right format for UK government documents. Its arguments, to the extent that they’re valid, apply to lots of other documents.

It makes a plausible case against PDF. The trouble is that the case against HTML is even stronger in some ways.

The limitations of PDF

The original idea of PDF was that it would define a document with a consistent appearance on any computer and in any software. The emphasis is on appearance, not content. In the nineties, that was enough. Today, the Internet’s emphasis is on flexible content delivery. Information may appear on a small screen or a big one, it may be printed, or the user may hear it read aloud. Fonts, colors, and sizes can change to suit the user’s needs and preferences.

In the general case, PDF does a poor job of presenting a document’s content. The sequence of text in a file may have little to do with its logical structure. As the article notes, this creates difficulties for accessing and extracting information.

Using tagged PDF and PDF/A to impose order and exclude certain features greatly improves the usability of documents. The article doesn’t mention these profiles but refers to them by implication. They require working from an original document, such as an OpenOffice file, to get good results; tagging a PDF after the fact is possible but usually won’t reflect the document’s structure very well.

The article does have a point. PDF files are inflexible and often don’t serve online needs well. How well PDF stacks up against HTML is another question.

Issues with HTML

HTML has its own problems for official documents. It doesn’t necessarily present a consistent view of a document. In fact, HTML5 is designed so that the same document can have a completely different appearance, even showing different content, in different environments.

Most HTML is broken. Files violate their own syntax rules, and browsers piece them back together. It’s not unusual for a file to fail to render at all on some browsers.

HTML files routinely have dependencies on other files, often from different hosts. Even a simple file with no JavaScript usually depends on image files and CSS for its appearance.
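To make the dependency problem concrete, here is a minimal sketch in Python that lists the files a page pulls in and marks the ones served from a different host. The names `DependencyCollector` and `list_dependencies`, the example URLs, and the short list of attributes checked are illustrative assumptions, not part of any standard; a real page can reference other files in more ways than this covers (CSS `url()` values, for example).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Tag/attribute pairs that commonly pull in another file.
# This is an illustrative subset, not a complete list; note that
# <link href> covers stylesheets but also other link relations.
RESOURCE_ATTRS = {
    ("img", "src"),
    ("script", "src"),
    ("link", "href"),
    ("iframe", "src"),
}


class DependencyCollector(HTMLParser):
    """Collects the URLs that a page references as external resources."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.dependencies = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if value and (tag, name) in RESOURCE_ATTRS:
                # Resolve relative references against the page's own URL.
                self.dependencies.append(urljoin(self.base_url, value))


def list_dependencies(html_text, base_url):
    """Return (url, is_third_party) pairs for each referenced resource."""
    collector = DependencyCollector(base_url)
    collector.feed(html_text)
    own_host = urlparse(base_url).netloc
    return [
        (url, urlparse(url).netloc != own_host)
        for url in collector.dependencies
    ]
```

Calling `list_dependencies(page_text, page_url)` on even a plain-looking page usually returns several entries, and each one is a file that has to stay available and unchanged for the document to keep its appearance.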

The exact appearance of a legally significant document is important. When the resolution of legal disputes depends on tiny details, differences in what the user sees can affect how courts rule.

Document-quality HTML?

Neither HTML nor PDF is entirely satisfactory for presenting official documents on the Internet. One is too free-wheeling, and the other emphasizes appearance over content. Perhaps what’s needed is a standard which defines a restricted HTML that’s comparable to PDF/A. It would guarantee consistent content.

This “HTML/A” would be free of JavaScript and third-party files. Conforming files would strictly follow HTML5 syntax. Validation would apply to a set of files, including images and CSS. Only files of specified types would be allowed. To be complete, it would need to specify what those files had to conform to.
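A rough sketch of what checking a file against a few such rules might look like, in Python. No “HTML/A” standard exists, so everything here is hypothetical: the rule set, the `ALLOWED_EXTENSIONS` list, and the names `RestrictedHtmlChecker` and `find_problems`. It only looks for scripts, inline event handlers, and references that aren’t local files of an allowed type; it doesn’t validate HTML5 syntax or inspect the referenced files themselves, which a real validator would have to do.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical whitelist of file types a conforming document may reference.
ALLOWED_EXTENSIONS = {".css", ".png", ".jpg", ".svg"}
URL_ATTRS = {"src", "href"}


class RestrictedHtmlChecker(HTMLParser):
    """Flags constructs that a restricted, archival HTML profile might ban."""

    def __init__(self):
        super().__init__()
        self.problems = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.problems.append("script element")
        for name, value in attrs:
            if name.startswith("on"):
                # Inline handlers such as onclick or onload run JavaScript.
                self.problems.append(f"event handler attribute: {name}")
            elif name in URL_ATTRS and value and tag != "a":
                # Ordinary hyperlinks (<a href>) are not dependencies.
                self._check_reference(value)

    def _check_reference(self, url):
        parsed = urlparse(url)
        if parsed.scheme or parsed.netloc:
            # Anything with a scheme or host is not a relative local path.
            self.problems.append(f"non-local reference: {url}")
            return
        path = parsed.path.lower()
        if not any(path.endswith(ext) for ext in ALLOWED_EXTENSIONS):
            self.problems.append(f"disallowed file type: {url}")


def find_problems(html_text):
    """Return a list of rule violations found in one HTML file."""
    checker = RestrictedHtmlChecker()
    checker.feed(html_text)
    return checker.problems
```

An empty list from `find_problems` would mean only that the file passes these few checks. Validating the whole set of files, as described above, would take a good deal more.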

It would be nice to have a standard like this. HTML that conformed to it would have a reliable appearance on any browser, and it wouldn’t be hard to make it backward compatible with many older browser releases. Creating the standard would be a major job, though, and I don’t think anyone with influence has been inclined to tackle it.

“Pure” HTML, with no JavaScript or CSS and with validated compliance, would be one way to go. That wouldn’t produce responsive pages, though. Images would remain the same size regardless of the viewing environment. This isn’t what the Government Digital Service is looking for.

For all its problems, PDF/A has some strong advantages over HTML for documents that need stable and unambiguous content. There’s no fully satisfactory solution.

One response to “PDF or HTML for public documents?”

  1. A standard that fulfills all these requirements does exist, or perhaps more correctly, once existed: Open Document Architecture, CCITT T.411… also ISO 8613. A document conforming to ODA could be stored in formatted, formatted/processable, or processable form. It was a really nice idea but never took off, probably in part because of its horrible storage format based on ASN.1. And of course the major players in this market at the time, Word, WordPerfect, and others, were happy with their proprietary formats.