HTML and fuzzy validity

Andy Jackson wrote an interesting post on the question of HTML validity. Only registered Typepad users can comment, so it’s easier for me to add something to the discussion here.

When I worked on JHOVE, I had to address the question of valid HTML. A few issues are straightforward; the angle brackets for tags have to be closed, and so do quoted strings. Beyond that, everything seems optional. There are no required elements in HTML, not even html, head, or body; a blank file or a plain text file with no tags can be a valid HTML document. The rules of HTML are designed to be forgiving, which just makes it harder to tell if a document is valid or not. I’ve recommended that JHOVE users not use the HTML module; it’s time-consuming and doesn’t give you much useful information.

There are things in XHTML which aren’t legal in HTML. The “self-closing” tag (<tag/>) is good XHTML, but not always legal HTML. In HTML5, <input ... /> is legal, but <span ... /> isn’t, because input doesn’t require a closing tag but span does. (In other words, it’s legal only when it’s superfluous.) However, any recent browser will accept both of them.

The set of HTML documents which are de facto acceptable and unambiguous is much bigger than the set which is de jure correct. Unfortunately, the former is a fuzzy set. How far can you push the rules before you’ve got unsafe, ambiguous HTML? It depends on which browsers and versions you’re looking at, and how strenuous your test cases are.

The problem goes beyond HTML proper. Most browsers deal with improper tag nesting, but JavaScript and CSS can raise bigger issues. These are very apt to have vendor-specific features, and they may have major rendering problems in browsers for which they weren’t tested. A document with broken JavaScript can be perfectly valid, as far as the HTML spec is concerned.

It’s common for JavaScript to be included by an external reference, often on a completely different website. These scripts may themselves have external dependencies. Following the dependency chain is a pain, but without them all the page may not work properly. I don’t have data, but my feeling is that far more web pages are broken because of bad scripts and external references than because of bad HTML syntax.

So what do you do when validating web pages? Thinking of it as “validating HTML” pulls you into a messy area without addressing some major issues. If you insist on documents that are fully compliant with the specs, you’ll probably throw out more than you accept, without any good reason. But at the same time, unless you validate the JavaScript and archive all external dependencies, you’ll accept some documents that have significant preservation issues.

It’s a mess, and I don’t think anyone has a good solution.

Update: Andy Jackson has written a post responding to this, Web Archiving in the JavaScript Age.

Comments are closed.