Andy Jackson wrote an interesting post on the question of HTML validity. Only registered Typepad users can comment, so it’s easier for me to add something to the discussion here.
When I worked on JHOVE, I had to address the question of valid HTML. A few issues are straightforward: the angle brackets for tags have to be closed, and so do quoted strings. Beyond that, everything seems optional. There are no required elements in HTML, not even body; a blank file or a plain text file with no tags can be a valid HTML document. The rules of HTML are designed to be forgiving, which just makes it harder to tell whether a document is valid. I’ve recommended that JHOVE users not use the HTML module; it’s time-consuming and doesn’t give you much useful information.
There are things in XHTML which aren’t legal in HTML. The “self-closing” tag (<tag/>) is good XHTML, but not always legal HTML. In HTML5, <input ... /> is legal, but <span ... /> isn’t, because input doesn’t require a closing tag but span does. (In other words, it’s legal only when it’s superfluous.) However, any recent browser will accept both of them.
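To make the self-closing rule concrete, here is a rough sketch in Python of a check that flags the self-closing syntax on elements that need a closing tag. It relies on the standard library's html.parser, which is itself a forgiving parser and happily reports both forms as start-and-end tags. The names VOID_ELEMENTS, SelfClosingChecker and misused_self_closing are mine, the list of void elements is my reading of the current HTML standard, and none of this is how JHOVE does it.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag (assumed list, taken from the
# HTML standard's "void elements"). Only on these is "/>" permitted.
VOID_ELEMENTS = frozenset({
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
})


class SelfClosingChecker(HTMLParser):
    """Collects tags written as <tag ... /> that are not void elements."""

    def __init__(self):
        super().__init__()
        self.problems = []

    def handle_startendtag(self, tag, attrs):
        # html.parser calls this for any tag ending in "/>", whether or
        # not HTML allows it there; tag names arrive lowercased.
        if tag not in VOID_ELEMENTS:
            self.problems.append(tag)


def misused_self_closing(markup):
    """Return the names of non-void elements written with "/>"."""
    checker = SelfClosingChecker()
    checker.feed(markup)
    checker.close()
    return checker.problems


if __name__ == "__main__":
    print(misused_self_closing('<input type="text" />'))
    print(misused_self_closing('<span class="x" />'))
```

This only catches one narrow kind of problem, which is rather the point: each individual rule is easy to check, and it's the overall question of validity that stays slippery.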
The set of HTML documents which are de facto acceptable and unambiguous is much bigger than the set which is de jure correct. Unfortunately, the former is a fuzzy set. How far can you push the rules before you’ve got unsafe, ambiguous HTML? It depends on which browsers and versions you’re looking at, and how strenuous your test cases are.
It’s a mess, and I don’t think anyone has a good solution.