Andy Jackson wrote an interesting post on the question of HTML validity. Only registered Typepad users can comment, so it’s easier for me to add something to the discussion here.
When I worked on JHOVE, I had to address the question of valid HTML. A few issues are straightforward; the angle brackets for tags have to be closed, and so do quoted strings. Beyond that, everything seems optional. There are no required elements in HTML, not even html, head, or body; a blank file or a plain text file with no tags can be a valid HTML document. The rules of HTML are designed to be forgiving, which just makes it harder to tell if a document is valid or not. I’ve recommended that JHOVE users not use the HTML module; it’s time-consuming and doesn’t give you much useful information.
There are things in XHTML which aren’t legal in HTML. The “self-closing” tag (<tag/>) is good XHTML, but not always legal HTML. In HTML5, <input ... /> is legal, but <span ... /> isn’t, because input is a void element that never takes a closing tag, while span requires one. (In other words, the self-closing form is legal only when it’s superfluous.) However, any recent browser will accept both of them.
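To make the void-element rule concrete, here is a minimal sketch using Python's standard html.parser module. It flags self-closing syntax on elements that aren't void. The names SelfClosingChecker and find_bad_self_closing are made up for illustration, the void-element set is my transcription of the list in the HTML standard, and the sketch ignores embedded SVG and MathML, where self-closing tags are allowed.

```python
from html.parser import HTMLParser

# Void elements per the HTML standard (transcribed by hand; check the spec
# before relying on this list).
VOID_ELEMENTS = {
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
}


class SelfClosingChecker(HTMLParser):
    """Hypothetical helper: records non-void tags written as <tag ... />."""

    def __init__(self):
        super().__init__()
        self.problems = []

    def handle_startendtag(self, tag, attrs):
        # html.parser calls this for any tag that ends with "/>".
        if tag not in VOID_ELEMENTS:
            self.problems.append((tag, self.getpos()))


def find_bad_self_closing(html_text):
    """Return (tag, (line, column)) for each self-closed non-void element."""
    checker = SelfClosingChecker()
    checker.feed(html_text)
    checker.close()
    return checker.problems


if __name__ == "__main__":
    sample = '<input type="text" /><span class="x" />'
    for tag, (line, col) in find_bad_self_closing(sample):
        print(f"<{tag} /> at line {line}, column {col} is not a void element")
```

This only catches one narrow class of problem, which is rather the point: it says nothing about whether a browser will render the page sensibly.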
The set of HTML documents which are de facto acceptable and unambiguous is much bigger than the set which is de jure correct. Unfortunately, the former is a fuzzy set. How far can you push the rules before you’ve got unsafe, ambiguous HTML? It depends on which browsers and versions you’re looking at, and how strenuous your test cases are.
The problem goes beyond HTML proper. Most browsers deal with improper tag nesting, but JavaScript and CSS can raise bigger issues. Scripts and stylesheets are very apt to use vendor-specific features, and they may cause major rendering problems in browsers for which they weren’t tested. A document with broken JavaScript can be perfectly valid, as far as the HTML spec is concerned.
It’s common for JavaScript to be included by an external reference, often on a completely different website. These scripts may themselves have external dependencies. Following the dependency chain is a pain, but without all of them the page may not work properly. I don’t have data, but my feeling is that far more web pages are broken because of bad scripts and external references than because of bad HTML syntax.
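As a rough illustration of the first step in following that chain, this sketch lists the script references that appear directly in a page's markup and marks which ones point at a different host. ScriptCollector and external_scripts are hypothetical names, and the example URLs are placeholders. It sees only static src attributes; anything a script loads at run time, and anything those scripts depend on in turn, is invisible to it.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class ScriptCollector(HTMLParser):
    """Hypothetical helper: collects the src of every <script src=...> tag."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag != "script":
            return
        src = dict(attrs).get("src")
        if src:
            # Resolve relative references against the page's own URL.
            self.scripts.append(urljoin(self.page_url, src))


def external_scripts(html_text, page_url):
    """Return (absolute_url, is_other_host) for each referenced script."""
    collector = ScriptCollector(page_url)
    collector.feed(html_text)
    collector.close()
    page_host = urlparse(page_url).netloc
    return [(url, urlparse(url).netloc != page_host)
            for url in collector.scripts]


if __name__ == "__main__":
    page = (
        '<script src="/js/app.js"></script>'
        '<script src="https://cdn.example.net/lib.js"></script>'
        '<script>var x = 1;</script>'
    )
    for url, other_host in external_scripts(page, "https://example.org/page.html"):
        print(url, "(other host)" if other_host else "(same host)")
```

An archiving tool would have to fetch each of these and repeat the process, and even then it would miss dependencies that only appear when the scripts actually run.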
So what do you do when validating web pages? Thinking of it as “validating HTML” pulls you into a messy area without addressing some major issues. If you insist on documents that are fully compliant with the specs, you’ll probably throw out more than you accept, without any good reason. But at the same time, unless you validate the JavaScript and archive all external dependencies, you’ll accept some documents that have significant preservation issues.
It’s a mess, and I don’t think anyone has a good solution.
Update: Andy Jackson has written a post responding to this, Web Archiving in the JavaScript Age.
Pono’s file format
I’ve been seeing weirdly intense hostility to the Pono music player and service. A Business Insider article implies that it’s a scheme by Apple to make you buy your music all over again at higher prices. Another article complains that it will hold “only” 1,872 tracks and protests that “the Average person” (their capitalization) doesn’t hear any improvement. I wonder if some of these people are outraged because they’re confusing Pono with Bono and thinking this is the new copy-proof file format which he said Apple is working on.
In fact, Pono isn’t using any new format and isn’t introducing DRM. Its files are in the well-known FLAC format. FLAC stands for “Free Lossless Audio Codec.” The term technically refers only to the codec, not the container, but it’s usually delivered in a “Native FLAC” container. It can also be delivered in an Ogg container, providing better metadata support and slightly larger files.
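The two containers are easy to tell apart by their first four bytes: a native FLAC stream starts with the marker "fLaC", and an Ogg stream starts with "OggS". Here is a small sketch of that check. The function names are made up for illustration, the sketch doesn't confirm that an Ogg file actually carries FLAC rather than some other codec, and a file with a tag block prepended in front of the stream would come back as "unknown".

```python
def sniff_flac_container(first_bytes):
    """Guess the container from the leading bytes of a file (hypothetical helper)."""
    if first_bytes.startswith(b"fLaC"):
        return "native FLAC"
    if first_bytes.startswith(b"OggS"):
        # Ogg can carry several codecs; this sketch does not look inside.
        return "Ogg (codec not checked)"
    return "unknown"


def sniff_file(path):
    """Read just enough of a file to apply the check above."""
    with open(path, "rb") as f:
        return sniff_flac_container(f.read(4))


if __name__ == "__main__":
    print(sniff_flac_container(b"fLaC\x00\x00\x00\x22"))
    print(sniff_flac_container(b"OggS\x00\x02"))
```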
The “lossless” part of the name refers to FLAC’s compression. MP3 uses lossy compression, which removes some information, sacrificing a little audio quality to make the file smaller. FLAC discards nothing, so for the same sampling rate and bit resolution it gives better quality at the cost of a larger file. According to CNET, “Pono’s recordings will range from CD-quality 16-bit/44.1kHz to 24-bit/192kHz ‘ultra-high resolution.’” By the Nyquist theorem, a 192 kHz sampling rate can represent frequencies up to 96 kHz (192 divided by 2), which is way beyond the threshold of human hearing, so it’s understandable that people are skeptical about whether it offers any benefit over a lower sampling rate. Frequencies that high are normally filtered out.
FLAC is non-proprietary and DRM-free, and it has an open source reference implementation. Someone could put FLAC into a DRM container, but then why not use a proprietary encoding? Using FLAC is a step forward from MP3, which is patent-encumbered and has license requirements that effectively lock out free software.
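The arithmetic behind those two figures is simple enough to work through. This sketch computes the Nyquist limit and the raw, uncompressed PCM data rate for both ends of the range CNET quotes. The function names are mine, and two-channel stereo is my assumption; the quote doesn't specify a channel count. FLAC's compression will shrink the actual files by an amount that depends on the music.

```python
def nyquist_hz(sample_rate_hz):
    """Highest frequency a given sampling rate can represent."""
    return sample_rate_hz / 2


def pcm_bits_per_second(sample_rate_hz, bits_per_sample, channels=2):
    """Raw PCM data rate before any compression (stereo assumed by default)."""
    return sample_rate_hz * bits_per_sample * channels


if __name__ == "__main__":
    for label, rate, bits in [("CD quality", 44_100, 16),
                              ("ultra-high resolution", 192_000, 24)]:
        bps = pcm_bits_per_second(rate, bits)
        mb_per_minute = bps * 60 / 8 / 1_000_000
        print(f"{label}: Nyquist limit {nyquist_hz(rate) / 1000:g} kHz, "
              f"{bps / 1000:g} kbit/s raw, about {mb_per_minute:.1f} MB per minute")
```

Under the stereo assumption, the top rate carries about 6.5 times as much raw data as CD quality, which goes a long way toward explaining the complaints about how few tracks fit on the player.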
iTunes doesn’t support FLAC files, so the Business Insider claim that Pono is Apple’s way of making you buy music over again is idiotic. It’s like saying Windows 8 is an Apple scheme to make you buy new software.
As the number of gigabytes you can stick in your pocket keeps growing, the need for compression decreases. For many people, the amount of music they can store takes priority over improved sound quality, but some will pay for a high-end player that gives them the best sound possible. I don’t get why this infuriates so many critics. At any rate, the file format shouldn’t scare anyone.
For more discussion of FLAC as it relates to Pono, see “What is FLAC? The high-def MP3 explained” on CNET’s site; the headline is totally wrong, but the article itself is good.
Posted in commentary
Tagged audio, FLAC, music, Pono