Tag Archives: W3C

Misadventures in XML

Around 6 PM yesterday, our SMIL file delivery broke. At first I figured it for a database connection problem, but the log entries were atypical. I soon determined that retrieval of the SMIL DTD was regularly failing. Most requests would get an error, and those that did succeed took over a minute.

There’s a basic flaw in XML DTDs and schemas (collectively called grammars). They’re identified by a URL, and by default any parser that validates documents by their grammar retrieves it from that URL. For popular ones, that means a lot of traffic. We’ve run into that problem with the JHOVE configuration schema, and that’s nowhere near the traffic a really popular schema must generate.

Knowing this, and also knowing that depending on an outside website’s staying up is a bad idea, we had made our own local copy of the SMIL DTD to reference. So I was extremely puzzled about why access to it had become so terrible. After much head-scratching, I discovered a bug in the code that kept the redirection to the local DTD from working; we had been going to the official URL, which lives on w3.org, all along.

Presumably W3C is constantly hammered by requests for grammars which it originates, and presumably it’s fighting back by greatly lowering the priority of the worst offenders. Its server wasn’t blocking the requests altogether; that would have been easier to diagnose. The priority just got so low that most requests timed out.

Once I figured that out, I put in the fix to access the local DTD URL, and things are looking nicer now. Moving the fix to production will take a couple of days but should be routine.
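For anyone wanting to try the same kind of redirection, here is a minimal sketch using the standard JAXP `EntityResolver` hook. It is not our production code. The class name `LocalDtdDemo`, the example.org URL, and the one-line stand-in DTD are all made up for illustration; a real setup would map the official SMIL DTD URL to a copy on local disk.

```java
import java.io.StringReader;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

class LocalDtdDemo {
    // Hypothetical system ID and stand-in DTD text (assumptions for this sketch).
    static final String REMOTE_DTD = "http://example.org/dtd/sample.dtd";
    static final String LOCAL_DTD_TEXT = "<!ELEMENT smil (#PCDATA)>";

    // Counts how often the local copy was served, so the redirection can be checked.
    static int resolverCalls = 0;

    // Builds a resolver that answers known system IDs from local text.
    static EntityResolver localResolver(Map<String, String> localCopies) {
        return (publicId, systemId) -> {
            String text = localCopies.get(systemId);
            if (text == null) {
                return null; // unknown ID: the parser falls back to fetching the URL itself
            }
            resolverCalls++;
            InputSource source = new InputSource(new StringReader(text));
            source.setSystemId(systemId);
            return source;
        };
    }

    // Parses with validation on, routing DTD lookups through the resolver.
    static Document parse(String xml, EntityResolver resolver) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        builder.setEntityResolver(resolver);
        return builder.parse(new InputSource(new StringReader(xml)));
    }
}
```

The counter is there for a reason: a bug like ours goes unnoticed because the parser quietly falls back to the remote URL when the resolver returns nothing. Checking that the local copy was actually served would have caught it much sooner.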

The problem is inherent in XML: The definition of grammars is tied to a specific Web location. Aside from the problem of heavy traffic to that location, this means the longevity of the grammar is tied to the longevity of the URL. It takes extra effort to make a local copy, and anyone starting out isn’t likely to encounter throttling right away, so the law of least effort says most people won’t bother.

This got me wondering, as I started writing this post: why don’t parsers like Xerces cache grammars? It turns out that Xerces can cache grammars, though by default it doesn’t. As far as I can tell, this isn’t a well-known feature, and again the law of least effort implies that a lot of developers won’t take advantage of it. But it looks like a very useful thing. It should really be enabled by default, though I can understand why its implementers took the more cautious approach.
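To make the caching idea concrete, here is a rough sketch of a resolver that fetches each grammar once and serves it from memory afterward. This is not Xerces’s own grammar caching, which keeps parsed grammars; it is a simpler stand-in that caches the raw bytes by system ID. The names `CachingResolver` and `Fetcher` are hypothetical, and the fetcher is injected so the sketch can be exercised without touching the network.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

class CachingResolver implements EntityResolver {
    // Hypothetical hook that retrieves a grammar's bytes for a system ID.
    interface Fetcher {
        byte[] fetch(String systemId) throws IOException;
    }

    private final Fetcher fetcher;
    private final Map<String, byte[]> cache = new ConcurrentHashMap<>();

    CachingResolver(Fetcher fetcher) {
        this.fetcher = fetcher;
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) throws IOException {
        if (systemId == null) {
            return null; // nothing to key the cache on; let the parser decide
        }
        byte[] bytes = cache.get(systemId);
        if (bytes == null) {
            // First request for this grammar: fetch it once and remember it.
            bytes = fetcher.fetch(systemId);
            cache.put(systemId, bytes);
        }
        // Each call gets a fresh stream over the cached bytes.
        InputSource source = new InputSource(new ByteArrayInputStream(bytes));
        source.setSystemId(systemId);
        return source;
    }
}
```

A cache like this never expires, so it only suits grammars that don’t change, which published DTDs generally shouldn’t. It also lasts only as long as the process, so a local copy is still the sturdier answer to the longevity problem.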

More on WOFF

W3C now has a press release on WOFF, which I discussed in an earlier post. The abbreviation WOFF has now acquired a name (the rather feeble “Web Open Font Format”), and there’s a FAQ.

Previously it sounded like an interchange format to me. Now apparently it’s a format for use by Web browsers.

WOFF 1.0

W3C’s WebFonts Working Group has announced WOFF 1.0 (working draft), a format for encapsulating and compressing font data. The name WOFF apparently doesn’t stand for anything in particular. WOFF isn’t a font format apart from existing formats, but a way to package fonts on the Web. Additional metadata can be attached to a WOFF file to identify the font’s origin and restrictions.

WOFF working draft

Rule Interchange Format

W3C has announced Rule Interchange Format (RIF) as a new Recommendation. RIF is intended for porting rules (e.g., for filtering, categorization, business processes, etc.) among heterogeneous rule systems. It’s particularly aimed at the Semantic Web, as discussed here and here.

HTML5 and video

There’s an entry on the W3C blog about the state of HTML5 video. The most significant point is that “we still don’t have a baseline video codec for HTML5.” Without that, it’s silly to talk about HTML5 as an alternative to Flash or any other kind of video presentation. Microsoft is pushing H.264, and IE9 will support only H.264 under HTML5. Mozilla is going with Ogg Theora. Both codecs have patent issues, limiting the opportunities for third parties to fill in the gap. Both have enthusiastic advocates.

The Browser Wars are back.

So what is HTML 5 exactly?

Paul Cotton, co-chair from Microsoft on the W3C HTML Working Group, has some interesting comments on exactly what people mean by “HTML 5.” This may help explain some odd statements about “HTML video” which I’ve commented on in recent posts. The interview includes other remarks on the status of HTML 5.

First, I believe that most people use the term “HTML 5” to refer to the HTML 5 specification currently being worked on by the HTML WG. The HTML 5 specification defines the syntax and the semantics of the elements and attributes in the HTML markup language and several of the APIs that are used to process HTML documents. Recently the HTML WG has started to break the HTML 5 specification into more modular and separate Working Drafts e.g. HTML+RDFa, HTML Microdata, and HTML Canvas 2D Context. The HTML WG is also publishing two additional documents to aid users of HTML 5: the HTML 5 differences from HTML4 specification and HTML: The Markup Language which is aimed at developers that produce HTML 5 output.

Each of these additional Working Drafts are still part of “HTML 5” and are all on track to become separate but related W3C Recommendations or Working Group Notes. I believe that the content of these WDs taken together will define the part of “HTML 5” being worked on by the HTML WG.

But I believe that some use the term “HTML 5” to refer also to the important related API specifications being worked on by the WebApps WG. The WebApps WG is chartered to create client-side APIs that can be used with the HTML markup language – in fact some of its specifications started as part of the HTML 5 specification but were migrated over to be separate modular specifications managed by the WebApps WG. In addition there are some very interesting APIs under development by the Device APIs and Policy Working Group which are related to HTML 5 since they can be used with the HTML language and in user agents.

Others use the term “HTML 5” to also include the ECMAScript-262 Language which defines the programming language that we use today to build dynamic web applications.

XSD 1.1 reaches last call status

W3C XML Schema Definition Language 1.1 has reached the status of Last Call Working Draft. The Last Call period ends at the end of December.

HTML 5 updated

There’s a new working draft of HTML 5 available from W3C. It still has the same warning as in April: “Implementors should be aware that this specification is not stable. Implementors who are not taking part in the discussions are likely to find the specification changing out from under them in incompatible ways.”

But lots of sections have been marked “Last call for comments,” so perhaps it really is closing in on a stable version. Or perhaps not. The most widely debated issue is video codecs, and I get the impression there’s been little progress on them. The situation is, in principle, similar to the <IMG> tag, where browsers explicitly aren’t required to support any particular image format; but it would be a poor (or text-only) browser that didn’t support JPEG and GIF, at least. With video there isn’t even that much agreement. Granted, the situation is just as bad now, but HTML 4 doesn’t even address the issue, so it isn’t held back by format disputes.

I’m looking at the HTML 5 wars from a rather uninformed distance, so don’t expect expert analysis here, just impatience with how slowly things are going. According to the WHATWG Wiki, it may reach Candidate Recommendation stage in 2012. The fact that the HTML working group now has three co-chairs just strikes me as a bad sign.