
XML Schema’s designed-in denial of service attack

Recently there was a discussion on the Library of Congress’s MODS mailing list pointing out that the MODS schema uses non-canonical URIs for the xml.xsd and xlink.xsd schemas. The URI for xml.xsd simply points to a copy of the standard schema, but the xlink schema points at a modified version.

A person at LoC explained that the change to the XML URI was needed because the W3C server was being hammered with accesses by way of the MODS schema. Every time a MODS document was validated, unless the validating application used a local or cached copy, there would be an access to the W3C server. We’re told that “W3C was complaining (loudly) about excessive accesses and threatening to block certain clients.” The XLink issue is more complicated and not fully explained in the list discussion, but part of it comes down to the same problem.

The identification of XML namespaces with URIs creates a denial-of-service attack against servers that host popular schemas, as an unintended consequence of the design. Since you can’t always know which schemas will become popular, this can put a huge burden on servers that aren’t prepared for it. The URI can never move without breaking the namespace for existing documents. I’ve written here before about this problem but hadn’t realized it was so severe that it was forcing important schemas to clone namespaces. That causes obvious conflicts when a MODS element is embedded in a document that uses the standard XML namespaces.

The only solution available is for applications either to keep a permanent local copy of heavily used schemas or to cache them. Unfortunately, not all applications are going to be fixed, and not all users will upgrade to the fixed versions. So we’ll continue to see cases where schema hosts are hammered with requests and performance somewhere else suffers for reasons the users can’t guess.
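
Here’s a minimal sketch in Java of the local-copy approach, using the standard javax.xml.validation API. The file names below are placeholders, and schemas imported by absolute URL may still be fetched remotely unless a resource resolver is installed as well:

import java.io.File;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;

public class LocalSchemaValidation {
    public static void main(String[] args) throws Exception {
        // Compile the schema from a local copy instead of a remote URL.
        // The paths are placeholders for wherever the local copy lives.
        SchemaFactory factory =
            SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Schema schema = factory.newSchema(new File("/local/schemas/mods.xsd"));

        // Validate a document against the locally compiled schema;
        // no request goes to the schema's home server for this step.
        Validator validator = schema.newValidator();
        validator.validate(new StreamSource(new File("record.xml")));
        System.out.println("record.xml is valid");
    }
}

XML catalogs and caching resolvers accomplish the same thing without hard-coding paths in the application.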

EXI is a W3C recommendation

Efficient XML Interchange, or EXI, the controversial binary representation of XML, is now a W3C standard. Unlike approaches that apply standard compression schemes to XML (e.g., Open Office’s XML plus ZIP), Efficient XML represents the structure of an XML document in a binary form. For some, this adds unnecessary obscurity to a format based on (somewhat) human-readable text. Others consider it a necessary step toward reducing the bloat and slow processing of textual XML.

The press release says: “EXI is a very compact representation of XML information, making it ideal for use in smart phones, devices with memory or bandwidth constraints, in performance sensitive applications such as sensor networks, in consumer electronics such as cameras, in automobiles, in real-time trading systems, and in many other scenarios.”

There are some things that can be done in XML but not in EXI. The W3C document says: “EXI is designed to be compatible with the XML Information Set. While this approach is both legitimate and practical for designing a succinct format interoperable with XML family of specifications and technologies, it entails that some lexical constructs of XML not recognized by the XML Information Set are not represented by EXI, either. Examples of such unrepresented lexical constructs of XML include white space outside the document element, white space within tags, the kind of quotation marks (single or double) used to quote attribute values, and the boundaries of CDATA marked sections.” Whether this is important will doubtless continue to be the subject of heated debate.

HTML5, just three years away

According to the latest version of the HTML Working Group Charter, HTML5 will become a W3C recommendation in 2014.

Smart money is on the AES audio metadata schema being made public first, but I wouldn’t be too sure.

The HTML5 logo again

In an earlier post, I questioned how W3C’s new HTML5 logo could help provide a “consistent, standardized visual vocabulary” when it stood for nothing in particular. Others have taken even stronger positions than mine, and W3C has backtracked. The HTML5 logo now stands for HTML5, not for HTML5, CSS3, H.264, and every other “cool” technology showing up on the web these days.

It’s still, as I noted, not a mark of conformance or certification, so its use on a website proves nothing, but at least now what it’s claiming to say is clearer.

SourceForge security incident and doppelgänger characters

This morning I got an email from SourceForge saying that all passwords had been reset because of a password-sniffing incident. Naturally, I’m suspicious of all email of this kind, but I do have a SourceForge account. So rather than follow any of the links in the mail, I tried to log in normally and found that passwords really had been reset. I followed the procedure for resetting by email, and my account’s working again.

I’m sure some of you reading this also have SourceForge accounts, so this bit of reassurance may be helpful, especially if your phishing filters (philters?) kept you from seeing the notice in the first place. It’s likely some fakers will set up scams to take advantage of this issue, so always go to the SourceForge website by typing in the URL or using a bookmark, rather than by following a link from email. It’s easy to mistake a near-lookalike URL on a quick glance.

Worse yet (yes, this post has something to do with formats), there are now exact lookalike URL’s, thanks to the unfortunate policy of allowing Unicode in URL’s. There are numerous cases where characters in non-English character sets normally look just like letters of the Roman alphabet. Someone could, in principle, register sourceforgе.net, which looks just like sourceforge.net — but do a local text search for “sourceforge” in your browser, and you’ll notice the first “sourceforgе.net” (and this one) are skipped over. The sixth letter isn’t the ASCII letter “e” but the Russian letter “e,” which usually looks the same or very nearly.

If your browser doesn’t have a Cyrillic font, you may be seeing a placeholder glyph instead. Or if it views the page in Latin-1 instead of UTF-8, you may see a Capital D followed by a Greek lower-case mu.
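
For the curious, here’s a small Java illustration of how software can tell the two apart even when the eye can’t. The host names are made-up examples; the point is that the Cyrillic letter is a different code point, and java.net.IDN’s Punycode conversion makes the substitution plain:

import java.net.IDN;

public class LookalikeDemo {
    public static void main(String[] args) {
        String genuine = "sourceforge.net";          // all ASCII
        String spoofed = "sourc\u0435forge.net";     // sixth letter is U+0435, CYRILLIC SMALL LETTER IE

        // The two render almost identically but are not equal strings.
        System.out.println(genuine.equals(spoofed)); // false

        // Punycode conversion (RFC 3490) exposes the non-ASCII label.
        System.out.println(IDN.toASCII(genuine));    // sourceforge.net
        System.out.println(IDN.toASCII(spoofed));    // an xn-- form, visibly different
    }
}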

With any email that offers to correct a password issue, exercise extreme caution, even though some are legitimate.

LOC irony

The Library of Congress Digital Preservation Newsletter (latest issue, subscription page) has some very nice content, but it’s ironic that the newsletter is delivered under the nondescript file name 201101.pdf and that (if JHOVE is right) it doesn’t conform to PDF/A. A PDF/A document can’t have external links, so the lack of conformance is excusable; it’s the meaningless file name that bugs me more from a preservation standpoint.

I can’t find an editorial contact address on the newsletter to mention this to.

Secrets of building JHOVE2

The current beta of JHOVE2 is rather tricky to build. With some help from Marisa Strong, I’ve managed to do it. Here’s a guide which may be helpful.

1. Download JHOVE2. If you have Mercurial, follow the instructions. Otherwise use the “Get Source” menu item to get the .gz file.

2. Get a current version of Maven if you don’t have one.

3. If you got the gzip file, expand it and then the tarball it contains. This will create a directory called main.

4. cd main. The first recommendation is to run mvn compile, but this apparently requires an environment that isn’t released yet, so instead do

mvn assembly:assembly -DskipTests

5. cd into the target directory. This will have the file jhove2-2.0.0.zip. Unzip this in place.

6. The previous step created the directory jhove2-2.0.0; cd into it. It contains the script jhove2.sh. Run it from the command line with no arguments, and you’ll get a usage message if everything worked correctly.

To do stuff with JHOVE2, the user guide (PDF) is helpful.

Misadventures in XML

Around 6 PM yesterday, our SMIL file delivery broke. At first I figured it for a database connection problem, but the log entries were atypical. I soon determined that retrieval of the SMIL DTD was regularly failing. Most requests would get an error, and those that did succeed took over a minute.

There’s a basic flaw in XML DTDs and schemas (collectively called grammars). They’re identified by a URL, and by default any parser that validates documents against their grammar retrieves it from that URL. For popular ones, that means a lot of traffic. We’ve run into that problem with the JHOVE configuration schema, and that’s nowhere near the traffic a really popular schema must generate.

Knowing this, and also knowing that depending on an outside website’s staying up is a bad idea, we’ve made our own local copy of the SMIL DTD to reference. So I was extremely puzzled about why access to it had become so terrible. After much headscratching, I discovered a bug in the code that kept the redirection to the local DTD from working; we had been going to the official URL, which lives on w3.org, all along.

Presumably W3C is constantly hammered by requests for grammars which it originates, and presumably it’s fighting back by greatly lowering the priority of the worst offenders. Its server wasn’t blocking the requests altogether; that would have been easier to diagnose. The priority just got so low that most requests timed out.

Once I figured that out, I put in the fix to access the local DTD URL, and things are looking nicer now. Moving the fix to production will take a couple of days but should be routine.
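
For anyone who needs to do the same kind of redirection, the usual approach in Java is a SAX EntityResolver along these lines. The DTD file name and local path here are placeholders, not our actual setup:

import java.io.FileReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

public class LocalDtdExample {
    public static void main(String[] args) throws Exception {
        XMLReader reader =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();

        // Redirect requests for the SMIL DTD to a local copy instead of w3.org.
        reader.setEntityResolver(new EntityResolver() {
            @Override
            public InputSource resolveEntity(String publicId, String systemId)
                    throws java.io.IOException {
                if (systemId != null && systemId.endsWith("SMIL20.dtd")) {
                    return new InputSource(new FileReader("/local/dtds/SMIL20.dtd"));
                }
                return null; // anything else gets the default resolution
            }
        });

        reader.parse(new InputSource("presentation.smil"));
    }
}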

The problem is inherent in XML: the definition of a grammar is tied to a specific Web location. Aside from the heavy traffic this sends there, it means the longevity of the grammar is tied to the longevity of the URL. It takes extra effort to make a local copy, and anyone starting out isn’t likely to run into throttling right away, so the law of least effort says most people won’t bother.

As I started writing this post, I got to wondering why parsers like Xerces don’t cache grammars. It turns out that Xerces can cache grammars, though by default it doesn’t. As far as I can tell, this isn’t a well-known feature, and again the law of least effort implies that a lot of developers won’t take advantage of it. But it looks like a very useful thing. It should really be enabled by default, though I can understand why its implementers took the more cautious approach.
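
From my reading of the Xerces documentation, the simplest way to turn caching on is to construct the parser with a grammar-caching configuration, roughly like this (a sketch, not tested code; the file names are placeholders):

import org.apache.xerces.parsers.DOMParser;
import org.apache.xerces.parsers.XMLGrammarCachingConfiguration;

public class CachingParserExample {
    public static void main(String[] args) throws Exception {
        // A parser configuration that keeps grammars in a pool between parses.
        DOMParser parser = new DOMParser(new XMLGrammarCachingConfiguration());
        parser.setFeature("http://xml.org/sax/features/validation", true);

        // The DTD or schema is fetched and compiled on the first parse...
        parser.parse("first.xml");
        // ...and reused from the cache on later ones.
        parser.parse("second.xml");
    }
}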

JHOVE2 goes to beta

The JHOVE2 team has announced a beta release:

This beta code release supports all the major technical objectives of the project, including a more sophisticated, modular architecture; signature-based file identification; policy-based assessment of objects; recursive characterization of objects comprising aggregate files and files arbitrarily nested in containers; and extensive configuration and reporting options. The release also continues to fill out the roster of supported formats, with modules for ICC color profiles, SGML, Shapefile, TIFF, UTF-8, WAVE, and XML.

The source code page provides the source as a Mercurial repository or as a single download. The gzip download expands into a file called main-14e8a6102f63, and it isn’t at all obvious what to do with it. Chmodding it to executable and running it doesn’t work. I’ve asked what this is supposed to be; I’ll update this post when I get a response.

Update: That’s a tarball. Adding the .tar extension and using tar -xvf works nicely.