Monthly Archives: March 2011

XML Schema’s designed-in denial of service attack

Recently there was a discussion on the Library of Congress’s MODS mailing list, pointing out that the MODS Schema uses non-canonical URIs for the xml.xsd and xlink.xsd schemas. The URI for xml.xsd simply points to a copy of the standard schema, but the xlink schema points at a modified version.

A person at LoC explained that the change to the XML URI was needed because validations of MODS documents were hammering the W3C server with requests. Every time a MODS document was validated, unless the validating application used a local or cached copy, there would be an access to the W3C server. We’re told that “W3C was complaining (loudly) about excessive accesses and threatening to block certain clients.” The XLink issue is more complicated and not fully explained in the list discussion, but the same load problem was part of it.

The identification of XML namespaces with URIs creates a denial-of-service attack against servers that host popular schemas, as an unintended consequence of the design. Since you can’t always know which schemas will become popular, this can create a huge burden on servers that aren’t prepared for it. The URI can never move without breaking the namespace for existing documents. I’ve written here before about this problem but hadn’t known it was so severe that it was forcing important schemas to clone namespaces. Such cloning causes obvious conflicts when a MODS element is embedded within a document that uses the standard XML namespaces.

The only solution available is for applications either to keep a permanent local copy of heavily used schemas or to cache them. Unfortunately, not all applications are going to be fixed, and not all users will upgrade to the fixed versions. So we’ll continue to see cases where schema hosts are hammered with requests and performance somewhere else suffers for reasons the users can’t guess.
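As a rough illustration of the caching approach, here is a minimal Python sketch. The class name SchemaCache, the hashed-filename layout, and the fake_fetch function are all assumptions made up for this example; they are not taken from MODS, the W3C, or any particular validator. The fetch step is passed in as a function so the sketch runs without touching the network.

```python
import hashlib
import tempfile
from pathlib import Path


class SchemaCache:
    """Hypothetical on-disk cache: each schema URI is fetched at most once."""

    def __init__(self, cache_dir, fetch):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.fetch = fetch  # callable taking a URI and returning bytes

    def _path_for(self, uri):
        # Hash the URI so any schema location maps to a safe file name.
        digest = hashlib.sha256(uri.encode("utf-8")).hexdigest()
        return self.cache_dir / (digest + ".xsd")

    def get(self, uri):
        path = self._path_for(uri)
        if path.exists():
            # Cache hit: the remote server is never contacted.
            return path.read_bytes()
        data = self.fetch(uri)
        path.write_bytes(data)
        return data


# Stand-in for a real download, so we can count how often the "server" is hit.
calls = []


def fake_fetch(uri):
    calls.append(uri)
    return b"<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'/>"


with tempfile.TemporaryDirectory() as tmp:
    cache = SchemaCache(tmp, fake_fetch)
    first = cache.get("http://www.w3.org/2001/xml.xsd")
    second = cache.get("http://www.w3.org/2001/xml.xsd")

print(len(calls))  # 1: the second request was served from the local copy
```

In a real application the fetch function might wrap something like urllib.request.urlopen, and the cache would need some policy for refreshing stale copies; both are left out here to keep the sketch short.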

EXI is a W3C Recommendation

Efficient XML Interchange (EXI), the controversial binary representation of XML, is now a W3C standard. Unlike approaches which apply standard compression schemes to XML (e.g., OpenOffice’s XML plus ZIP), Efficient XML represents the structure of an XML document in a binary form. For some, this adds unnecessary obscurity to a format based on (somewhat) human-readable text. Others consider it a necessary step to reduce the bloat and slow processing of text XML.

The press release says: “EXI is a very compact representation of XML information, making it ideal for use in smart phones, devices with memory or bandwidth constraints, in performance sensitive applications such as sensor networks, in consumer electronics such as cameras, in automobiles, in real-time trading systems, and in many other scenarios.”

There are some things that can be done in XML but not in EXI. The W3C document says: “EXI is designed to be compatible with the XML Information Set. While this approach is both legitimate and practical for designing a succinct format interoperable with XML family of specifications and technologies, it entails that some lexical constructs of XML not recognized by the XML Information Set are not represented by EXI, either. Examples of such unrepresented lexical constructs of XML include white space outside the document element, white space within tags, the kind of quotation marks (single or double) used to quote attribute values, and the boundaries of CDATA marked sections.” Whether this is important will doubtless continue to be the subject of heated debate.
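To make the quoted list of lexical differences concrete, here is a small Python sketch. It does not use EXI at all; it only shows that an ordinary XML parser already discards these distinctions. The helper infoset_view is a name invented for this example, and ElementTree's data model is only an approximation of the XML Information Set, so treat this as an illustration rather than a conformance test.

```python
import xml.etree.ElementTree as ET


def infoset_view(xml_text):
    """Reduce a document to nested (tag, attributes, text, children) tuples."""

    def walk(elem):
        return (
            elem.tag,
            tuple(sorted(elem.attrib.items())),
            elem.text or "",
            tuple(walk(child) for child in elem),
        )

    return walk(ET.fromstring(xml_text))


# Single-quoted attribute, a CDATA section, and white space after the root.
doc_a = "<note lang='en'><![CDATA[a < b]]></note>\n\n"

# Double-quoted attribute, extra white space inside the tag, escaped text.
doc_b = '<note   lang="en"  >a &lt; b</note>'

same = infoset_view(doc_a) == infoset_view(doc_b)
print(same)  # True: the parser sees the same element, attribute, and text
```

The two strings differ byte for byte, yet nothing in the parsed result records which one was read. A format built on that parsed view cannot round-trip the original quoting or CDATA boundaries.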

JHOVE2 tutorial at IS&T Archiving

Forwarded from Stephen Abrams:

The JHOVE2 project team will be presenting a one-day tutorial on the use of JHOVE2 at the IS&T Archiving conference on May 16.


JHOVE2 is an open source framework and application for next generation format-aware characterization of digital objects. Characterization is the process of deriving representation information about a formatted digital object that is indicative of its significant nature and useful for purposes of classification, analysis, and use in digital curation, preservation, and repository contexts. JHOVE2 builds on the success of the original JHOVE characterization tool by addressing known limitations and offering significant new functions, including: object-focused, rather than file-focused, characterization; signature-based file level identification using DROID; aggregate-level identification based on configurable file system naming conventions; rules-based assessment to support determinations of object acceptability in addition to validation conformity; and extensive user configuration options.

The 2011 release of JHOVE2 represents the availability of a significant new tool for digital preservation; this course will provide a broad overview of JHOVE2, as well as detailed information on its functionality, architecture, use in local workflows, and open source community.

Course Objectives:

This short course will give attendees both a broad conceptual overview and detailed information on JHOVE2, and equip them to use the open source tool in their local environments. Specifically, the course will:

  • Define the role of file characterization, including identification, feature extraction, validation, and assessment, in digital curation and preservation workflows.
  • Review the functionality of the JHOVE2 application, including the significant enhancements relative to JHOVE, and new capabilities based on object- and aggregate-level characterization.
  • Detail the architecture, componentry, design patterns and Java APIs of the JHOVE2 framework, as well as the configuration options for plug-in modules, characterization strategies and results formatting.
  • Demonstrate the use of JHOVE2’s new rule-based assessment capabilities, and how to integrate these into local workflows to determine object acceptability.
  • Cover the community framework for the project, and how individual institutions can contribute both new format modules and resources to help extend and sustain the open source project.

Intended Audience:

This course is designed for technologists and practitioners (developers, managers, analysts and administrators) engaged in digital curation, preservation, and repository activities, and whose work is dependent on an understanding of the format and pertinent characteristics of digital assets.

HTML5, just three years away

According to the latest version of the HTML Working Group Charter, HTML5 will become a W3C recommendation in 2014.

Smart money is on the AES audio metadata schema being made public first, but I wouldn’t be too sure.