Tying XML schemas to URIs was the worst mistake in the history of XML. Once you publish a schema URI and people start using it, you can’t change it without major disruption.
URIs aren’t permanent. Domains can disappear or change hands. Even subdomains can vanish with organizational changes. When I was at Harvard, I offered repeated reminders that
hul.harvard.edu can’t go away with the deprecation of the name “Harvard University Library/Libraries,” since it houses schemas for JHOVE and other applications. Time will tell whether it will stay.
Strictly speaking, a URI is a Uniform Resource identifier and has no obligation to correspond to a web page; W3C says a URI as a schema identifier is only a name. In practice, treating it as a URL may be the only way to locate the XSD. When a URI uses the
http scheme, it’s an invitation to use it as a URL.
Even if a domain doesn’t go away, it can be burdened with schema requests beyond its hosting capacity. The Harvard Library has been trying to get people to upgrade to the current version of JHOVE, which uses an entity resolver, but its server was, the last I checked, still heavily hit by three sites that hadn’t upgraded. They don’t pay anything, so there’s no money to put into more server capacity.
The best solution available is for software to resolve schema names to local copies (e.g. with Java’s
EntityResolver). This solution often doesn’t occur to people until there’s a problem, though, and by then there may be lots of copies of the old software out in the field.
For archival storage, keeping a copy of any needed schema files should be a requirement. Resources inevitably disappear from the Web, including schemas. My impression is that a lot of digital archives don’t have such a rule and blithely assume that the resources will be available on the Web forever. This is a risk which could be eliminated at virtually zero cost, and it should be, but my impression is that a lot of archives don’t do this.
It’s legitimate to stop making a URI usable as a URL, though it may be rude. W3C’s Namespaces in XML 1.0 says: “The namespace name, to serve its intended purpose, SHOULD have the characteristics of uniqueness and persistence. It is not a goal that it be directly usable for retrieval of a schema (if any exists).” (Emphasis added) That implies that any correct application really should do its own URI resolution.
One thing that isn’t legitimate, but I’ve occasionally seen, is replacing a schema with a new and incompatible version under the same URI. That can cause serious trouble for files that use the old schema. A new version of a schema needs to have a new URI.
The schema situation creates problems for hosting sites, applications, and archives. It’s vital to remember that you can’t count on the URI’s being a valid URL in the long term.
If you’ve got one of those old versions of JHOVE (1.5 and older, I think), please upgrade. The new versions are a lot less buggy anyway.
OOXML: The good and the bad
An article by Markus Feilner presents a very critical view of Microsoft’s Open Office XML as it currently stands. There are three versions of OOXML — ECMA, Transitional, and Strict. All of them use the same extensions, and there’s no easy way for the casual user to tell which variant a document is. If a Word document is created on one computer in the Strict format, then edited on another machine with an older version of Word, it may be silently downgraded to Transitional, with resulting loss of metadata or other features.
On the positive side, Microsoft has released the Open XML SDK as open source on Github. This is at least a partial answer to Feilner’s complaint that “there are no free and open source solutions that fully support OOXML.”
Incidentally, I continue to hate Microsoft’s use of the deliberately confusing term “Open XML” for OOXML.
Thanks to @willpdp for tweeting the links referenced here.
Posted in commentary
Tagged Microsoft, standards, XML