The URI namespace problem

Tying XML schemas to URIs was the worst mistake in the history of XML. Once you publish a schema URI and people start using it, you can’t change it without major disruption.

URIs aren’t permanent. Domains can disappear or change hands. Even subdomains can vanish with organizational changes. When I was at Harvard, I repeatedly reminded people that the library’s subdomain couldn’t go away when the name “Harvard University Library/Libraries” was deprecated, since it hosts schemas for JHOVE and other applications. Time will tell whether it will stay.

Strictly speaking, a URI is a Uniform Resource Identifier and has no obligation to correspond to a web page; W3C says a URI used as a schema identifier is only a name. In practice, treating it as a URL may be the only way to locate the XSD. When a URI uses the http scheme, it’s an invitation to use it as a URL.

Even if a domain doesn’t go away, it can be burdened with schema requests beyond its hosting capacity. The Harvard Library has been trying to get people to upgrade to the current version of JHOVE, which uses an entity resolver, but its server was, the last I checked, still heavily hit by three sites that hadn’t upgraded. They don’t pay anything, so there’s no money to put into more server capacity.

The best solution available is for software to resolve schema names to local copies (e.g. with Java’s EntityResolver). This solution often doesn’t occur to people until there’s a problem, though, and by then there may be lots of copies of the old software out in the field.
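Here is a minimal sketch of what such a resolver might look like, using the standard org.xml.sax.EntityResolver interface. The class name LocalSchemaResolver and the map of URIs to local files are my own illustration, not JHOVE’s actual code, and whether a given parser consults the resolver for XSD files (rather than only DTDs and external entities) depends on the parser and how validation is configured.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

/** Hypothetical resolver that serves known schema URIs from local files. */
final class LocalSchemaResolver implements EntityResolver {
    // Assumed mapping: schema URI as it appears in documents -> local copy on disk.
    private final Map<String, Path> localCopies;

    LocalSchemaResolver(Map<String, Path> localCopies) {
        this.localCopies = localCopies;
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) throws IOException {
        Path local = localCopies.get(systemId);
        if (local == null) {
            // Unknown URI: returning null tells the parser to use its default behavior,
            // which normally means fetching the URI over the network.
            return null;
        }
        InputSource source = new InputSource(Files.newInputStream(local));
        // Keep the original URI as the system ID so the name stays the same
        // even though the bytes come from the local copy.
        source.setSystemId(systemId);
        return source;
    }
}
```

A resolver like this would typically be installed with XMLReader.setEntityResolver or DocumentBuilder.setEntityResolver before parsing. A stricter variant could throw an exception instead of returning null, so that the software never goes out to the network at all.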

For archival storage, keeping a copy of any needed schema files should be a requirement. Resources inevitably disappear from the Web, including schemas. My impression is that a lot of digital archives don’t have such a rule and blithely assume that the resources will be available on the Web forever. This is a risk that could be eliminated at virtually zero cost, and it should be.
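As a rough illustration of how an archive might find out which schemas a file depends on, the sketch below lists the locations named in the root element’s xsi:schemaLocation and xsi:noNamespaceSchemaLocation attributes. The class and method names are hypothetical, and it assumes the hints sit on the root element; real documents can also declare them on nested elements, and schemas can import other schemas, so a production tool would need to go further.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

/** Hypothetical helper that lists the schema locations a document points to. */
final class SchemaLocations {
    static final String XSI_NS = "http://www.w3.org/2001/XMLSchema-instance";

    static List<String> fromRootElement(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // needed to look up attributes by namespace
        DocumentBuilder builder = factory.newDocumentBuilder();
        Element root = builder.parse(new InputSource(new StringReader(xml)))
                              .getDocumentElement();

        List<String> locations = new ArrayList<>();

        // xsi:schemaLocation holds whitespace-separated pairs:
        // namespace name, then the location hint for that namespace.
        String pairs = root.getAttributeNS(XSI_NS, "schemaLocation").trim();
        if (!pairs.isEmpty()) {
            String[] tokens = pairs.split("\\s+");
            for (int i = 1; i < tokens.length; i += 2) {
                locations.add(tokens[i]);
            }
        }

        // xsi:noNamespaceSchemaLocation holds a single location hint.
        String single = root.getAttributeNS(XSI_NS, "noNamespaceSchemaLocation").trim();
        if (!single.isEmpty()) {
            locations.add(single);
        }
        return locations;
    }
}
```

Each location returned is a candidate for a file the archive should fetch once and store alongside the document.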

It’s legitimate to stop making a URI usable as a URL, though it may be rude. W3C’s Namespaces in XML 1.0 says: “The namespace name, to serve its intended purpose, SHOULD have the characteristics of uniqueness and persistence. It is not a goal that it be directly usable for retrieval of a schema (if any exists).” (Emphasis added) That implies that any correct application really should do its own URI resolution.

One thing that isn’t legitimate, but I’ve occasionally seen, is replacing a schema with a new and incompatible version under the same URI. That can cause serious trouble for files that use the old schema. A new version of a schema needs to have a new URI.

The schema situation creates problems for hosting sites, applications, and archives. It’s vital to remember that you can’t count on the URI’s being a valid URL in the long term.

If you’ve got one of those old versions of JHOVE (1.5 and older, I think), please upgrade. The new versions are a lot less buggy anyway.
