Tying XML schemas to URIs was the worst mistake in the history of XML. Once you publish a schema URI and people start using it, you can’t change it without major disruption.
URIs aren’t permanent. Domains can disappear or change hands. Even subdomains can vanish with organizational changes. When I was at Harvard, I offered repeated reminders that hul.harvard.edu
can’t go away with the deprecation of the name “Harvard University Library/Libraries,” since it houses schemas for JHOVE and other applications. Time will tell whether it will stay.
Strictly speaking, a URI is a Uniform Resource identifier and has no obligation to correspond to a web page; W3C says a URI as a schema identifier is only a name. In practice, treating it as a URL may be the only way to locate the XSD. When a URI uses the http
scheme, it’s an invitation to use it as a URL.
Even if a domain doesn’t go away, it can be burdened with schema requests beyond its hosting capacity. The Harvard Library has been trying to get people to upgrade to the current version of JHOVE, which uses an entity resolver, but its server was, the last I checked, still heavily hit by three sites that hadn’t upgraded. They don’t pay anything, so there’s no money to put into more server capacity.
The best solution available is for software to resolve schema names to local copies (e.g. with Java’s EntityResolver
). This solution often doesn’t occur to people until there’s a problem, though, and by then there may be lots of copies of the old software out in the field.
For archival storage, keeping a copy of any needed schema files should be a requirement. Resources inevitably disappear from the Web, including schemas. My impression is that a lot of digital archives don’t have such a rule and blithely assume that the resources will be available on the Web forever. This is a risk which could be eliminated at virtually zero cost, and it should be, but my impression is that a lot of archives don’t do this.
It’s legitimate to stop making a URI usable as a URL, though it may be rude. W3C’s Namespaces in XML 1.0 says: “The namespace name, to serve its intended purpose, SHOULD have the characteristics of uniqueness and persistence. It is not a goal that it be directly usable for retrieval of a schema (if any exists).” (Emphasis added) That implies that any correct application really should do its own URI resolution.
One thing that isn’t legitimate, but I’ve occasionally seen, is replacing a schema with a new and incompatible version under the same URI. That can cause serious trouble for files that use the old schema. A new version of a schema needs to have a new URI.
The schema situation creates problems for hosting sites, applications, and archives. It’s vital to remember that you can’t count on the URI’s being a valid URL in the long term.
If you’ve got one of those old versions of JHOVE (1.5 and older, I think), please upgrade. The new versions are a lot less buggy anyway.
Canvas fingerprinting, the technical stuff
The ability of websites to bypass privacy settings with “canvas fingerprinting” has caused quite a bit of concern, and it’s become a hot topic on the Code4lib mailing list. Let’s take a quick look at it from a technical standpoint. It is genuinely disturbing, but it’s not the unstoppable form of scrutiny some people are hyping it as.
The best article to learn about it from is “Pixel Perfect: Fingerprinting Canvas in HTML5,” by Keaton Mowery and Hovav Shacham at UCSD. It describes the basic technique and some implementation details.
Canvas fingerprinting is based on the <canvas> HTML element. It’s been around for a decade but was standardized for HTML5. In itself, <canvas> does nothing but define a blank drawing area with a specified width and height. It isn’t even like the <div> element, which you can put interesting stuff inside; if all you use is unscripted HTML, all you get is some blank space. To draw anything on it, you have to use JavaScript. There are two APIs available for this: the 2D DOM Canvas API and the 3D WebGL API. The DOM API is part of the HTML5 specification; WebGL relies on hardware acceleration and is less widely supported.
Either API lets you draw objects, not just pixels, to a browser. These include geometric shapes, color gradients, and text. The details of drawing are left to the client, so they will be drawn slightly differently depending on the browser, operating system, and hardware. This wouldn’t be too exciting, except that the API can read the pixels back. The
getImageData
method of the 2D context returns anImageData
object, which is a pixel map. This can be serialized (e.g., as a PNG image) and sent back to the server from which the page originated. For a given set of drawing commands and hardware and software configuration, the pixels are consistent.Drawing text is one way to use a canvas fingerprint. Modern browsers use a programmatic description of a font rather than a bitmap, so that characters will scale nicely. The fine details of how edges are smoothed and pixels interpolated will vary, perhaps not enough for any user to notice, but enough so that reading back the pixels will show a difference.
However, the technique isn’t as frightening as the worst hype suggests. First, it doesn’t uniquely identify a computer. Two machines that have the same model and come from the same shipment, if their preinstalled software hasn’t been modified, should have the same fingerprint. It has to be used together with other identifying markers to narrow down to one machine. There are several ways for software to stop it, including blocking JavaScript from offending domains and disabling part or all of the Canvas API. What gets people upset is that neither blocking cookies nor using a proxy will stop it.
Was including
getImageData
in the spec a mistake? This can be argued both ways. Its obvious use is to draw a complex canvas once and then rubber-stamp it if you want it to appear multiple times; this can be faster than repeatedly drawing from scratch. It’s unlikely, though, that the designers of the spec thought about its privacy implications.Comments Off on Canvas fingerprinting, the technical stuff
Posted in commentary
Tagged DOM, HTML, html5, JavaScript, W3C