I’m surprised I only got a complaint about this recently. Using JHOVE to validate XHTML files is often painfully slow. In fact, using anything to validate them without caching or redirection of DTDs would be painfully slow. The DOCTYPE declaration brings in the standard XHTML DTD, and it in turn brings in lots of other DTDs. These all have URLs on w3.org. As you can imagine, this is a lot of traffic converging in one place, and the response is often very slow.
JHOVE has a remedy, but it turns out not to work in this case. In the configuration file, you can declare local copies of schemas and DTDs to be loaded by the SAX entity resolver. A declaration looks something like this:
<module>
  <class>edu.harvard.hul.ois.jhove.module.XmlModule</class>
  <param>schema=http://www.w3.org/TR/REC-smil/SMIL10.dtd;/Users/gmcgath/schemas/SMIL10.dtd</param>
</module>
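For readers who haven’t worked with SAX entity resolvers, here is a minimal sketch of the general idea behind that kind of mapping. It is not JHOVE’s actual code: the class LocalDtdResolver, its register method, and the example.invalid URL are all made up for illustration, and it assumes Java 11 or later.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical resolver (not JHOVE's): serves registered system IDs from local files. */
class LocalDtdResolver implements EntityResolver {
    private final Map<String, Path> localCopies = new HashMap<>();

    /** Maps a remote system ID (URL) to a local copy of the same file. */
    void register(String systemId, Path localFile) {
        localCopies.put(systemId, localFile);
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) throws IOException {
        Path local = localCopies.get(systemId);
        if (local == null) {
            return null; // not registered: the parser falls back to fetching the URL
        }
        InputSource source = new InputSource(Files.newInputStream(local));
        source.setPublicId(publicId);
        source.setSystemId(systemId); // keep the original URL as the entity's identity
        return source;
    }

    /** Parses the XML with the given resolver installed and returns its character data. */
    static String textOf(String xml, EntityResolver resolver) {
        StringBuilder text = new StringBuilder();
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setEntityResolver(resolver);
            reader.setContentHandler(new DefaultHandler() {
                @Override
                public void characters(char[] ch, int start, int length) {
                    text.append(ch, start, length);
                }
            });
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (SAXException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
        return text.toString();
    }

    /** Demo with a throwaway DTD; the example.invalid URL is never contacted. */
    static String demo() {
        String url = "http://example.invalid/dtds/tiny.dtd"; // made-up URL
        try {
            Path dtd = Files.createTempFile("tiny", ".dtd");
            Files.writeString(dtd, "<!ELEMENT doc (#PCDATA)>\n<!ENTITY greeting \"hello\">\n");
            LocalDtdResolver resolver = new LocalDtdResolver();
            resolver.register(url, dtd);
            String xml = "<!DOCTYPE doc SYSTEM \"" + url + "\"><doc>&greeting;</doc>";
            return textOf(xml, resolver);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The entity declared in the local DTD is expanded without any network request. Handing the original URL back as the system ID is a design choice in this sketch: a parser will typically resolve relative references inside the DTD against that URL, so each nested DTD would need its own registration.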
Unfortunately, there are some problems in JHOVE 1.9. The HTML module processes XHTML files by passing them to the XML module. In this case, the XML module doesn’t get the parameters that the config file declared for it. In JHOVE 1.10, I’ll fix this by having the HTML module pass its own parameters to the XML module. At present, JHOVE’s processing of XHTML files makes no use of the configuration file’s instructions to the entity resolver.
There’s another complication. The XHTML DTD invokes other DTDs, and JHOVE has to get every one of those in turn. Some of them have relative URLs to other DTDs; these break when they’re redirected to local files. Even making local copies of all the files doesn’t work, as JHOVE doesn’t handle the relative URLs correctly within the file system, and making them work would require changing some existing assumptions. The best fix for the user is to get JHOVE 1.10 when it’s ready (version 1.10B2 doesn’t have the XHTML fix yet) and then edit all the local DTD files so that all their URLs are absolute.
This is a big chunk of work, and I haven’t tested the approach fully. Any ideas on how this might be better handled would be appreciated.
I can’t remember, but I assume JHOVE implements its own XML parser, rather than being able to rely on the XML Catalog functionality of the underlying SAX parser provided by the JVM? http://en.wikipedia.org/wiki/XML_Catalog
No, JHOVE uses SAX; there are limits to the “not invented here” mindset, even here. :) But it has its own SAX handler, including an entity resolver.
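For context on the XML Catalog idea raised above, here is a rough sketch of catalog-based resolution using the javax.xml.catalog API, which only arrived with Java 9 and so postdates this exchange. It does not reflect anything JHOVE does; the class CatalogSketch, the temporary files, and the example.invalid URL are assumptions for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.catalog.CatalogFeatures;
import javax.xml.catalog.CatalogManager;
import javax.xml.catalog.CatalogResolver;

/** Hypothetical demo class: maps a remote DTD URL to a local file via an XML catalog. */
class CatalogSketch {
    // Made-up URL standing in for a DTD hosted on a slow remote server.
    static final String REMOTE = "http://example.invalid/dtds/tiny.dtd";

    /** Writes a one-entry catalog plus a local DTD, then asks the resolver for the mapping. */
    static String resolveLocally() {
        try {
            Path dir = Files.createTempDirectory("catalog-demo");
            Files.writeString(dir.resolve("tiny.dtd"), "<!ELEMENT doc (#PCDATA)>\n");
            Path catalog = dir.resolve("catalog.xml");
            Files.writeString(catalog,
                "<?xml version=\"1.0\"?>\n"
                + "<catalog xmlns=\"urn:oasis:names:tc:entity:xmlns:xml:catalog\">\n"
                // The uri attribute is relative to the catalog file's own location.
                + "  <system systemId=\"" + REMOTE + "\" uri=\"tiny.dtd\"/>\n"
                + "</catalog>\n");
            CatalogResolver resolver =
                CatalogManager.catalogResolver(CatalogFeatures.defaults(), catalog.toUri());
            // The returned InputSource carries the local file URI as its system ID.
            return resolver.resolveEntity(null, REMOTE).getSystemId();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(resolveLocally());
    }
}
```

A CatalogResolver is itself a SAX EntityResolver, so in principle it could be passed to XMLReader.setEntityResolver in place of a hand-written one.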
This post has resulted in quite a discussion on Twitter. Rather than trying to reply to everything in 140-character chunks, I’ll address the most important points here.
A workaround that was proposed is to use a proxy that will cache HTTP requests. This can be done using the http.proxyHost and http.proxyPort Java system properties, or perhaps something like this example. If you’ve got a proxy available that will do the caching, that’s a possible solution.
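As a small sketch of the proxy approach, the two properties can be set programmatically before any HTTP request is made, or equivalently on the command line with -Dhttp.proxyHost=... and -Dhttp.proxyPort=.... The host proxy.example.org, port 3128, and the class ProxySketch are placeholders, not a real caching proxy.

```java
import java.net.Proxy;
import java.net.ProxySelector;
import java.net.URI;

/** Hypothetical helper: routes the JVM's HTTP requests through a caching proxy. */
class ProxySketch {
    /** Sets the standard Java networking properties for an HTTP proxy. */
    static void useCachingProxy(String host, int port) {
        System.setProperty("http.proxyHost", host);
        System.setProperty("http.proxyPort", Integer.toString(port));
    }

    /** Asks the default proxy selector which proxy it would use for a URL. */
    static Proxy proxyFor(String url) {
        return ProxySelector.getDefault().select(URI.create(url)).get(0);
    }

    public static void main(String[] args) {
        useCachingProxy("proxy.example.org", 3128); // placeholder host and port
        System.out.println(proxyFor("http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"));
    }
}
```

This only helps if the proxy actually caches responses; the properties themselves just redirect the traffic.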
It’s true that XHTML is mostly being replaced by HTML5 (including its XHTML form), but no HTML format ever really goes away completely. The issue came up because of an inquiry from the Harvard Library, which seems to have a fair amount of XHTML. I don’t want to ignore them completely.
I’m amazed anyone considers the HTML module in JHOVE to be of any value; usually my first recommendation for better performance is to disable that module. Still, people do use it.
It would definitely be interesting to see how much of Harvard Library’s XHTML actually validates, and based on that how valuable it would be to know that it validates…