JHOVE and XHTML

I’m surprised I only got a complaint about this recently. Using JHOVE to validate XHTML files is often painfully slow. In fact, using anything to validate them without caching or redirection of DTDs would be painfully slow. The DOCTYPE declaration brings in the standard XHTML DTD, and it in turn brings in lots of other DTDs. These all have URLs on w3.org. As you can imagine, this is a lot of traffic converging in one place, and the response is often very slow.

JHOVE has a remedy, but it turns out not to work in this case. In the configuration file, you can declare local copies of schemas and DTDs to be loaded by the SAX entity resolver. The declaration looks something like this:

 <module>
   <class>edu.harvard.hul.ois.jhove.module.XmlModule</class>
   <param>schema=http://www.w3.org/TR/REC-smil/SMIL10.dtd;/Users/gmcgath/schemas/SMIL10.dtd</param>
 </module>
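
For readers who haven't worked with SAX entity resolvers, here is a minimal Java sketch of the general idea: map a DTD's URL to a local file and hand the parser the local copy. This is not JHOVE's actual code. The class name LocalDtdResolver and its register method are made up for illustration; only the org.xml.sax interfaces are real.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

// Hypothetical resolver, for illustration only (not part of JHOVE).
class LocalDtdResolver implements EntityResolver {
    // Maps a system ID (the DTD's URL) to a local copy of the file.
    private final Map<String, Path> localCopies = new HashMap<>();

    void register(String systemId, Path localFile) {
        localCopies.put(systemId, localFile);
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId)
            throws IOException {
        Path local = localCopies.get(systemId);
        if (local == null) {
            // Returning null tells the parser to fetch the entity itself,
            // which usually means a network request.
            return null;
        }
        InputSource source = new InputSource(Files.newInputStream(local));
        // Keep the original URL as the system ID so the parser still
        // knows where the entity nominally came from.
        source.setSystemId(systemId);
        return source;
    }
}
```

A parser would pick this up through XMLReader.setEntityResolver. Anything not registered falls through to the default behavior, so a DTD that was missed still goes out to the network.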

Unfortunately, there are some problems in JHOVE 1.9. The HTML module processes XHTML files by passing them to the XML module. When the XML module is invoked this way, it doesn’t get the parameters that the config file declared for it. In JHOVE 1.10, I’ll fix this by having the HTML module pass its own parameters to the XML module. At present, JHOVE’s processing of XHTML files makes no use of the configuration file’s instructions to the entity resolver.

There’s another complication. The XHTML DTD invokes other DTDs, and JHOVE has to get every one of those in turn. Some of them have relative URLs to other DTDs; these break when they’re redirected to local files. Even making local copies of all the files doesn’t work, as JHOVE doesn’t handle the relative URLs correctly within the file system, and making them work would require changing some existing assumptions. The best fix for the user is to get JHOVE 1.10 when it’s ready (version 1.10B2 doesn’t have the XHTML fix yet) and edit all those files so that all the URLs are absolute.
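
To work out what the absolute URL for a given relative reference should be, something like the following Java sketch could help. It resolves a relative system ID against the URL the DTD originally came from. The class and method names are hypothetical, and the example values assume the XHTML 1.0 Strict DTD, which refers to xhtml-lat1.ent by a relative system ID.

```java
import java.net.URI;

// Hypothetical helper, for illustration only (not part of JHOVE).
class AbsoluteSystemId {
    // Resolves a relative system ID against the original URL of the
    // DTD that contains the reference.
    static String toAbsolute(String dtdUrl, String relativeSystemId) {
        return URI.create(dtdUrl).resolve(relativeSystemId).toString();
    }

    public static void main(String[] args) {
        System.out.println(toAbsolute(
                "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd",
                "xhtml-lat1.ent"));
        // prints http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent
    }
}
```

Each relative reference in the local DTD copies would be replaced by the value computed this way, and each resulting URL would then need its own entry in the configuration file.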

This is a big chunk of work, and I haven’t tested the approach fully. Any ideas on how this might be better handled would be appreciated.