Monthly Archives: May 2013

JHOVE and XHTML

I’m surprised I only got a complaint about this recently. Using JHOVE to validate XHTML files is often painfully slow. In fact, using anything to validate them without caching or redirection of DTDs would be painfully slow. The DOCTYPE declaration brings in the standard XHTML DTD, and it in turn brings in lots of other DTDs. These all have URLs on w3.org. As you can imagine, this is a lot of traffic converging in one place, and the response is often very slow.

JHOVE has a remedy, but it turns out not to work in this case. In the configuration file, you can declare local copies of schemas and DTDs to be loaded by the SAX entity resolver. The declaration looks something like this:

 <module>
   <class>edu.harvard.hul.ois.jhove.module.XmlModule</class>
   <param>schema=http://www.w3.org/TR/REC-smil/SMIL10.dtd;/Users/gmcgath/schemas/SMIL10.dtd</param>
 </module>
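For readers unfamiliar with SAX entity resolvers, the idea behind that parameter can be sketched roughly as follows. This is a minimal illustration, not JHOVE's actual code: the class name `LocalDtdResolver` and the method `addMapping` are made up for the example, and only the standard `org.xml.sax` interfaces are assumed.

```java
import java.io.File;
import java.util.HashMap;
import java.util.Map;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

// Hypothetical resolver: maps remote system IDs to local copies.
// Not taken from JHOVE; names are illustrative only.
final class LocalDtdResolver implements EntityResolver {
    private final Map<String, String> localCopies = new HashMap<String, String>();

    // Register a local file to be used in place of a remote DTD or schema.
    void addMapping(String systemId, String localPath) {
        localCopies.put(systemId, localPath);
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        String localPath = localCopies.get(systemId);
        if (localPath == null) {
            // Returning null tells the parser to fetch the entity itself.
            return null;
        }
        // Hand the parser a file: URI pointing at the local copy.
        InputSource source = new InputSource(new File(localPath).toURI().toString());
        source.setPublicId(publicId);
        return source;
    }
}
```

A resolver like this would be installed with `XMLReader.setEntityResolver` before parsing. Any system ID without a registered local copy still goes out to the network.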

Unfortunately, there are some problems in JHOVE 1.9. The HTML module processes XHTML files by passing them to the XML module. In this case, the module doesn’t get the parameters that the config file declared for it. In JHOVE 1.10, I’ll fix this by having the HTML module pass its own parameters to the XML module. At present, JHOVE’s processing of XHTML files makes no use of the configuration file’s instructions to the entity resolver.

There’s another complication. The XHTML DTD invokes other DTDs, and JHOVE has to get every one of those in turn. Some of them have relative URLs to other DTDs; these break when they’re redirected to local files. Even making local copies of all the files doesn’t work, as JHOVE doesn’t handle the relative URLs correctly within the file system, and making them work would require changing some existing assumptions. The best fix for the user is to get JHOVE 1.10 when it’s ready (version 1.10b2 doesn’t have the XHTML fix yet) and then edit all those files so that all the URLs are absolute.
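To see why relative references are awkward once a DTD has been redirected, here is a small sketch using only `java.net.URI`. It shows generic URI resolution, not what JHOVE does internally. The class name `RelativeDtdDemo` is made up, and the file names `xhtml1-strict.dtd` and `xhtml-lat1.ent` are used purely as illustrative examples of a DTD and an entity file it refers to by a relative URL.

```java
import java.net.URI;

// Hypothetical helper: resolves a relative system ID against the
// system ID of the document that referenced it.
final class RelativeDtdDemo {
    static String resolve(String baseSystemId, String relativeRef) {
        return URI.create(baseSystemId).resolve(relativeRef).toString();
    }
}
```

With the original base, `resolve("http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd", "xhtml-lat1.ent")` gives `http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent`. Once the base has been swapped for a local copy such as `file:/Users/gmcgath/schemas/xhtml1-strict.dtd`, the same reference becomes `file:/Users/gmcgath/schemas/xhtml-lat1.ent`, which no longer matches any mapping keyed on the w3.org URL. Making every URL in the local copies absolute sidesteps this.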

This is a big chunk of work, and I haven’t tested the approach fully. Any ideas on how this might be better handled would be appreciated.

JHOVE 1.10b2

I’ve put up JHOVE 1.10b2. It has a bit of optimization for the PDF module, though files with huge structure trees are still painfully slow.

Streaming protocols

Last week I was doing some consulting work on Wowza Media Server for the Harvard Library, and I noticed there are some issues about streaming protocols which often aren’t well understood. To help clarify them in my own mind, and hopefully provide a useful resource for others, I’ve put a page on Basics of Streaming Protocols on my business website.

If you notice anything that’s wrong or confusing, please let me know.

JHOVE 1.10b1

I’ve put up a new beta version of JHOVE, 1.10b1, on SourceForge.

The major change since last time is the handling of structure trees in PDF files; this should keep JHOVE from hanging or running out of memory on some PDF files as it used to. Please report any problems soon.

A PDF question

A while back, I posted a question on superuser.com about a PDF issue that’s causing problems in JHOVE. So far it hasn’t gotten any answers, so I’m signal-boosting my own question here. Here’s what I asked:

The JHOVE parser for PDF, which I maintain, will sometimes find a non-dictionary object in a PDF’s Annots array. According to section 8.4.1 of the PDF spec, the Annots array holds “an array of annotation dictionaries.” In the case that I’m looking at right now, there’s a keyword of “Annot” instead of a dictionary. Is this an invalid PDF file, or is there a subtlety in the spec which I’ve overlooked?

Answering on stackoverflow.com is best, so other people can see the answer, but if you prefer to answer here, I’ll post or summarize any useful response, with attribution, as an answer over there.

The future of WebM

Yesterday I posted about the WebP still image format, expressing some skepticism about how easily it will catch on. Its companion format for video, WebM, may stand a better chance, though. Images aren’t exciting any more; JPEG delivers photographs well enough, PNG does the same for line art, and there isn’t a compelling reason to change. Video is still in flux, though, and the high bandwidth requirements mean there’s a payoff for any improvements in compression and throughput. The long-running battle among HTML5 stakeholders over video shows that it’s far from being a settled area. Patents are a big issue; if you implement H.264, you have to pay money. Alternatives are attractive from both a technological and an economic standpoint.

With Google pushing WebM and having YouTube, there’s a clear reason for browser developers to support it. YouTube plans to use the new WebM codec, VP9, once it’s complete. I haven’t seen details of the plan, but most likely YouTube will make the same video available in multiple encodings and query the browser’s capabilities to determine whether it can accept VP9. If the advantage is real and users who can get it see fewer pauses in their videos, more browser makers will undoubtedly join the bandwagon.

An eye on WebP

Google has been promoting the WebP still image format for some time, and lately Facebook has added its support. It’s hard to displace the well-entrenched JPEG, but it could happen. WebP supports both lossy and lossless compression, and Google claims it offers a significant advantage in compression over PNG and JPEG. Google says it’s free of patent restrictions; the container is the familiar RIFF. The VP8 lossy format is available as an IETF RFC; a specification for the lossless format is also available.

The container spec supports XMP and Exif metadata. Canvas width and height can be as much as 16,777,216 pixels, though their product is limited to 4,294,967,296 pixels. As far as I can tell it doesn’t support tiling, though, so partial rendering of huge images in the style of JPEG2000 may not be practical.
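The two limits interact: an image can’t be at the maximum in both dimensions at once. A rough sketch of the check, taking the figures exactly as quoted above (the class and method names are made up, and whether the area bound is inclusive is an assumption here; the specification is the authority on that):

```java
// Hypothetical check of the canvas limits as stated in the text above.
final class WebPCanvasLimits {
    static final long MAX_SIDE = 1L << 24; // 16,777,216 pixels per side
    static final long MAX_AREA = 1L << 32; // 4,294,967,296 pixels in total (assumed inclusive)

    static boolean fits(long width, long height) {
        // long arithmetic: the largest possible product, 2^48, cannot overflow.
        return width >= 1 && height >= 1
            && width <= MAX_SIDE && height <= MAX_SIDE
            && width * height <= MAX_AREA;
    }
}
```

Under these assumptions, `fits(16777216, 100)` is true, while `fits(16777216, 16777216)` is false: a canvas of maximum width can be only a few hundred pixels tall.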

Chrome, Opera, and Android 4.0 (Ice Cream Sandwich) offer WebP support, but few other browsers or platforms do. Facebook’s offerings of WebP images have resulted in complaints from users whose browsers can’t read the format. The Firefox development team is starting to warm to it but hasn’t committed to anything yet. Internet Explorer hasn’t even reached that point.

It’s still early to make bets, but WebP increasingly bears watching. I’ve initiated a page for updates and errata for Files that Last with some updated information on WebP. (When I wrote the book, I couldn’t find the lossless spec.)

Using DROID with Java 7

It’s been a problem for a while that DROID 6 won’t run under Java 7. Matt Palmer has reported a simple fix for this, requiring only a change in pom.xml. Hopefully a release incorporating this change will appear soon.