Monthly Archives: November 2015

Iterating a directory in command line Tika

Apache Tika is best used as a library to wrap your own code around. Its GUI application is a toy, and its command line version isn’t all that great either. The command line can be improved with a little scripting, though.
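As a rough illustration of the kind of scripting meant here, the following Python sketch walks a directory and runs the Tika command line once per file. It is not the script from the full post. The jar name `tika-app.jar`, the helper names `tika_commands` and `run_tika`, and the choice of the `--detect` option are all assumptions made for the example.

```python
import subprocess
from pathlib import Path

# Assumed location of the Tika app jar; adjust to your installation.
TIKA_JAR = "tika-app.jar"


def tika_commands(directory, jar=TIKA_JAR, option="--detect"):
    """Return one Tika command (as an argument list) per regular file under directory."""
    root = Path(directory)
    # rglob("*") descends into subdirectories; keep regular files only.
    files = sorted(p for p in root.rglob("*") if p.is_file())
    return [["java", "-jar", str(jar), option, str(p)] for p in files]


def run_tika(directory, jar=TIKA_JAR, option="--detect"):
    """Run Tika once per file and map each file path to Tika's standard output."""
    results = {}
    for cmd in tika_commands(directory, jar, option):
        done = subprocess.run(cmd, capture_output=True, text=True)
        results[cmd[-1]] = done.stdout.strip()
    return results
```

Calling `run_tika("some/directory")` would need Java and the Tika jar to be present. Starting a new JVM for every file is slow, so a sketch like this only suits small directories.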

The FLIF format

New image file formats keep turning up, taking advantage of advances in compression technology. One of the latest is FLIF, Free Lossless Image Format. It claims to outcompress PNG, lossless JPEG2000, lossless WebP, and lossless BPG. Though it has only a lossless mode, it claims that “FLIF works well on any kind of image, so the end-user does not need to try different algorithms and parameters.”

The coming of WebP (or not)

The WebP image format has been around for about five years, but till recently it’s been mostly a curiosity. I last blogged about it in 2013, when it didn’t have very wide support. Since then most browsers have adopted it, and now Google+ is making more use of it (no surprise, since Google is the format’s principal backer). It promises smarter lossy compression than JPEG and smaller file sizes for the same image quality.


Video: Introduction to JHOVE

A new video on my YouTube channel offers a seven-minute introduction to JHOVE. This is a teaser for my upcoming video course on file format identification tools, as well as a public test of the techniques I’ve been developing. It’s a screen capture video, and I cover the GUI version, even if it’s not as widely used, because it lets me focus on the concepts, and because it’s silly to teach a command line application in a video.

A link roundup on file formats

3D printing is an exciting new technology, but the formats to choose from are an alphabet soup.

A call for “PDF 2.0” or an “Analytical File Format.” The description is vague, but it sounds like something analogous to the Semantic Web for documents.

BW64, a new RIFF-based audio format. The article describes it as a “3D” format, but more significantly it’s a metadata-rich interchange format that supports really big files.

And just for bitter laughs: “I need a ‘file’ format.”

A sock puppet mystery

The SourceForge repository for JHOVE (which is, by the way, obsolete; here’s the active repository) includes three short reviews which give it five stars and make very generic and identical comments. They’re dated on three successive days. Those are clear signs of sock-puppet accounts.

I can understand why people post glowing but fake reviews to their own project sites, but really, I’m not responsible for these, and I was the only person working on JHOVE at the time, so I can’t imagine who else had an incentive to promote it. Checking on one of these accounts, “rusik1978,” I find similar reviews on many other SourceForge projects. If they linked back to something it would make sense, but they don’t.

I’ve learned from this that sock puppet reviews don’t necessarily prove that the project owner is faking praise. Maybe that’s the point, to make it harder to identify the actual paid reviews?


The PDF Association and TWAIN Working Group have announced a partnership to develop a specification called PDF/Raster or PDF/R. It’s described as “a component of TWG’s TWAIN Direct™ initiative, a language/protocol that eliminates the need for users to install vendor specific drivers as communication between scanning devices and image capture software applications.”

McCoy on the future of PDF

Bill McCoy’s article, “Takeaways on the Future of Documents: Report from the 2015 PDF Technical Conference,” offers some interesting thoughts on the future of PDF. I can’t find much to disagree with. PDF is in practice a format for reproducing a specific document appearance, and that’s becoming less important as the variety of computing devices increases. He makes a point I hadn’t thought of, that the “de facto interoperable PDF format” is well behind the latest specifications, which may explain why I haven’t seen complaints that JHOVE doesn’t know about ISO 32000 PDF!

JHOVE 1.12 beta

JHOVE 1.12 will be the first release of JHOVE that I had no significant role in, but I’m still glad to see that the beta release is now available. I’ve downloaded it, run the installer (yes, there’s now an installer!), and then launched JHOVE without having to edit any configuration files by hand! That’s a huge advance by itself. Nice work by Carl Wilson and everyone else at the Open Preservation Foundation. It’s now built with Maven, and I’m sure that the building process is much better than the clunky old one.