Apache Tika is a Java-based open source toolkit for identifying files and extracting metadata and text content. I don’t have much personal experience with it, apart from having used it with FITS. Apache Software Foundation is actively maintaining it, and version 1.9 just came out on June 23, 2015. It can identify a wide range of formats and report metadata from a smaller but still impressive set. You can use Tika as a command line utility, a GUI application, or a Java library. You can find its source code on GitHub, or you can get its many components from the Maven Repository.
Tika isn’t designed to validate files. If it encounters a broken file, it won’t tell you much about how it violates the format’s expectations.
Originally it was a subproject of Lucene; it became a standalone project in 2010. It builds on existing parser libraries for various formats where possible. For some formats it uses its own libraries because nothing suitable was available. In most cases it relies on signatures or “magic numbers” to identify formats. While it identifies lots of formats, it doesn’t distinguish variants in as much detail as some other tools, such as DROID. Andy Jackson has written a document that sheds light on the comparative strengths of Tika and DROID. Developers can add their own plugins for unsupported formats. Solr and Lucene have built-in Tika integration.
Prior to version 1.9, Tika didn’t have support for batch processing. Version 1.9 has a tika-batch module, which is described in the change notes as “experimental.”
The book Tika in Action is available as an e-book (apparently DRM free, though it doesn’t specifically say so) or in a print edition. Anyone interested in using its API or building it should look at the detailed tutorial on tutorialspoint.com. The Tika facade serves basic uses of the API; more adventurous programmers can use the lower-level classes.
Next: NLNZ Metadata Extraction Tool. To read this series from the beginning, start here.