File identification tools, part 6: FITS

FITS is the File Information Tool Set, a “Swiss army knife” aggregating results from several file identification tools. The Harvard University Libraries created it, and though it was technically open-source from the beginning, it wasn’t very convenient for anyone outside Harvard at first. Other institutions showed interest, its code base moved from Google Code to GitHub, and now it’s used by a number of digital repositories to identify and validate ingested documents. Don’t confuse it with the FITS (Flexible Image Transport System) data format.

It’s a Java-based application requiring Java 7 or higher. Documentation is available on Harvard’s website. It wraps Apache Tika, DROID, ExifTool, FFIdent, JHOVE, the National Library of New Zealand Metadata Extractor, and four Harvard native tools. Work is currently under way to add the MediaInfo tool to enhance video file support. It’s released as open source software under the GNU LGPL. The release dates show there’s been a burst of activity lately, so make sure you have the latest version.

FITS is tailored for ingesting files into a repository. In its normal mode of operation, it processes whole directories, including all nested subdirectories, and produces a single XML output file, which can be in either the FITS schema or other standard schemas such as MIX. You can run it as a standalone application or as a library. It’s possible to add your own tools to FITS.

You run FITS from a command file, fits.bat on Windows and fits.sh on Unix/Linux systems, including the Mac. The user manual provides full information.
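As a rough illustration of picking the right command file per platform, here is a small Python sketch that builds the launch command. The function name fits_command and the install path are hypothetical. The -i (input) and -o (output) options are what I recall from the user manual, so treat them as an assumption and check the manual for your version.

```python
import platform
from pathlib import Path


def fits_command(fits_home, input_path, output_path, system=None):
    """Build the argument list for launching FITS through its command file.

    fits_home is wherever FITS was unpacked (hypothetical path in the example).
    The -i/-o options are assumed from the user manual; verify them there.
    """
    system = system or platform.system()
    # fits.bat on Windows, fits.sh everywhere else (Unix, Linux, Mac).
    script = "fits.bat" if system == "Windows" else "fits.sh"
    return [
        str(Path(fits_home) / script),
        "-i", str(input_path),
        "-o", str(output_path),
    ]


if __name__ == "__main__":
    cmd = fits_command("/opt/fits", "ingest/", "fits-output.xml")
    print(" ".join(cmd))
    # To actually run it: subprocess.run(cmd, check=True)
```

Passing the result to subprocess.run as a list avoids shell quoting problems with file names that contain spaces.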

You configure FITS with the file xml/fits.xml. It allows you to select which tools to use, and which file extensions each one will process. The <tool> element defines a tool to be used; its class attribute identifies its main class. If you want it to run only on files with certain extensions, specify the include-exts attribute with a comma-separated list of extensions, not including the period. To run it on all extensions except certain ones, specify the exclude-exts attribute with a comma-separated list of excluded extensions. The <output> element is trickier to deal with, and you shouldn’t mess with the <process> element unless you really need to tune performance.
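To make the include-exts and exclude-exts rules concrete, here is a minimal Python sketch that reads a made-up configuration fragment and reports which tools would apply to a given file. This is not FITS’s own code. The wrapper elements and the example.tools.* class names are placeholders, and the case-insensitive matching and the handling of a tool with no extension attributes are my assumptions. Only the <tool> element and its class, include-exts, and exclude-exts attributes come from the description above.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment in the spirit of xml/fits.xml; the real file's
# layout and tool class names will differ.
SAMPLE_CONFIG = """
<fits_configuration>
  <tools>
    <tool class="example.tools.ImageTool" include-exts="jpg,tif,png"/>
    <tool class="example.tools.GeneralTool" exclude-exts="zip,xml"/>
    <tool class="example.tools.CatchAllTool"/>
  </tools>
</fits_configuration>
"""


def _ext_set(value):
    """Turn 'jpg, tif' into {'jpg', 'tif'}; return None if the attribute is absent."""
    if value is None:
        return None
    return {e.strip().lower() for e in value.split(",") if e.strip()}


def tools_for(filename, config_xml=SAMPLE_CONFIG):
    """Return the class names of the tools that would process filename."""
    _, dot, ext = filename.rpartition(".")
    ext = ext.lower() if dot else ""  # no period means no extension
    selected = []
    for tool in ET.fromstring(config_xml).iter("tool"):
        include = _ext_set(tool.get("include-exts"))
        exclude = _ext_set(tool.get("exclude-exts"))
        if include is not None and ext not in include:
            continue  # tool is limited to other extensions
        if exclude is not None and ext in exclude:
            continue  # tool explicitly skips this extension
        selected.append(tool.get("class"))
    return selected


if __name__ == "__main__":
    for name in ("photo.JPG", "archive.zip", "notes.txt"):
        print(name, "->", tools_for(name))
```

In this sketch a tool with neither attribute runs on everything, which matches the idea that the attributes only narrow down a tool’s default behavior.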

FITS runs ExifTool as a separate process, since ExifTool is a Perl program. If your system doesn’t support Perl, ExifTool won’t run but everything else will still work.

I didn’t work directly on FITS when I was at Harvard, aside from my work on JHOVE. In 2013, though, I traveled to the University of Leeds, where I joined some others in demonstrating ways FITS could be improved. That led to my getting a SPRUCE grant to implement the proposed improvements, and parts of this work were incorporated into the main line of the application.

The last I checked, FITS uses an old version of JHOVE because of compatibility issues. I don’t know if this has been updated.

Next: Apache Tika. To read this series from the beginning, start here.
