My article on “File Format Analysis Tools for Archivists” is up on LWN.net.
- Follow Mad File Format Science on WordPress.com
FITS is the File Information Tool Set, a “Swiss army knife” aggregating results from several file identification tools. The Harvard University Libraries created it, and though it was technically open-source from the beginning, it wasn’t very convenient for anyone outside Harvard at first. Other institutions showed interest, its code base moved from Google Code to GitHub, and now it’s used by a number of digital repositories to identify and validate ingested documents. Don’t confuse it with the FITS (Flexible Image Transport System) data format.
It’s a Java-based application requiring Java 7 or higher. Documentation is found on Harvard’s website. It wraps Apache Tika, DROID, ExifTool, FFIdent, JHOVE, the National Library of New Zealand Metadata Extractor, and four Harvard native tools. Work is currently under way to add the MediaInfo tool to enhance video file support. It’s released as open source software under the GNU LGPL license. The release dates show there’s been a burst of activity lately, so make sure you have the latest and best version.
FITS is tailored for ingesting files into a repository. In its normal mode of operation, it processes whole directories, including all nested subdirectories, and produces a single XML output file, which can be in either the FITS schema or other standard schemas such as MIX. You can run it as a standalone application or as a library. It’s possible to add your own tools to FITS.
Last spring, I attended a Hackathon at the University of Leeds, which resulted in my getting a SPRUCE Grant for a month’s work enhancing FITS, a tool which at the time was technically open source but which the Harvard Library treated a bit possessively. After I finished, it seemed for a while that nothing was happening with my work, but it was just a matter of being patient enough. Collaboration between Harvard and the Open Planets Foundation has resulted in a more genuinely open FITS, which now has its own website. There’s also a GitHub repository with five contributors, none of which are me since my work was on an earlier repository that was incorporated into this one.
It really makes me happy to see my work reach this kind of fruition, even if I’m so busy on other things now that I don’t have time to participate.
Back in May, after an enjoyable trip to the University of Leeds, I worked for a month on improving the Harvard Library’s FITS tool for combining the results of several file format identification and validation tools. The results were well received and the Harvard Library incorporated some of my work in the main line of FITS. Still, there were a lot of loose ends left and more work to be done.
Things are picking up again with a “FITS Blitz” that’s starting this week. Paul Wheatley writes that “in partnership with Harvard and the Open Planets Foundation (with support from Creative Pragmatics), SPRUCE is supporting a two week project to get the technical infrastructure in place to make FITS genuinely maintainable by the community. ‘FITS Blitz’ will merge the existing code branches and establish a comprehensive testing setup so that further code developments only find their way in when there is confidence that other bits of functionality haven’t been damaged by the changes.”
I’ve moved on to other things, so I won’t be able to participate, but I wish them every success.
It would be helpful for me to have at least a partial list of institutions that are using Harvard’s FITS (File Information Tool Set). If you can help me build this list, could you reply here or contact me by other usual channels? Thanks.
January’s mostly over, and I’ve only posted three times to this blog. Files that Last has been keeping me busy. My posting should pick up again before long, once I get a draft out to first readers.
One thing I’ve been looking at, with an eye to the upcoming SPRUCE Hackathon, is things that can be done with FITS. I’ve written up the results of some profiling experiments and quick attempts at optimization. FITS puts together a lot of tools for extracting file metadata, but there have been some complaints that it’s not as fast as it might be. The first results were surprising; the easiest way to get a small improvement was to factor out the initialization of namespace URIs for parsing XML. You wouldn’t think that would make any detectable difference, but the initialization of URIs in Xerces is surprisingly slow.
Another possibility to explore is improving the connection between FITS and JHOVE. Even though JHOVE is intended for use as a callable library, among other things, it’s designed to write to an output file. Some simple changes would let it provide an in-memory response without writing a file, which would be more useful to an application like FITS.