FITS is the File Information Tool Set, a “Swiss army knife” aggregating results from several file identification tools. The Harvard University Libraries created it, and though it was technically open-source from the beginning, it wasn’t very convenient for anyone outside Harvard at first. Other institutions showed interest, its code base moved from Google Code to GitHub, and now it’s used by a number of digital repositories to identify and validate ingested documents. Don’t confuse it with the FITS (Flexible Image Transport System) data format.
It’s a Java-based application requiring Java 7 or higher. Documentation is found on Harvard’s website. It wraps Apache Tika, DROID, ExifTool, FFIdent, JHOVE, the National Library of New Zealand Metadata Extractor, and four Harvard native tools. Work is currently under way to add the MediaInfo tool to enhance video file support. It’s released as open source software under the GNU LGPL license. The release dates show there’s been a burst of activity lately, so make sure you have the latest and best version.
FITS is tailored for ingesting files into a repository. In its normal mode of operation, it processes whole directories, including all nested subdirectories, and produces a single XML output file, which can be in either the FITS schema or other standard schemas such as MIX. You can run it as a standalone application or as a library. It’s possible to add your own tools to FITS.
Continue reading →
Last spring, I attended a Hackathon at the University of Leeds, which resulted in my getting a SPRUCE Grant for a month’s work enhancing FITS, a tool which at the time was technically open source but which the Harvard Library treated a bit possessively. After I finished, it seemed for a while that nothing was happening with my work, but it was just a matter of being patient enough. Collaboration between Harvard and the Open Planets Foundation has resulted in a more genuinely open FITS, which now has its own website. There’s also a GitHub repository with five contributors, none of which are me since my work was on an earlier repository that was incorporated into this one.
It really makes me happy to see my work reach this kind of fruition, even if I’m so busy on other things now that I don’t have time to participate.
Back in May, after an enjoyable trip to the University of Leeds, I worked for a month on improving the Harvard Library’s FITS tool for combining the results of several file format identification and validation tools. The results were well received and the Harvard Library incorporated some of my work in the main line of FITS. Still, there were a lot of loose ends left and more work to be done.
Things are picking up again with a “FITS Blitz” that’s starting this week. Paul Wheatley writes that “in partnership with Harvard and the Open Planets Foundation (with support from Creative Pragmatics), SPRUCE is supporting a two week project to get the technical infrastructure in place to make FITS genuinely maintainable by the community. ‘FITS Blitz’ will merge the existing code branches and establish a comprehensive testing setup so that further code developments only find their way in when there is confidence that other bits of functionality haven’t been damaged by the changes.”
I’ve moved on to other things, so I won’t be able to participate, but I wish them every success.
Last Friday’s CURATEcamp AVpres was a collaboration between several physical sites, using Google Hangout and IRC. I’d been asked if I could do a lightning presentation online on my work on FITS, but I had a commitment on the 19th, so Andrea Goethals at the Harvard Library said she’d do one.
That, unfortunately, was the day the Tsarnaev brothers went on their spree in Cambridge, and Harvard was closed for the day. Paul Wheatley picked up the job on short notice and did a presentation; the slide show is online. Paul suggested people should look at the work I’m putting on the Github repository after I’m finished at the end of April, but I wouldn’t mind if people tried it out now, while I’m still devoting my time to the project.
Posted in Links, News
Tagged FITS, software
For the month of April, I’ll be working full time under a SPRUCE grant on making improvement to FITS, the Harvard Library’s File Information Tool Set. For this purpose I’ve created a fork on Github. This is a fresh fork from Harvard’s FITS repository on Github, so I’ve marked my older fork, OpenFITS, as deprecated. Harvard’s official version of FITS is still the Google Code repository.
My Github repository includes a wiki where I’ll describe progress in detail. The issues area is available for input. I don’t have any plans to address existing bugs in FITS, so please use this just for input on my work, including suggestions.
One area where I really want input is on what FITS should produce for video metadata. There isn’t a lot of consensus yet on what product-independent video metadata should look like. FITS has six different categories of files, each with its own metadata set: text, document, image, audio, video, and unknown. The metadata produced is a composite of the output from the various tools (including Tika, which I’m adding). The point isn’t to use an existing schema, but to put together a list of elements that characterize a video document. “Significant properties,” as people don’t like to say. XMP and MPEG-7 provide ideas, and most if not all audio metadata elements are also applicable to video. I’ve started a wiki page on video metadata within the Github project.
If you have an interest in shaping the output of FITS for video files, please provide input by commenting here, putting an issue on Github, emailing me, or whatever works best for you.
It would be helpful for me to have at least a partial list of institutions that are using Harvard’s FITS (File Information Tool Set). If you can help me build this list, could you reply here or contact me by other usual channels? Thanks.
January’s mostly over, and I’ve only posted three times to this blog. Files that Last has been keeping me busy. My posting should pick up again before long, once I get a draft out to first readers.
One thing I’ve been looking at, with an eye to the upcoming SPRUCE Hackathon, is things that can be done with FITS. I’ve written up the results of some profiling experiments and quick attempts at optimization. FITS puts together a lot of tools for extracting file metadata, but there have been some complaints that it’s not as fast as it might be. The first results were surprising; the easiest way to get a small improvement was to factor out the initialization of namespace URIs for parsing XML. You wouldn’t think that would make any detectable difference, but the initialization of URIs in Xerces is surprisingly slow.
Another possibility to explore is improving the connection between FITS and JHOVE. Even though JHOVE is intended for use as a callable library, among other things, it’s designed to write to an output file. Some simple changes would let it provide an in-memory response without writing a file, which would be more useful to an application like FITS.