I’ve just gotten back from a “hackathon” at the University of Leeds, where about twenty specialists in digital preservation software got together and coded for two days. It was exciting to be with so many people in the field whom I’d previously known only through the Internet or hadn’t seen in years.
After an initial struggle with the university Wi-Fi, we coalesced into four groups to try to get demo-worthy projects done in the time available. There was a lot of interest in the Tika content analysis tool, with two of the projects being directly related to it. I was glad to learn that JHOVE2 is still alive, after a long period of seeming stagnation, and that a new release will be out soon.
It was evident from the discussions that once JHOVE2 becomes more widely used, there will be a lot of confusion between it and JHOVE, which are two entirely different products despite the similarity of their names. Should JHOVE become “JHOVE Classic”? Should JHOVE2 get a new name? Any thoughts on this?
The bit I was working on was extending FITS to add Tika to its collection of tools. Spencer McEwen, an ex-colleague from Harvard, kindly headed up the effort; Michael (last name?) from York also participated, and we got occasional help from several people outside our team. The messiest issue we ran into was getting Tika to give us the name of a file’s format (in addition to its MIME type, which is easy), and we found Tika’s metadata vocabulary rather haphazard. We worked past these problems, though, and got a demo showing (if you were willing to read through piles of XML output) that Tika was running alongside the other tools and extracting some metadata from JPEG and PDF files.
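To illustrate the problem: a detector hands back a MIME type like image/jpeg readily, but the human-readable format name has to come from somewhere else (Tika’s mime registry does carry descriptions, though getting at the one we wanted wasn’t automatic). A minimal, hypothetical sketch of a lookup-table workaround — the class, names, and mappings here are illustrative, not FITS’s actual code:

```java
import java.util.Map;

// Hypothetical sketch, not FITS's actual code: the detector gives us a
// MIME type easily, and a small lookup table supplies a human-readable
// format name for it.
public class FormatName {
    static final Map<String, String> NAMES = Map.of(
        "image/jpeg", "JPEG File Interchange Format",
        "application/pdf", "Portable Document Format");

    static String formatName(String mimeType) {
        // Fall back to the raw MIME type when we have no friendlier name.
        return NAMES.getOrDefault(mimeType, mimeType);
    }

    public static void main(String[] args) {
        System.out.println(formatName("image/jpeg"));
    }
}
```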
We worked from Spencer’s fork of Harvard’s GitHub FITS project, which may replace the Google Code repository. This got us into issues of multiple users working on the same project at the same time and resolving code collisions. Git is supposed to have excellent facilities for this sort of thing, but they clearly take some learning. I could “stash” a repository but then couldn’t figure out how to get it back.
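For future reference (mostly my own), the pair of commands I was missing works roughly like this — the repo and file names are just for illustration:

```shell
# Throwaway repo to demonstrate stashing (names are illustrative)
git init -q stash-demo
git -C stash-demo config user.email demo@example.com
git -C stash-demo config user.name Demo
echo first > stash-demo/notes.txt
git -C stash-demo add notes.txt
git -C stash-demo commit -qm "initial"

echo second >> stash-demo/notes.txt   # uncommitted work in progress
git -C stash-demo stash               # saves it away; working tree is clean again
git -C stash-demo stash list          # the saved work shows up as stash@{0}
git -C stash-demo stash pop           # restores the change and drops the stash entry
cat stash-demo/notes.txt              # both lines are back
```

(`git stash apply` does the same as `pop` but keeps the stash entry around.)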
It was very energizing just to sit down with people and throw together code without meetings and managers to get in the way, as if I were a college student again. Hopefully some long-lasting results will come of this. I wouldn’t mind doing something like this again, though a trip to England is expensive.
I’ll add links to other posts on the event as I find them:
January’s mostly over, and I’ve only posted three times to this blog. Files that Last has been keeping me busy. My posting should pick up again before long, once I get a draft out to first readers.
One thing I’ve been looking at, with an eye to the upcoming SPRUCE Hackathon, is what can be done with FITS. I’ve written up the results of some profiling experiments and quick attempts at optimization. FITS brings together a lot of tools for extracting file metadata, but there have been some complaints that it’s not as fast as it might be. The first results were surprising; the easiest way to get a small improvement was to factor out the initialization of namespace URIs used in parsing XML. You wouldn’t think that would make any detectable difference, but the initialization of URIs in Xerces is surprisingly slow.
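The pattern, in the abstract, is just hoisting invariant setup out of the per-file path so it runs once rather than once per file. A sketch of the idea — this is illustrative, not FITS’s actual code, and the namespace URI is just an example:

```java
import java.net.URI;

// Illustrative sketch of the optimization pattern (not FITS's actual code):
// hoist invariant initialization out of the per-file path. Parsing a URI
// looks cheap, but repeated over thousands of files it adds up.
public class NamespaceConstants {
    // After the change: parsed once, at class load time, and reused.
    static final URI FITS_NS =
        URI.create("http://hul.harvard.edu/ois/xml/ns/fits/fits_output");

    static URI namespaceForFile() {
        // Before the change, the equivalent of the URI.create() call above
        // ran here, once for every file processed.
        return FITS_NS;
    }

    public static void main(String[] args) {
        System.out.println(namespaceForFile().getHost());
    }
}
```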
Another possibility to explore is improving the connection between FITS and JHOVE. Even though JHOVE is intended, among other things, for use as a callable library, it’s designed to write its results to an output file. Some simple changes would let it provide an in-memory response without writing a file, which would be more useful to an application like FITS.
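The usual shape of such a change is to have the report generator target a java.io.Writer rather than a file, so a caller can pass a StringWriter and get the result in memory. A minimal sketch under that assumption — this is not JHOVE’s actual API, just the general technique:

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;

// Hypothetical sketch (not JHOVE's actual API): a generator that writes to
// any Writer serves both file output and in-memory callers with one code path.
public class ReportDemo {
    static void writeReport(Writer out, String body) throws IOException {
        out.write("<report>" + body + "</report>");
        out.flush();
    }

    public static void main(String[] args) throws IOException {
        // In-memory use, as a caller like FITS might prefer:
        StringWriter buf = new StringWriter();
        writeReport(buf, "jpeg metadata");
        System.out.println(buf.toString());
        // File output would pass a FileWriter instead; no other change needed.
    }
}
```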
What happens when you get a bunch of developers from all over the world together on the Internet for one day of intensive work? A lot! For one thing, there’s the “Louis Wu’s birthday” effect; this “24-hour hackathon” was more like 48 hours. (In Niven and Pournelle’s Ringworld, Wu makes his birthday party last 48 hours by hopping from time zone to time zone with teleporters.) We didn’t have teleporters, so we made do with Twitter, IRC, and Google Hangouts. People in Australia started, and things wound down on the US west coast or maybe Hawaii.
Several things were happening, but the two most notable from my perspective were the Format Corpus project and the fork of FITS.
I watched the Format Corpus project with interest, though I didn’t participate in it. This is an openly licensed set of small example files in a wide variety of formats, as well as signature information. It could have a lot of uses; I’ll need to incorporate it into JHOVE testing.
People had been talking in advance of the hackathon about the need to improve the efficiency of FITS, a meta-tool developed by Harvard’s OIS (now LTS) to run various validation tools together on files. Internal ingest was and is the main purpose of FITS, but it was put up as open source and has been used in other places. I’d never worked on FITS proper at OIS (though I wrote parts of OTS-Schemas, which was broken out of FITS), but I’m familiar with the OIS style of coding, so I forked it on GitHub and started looking at it. When Randy Stern at Harvard expressed concern that the fork would create confusion (though I’d included a clear disclaimer from the beginning that it wasn’t the official version), I renamed it OpenFITS.
The work is summarized on the hackathon wiki. The results are unclear at this point, but just opening the code up to more eyes could produce long-term benefits. The very first file I tested FITS on turned up a bug in JHOVE, and I wound up doing more work improving JHOVE than FITS. One potentially significant improvement I added was the ability to specify local copies of any XML schema. If you’re validating a lot of XML files that use the same schema, JHOVE otherwise has to fetch it from the Web for each file, slowing the processing down. Taking advantage of this requires local configuration, since every installation could need different schemas. The code is checked in but not available in a build yet.
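The standard mechanism for this sort of thing in Java XML parsing is an EntityResolver that substitutes a local copy when the parser asks for a schema URL. A sketch of that technique, under the assumption of a simple URL-to-path map — not JHOVE’s actual code:

```java
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import java.io.FileReader;
import java.io.IOException;
import java.util.Map;

// Hypothetical sketch (not JHOVE's actual code): an EntityResolver that
// hands the parser a locally configured copy of a schema instead of letting
// it fetch the same URL from the Web for every file validated.
public class LocalSchemaResolver implements EntityResolver {
    private final Map<String, String> localCopies; // schema URL -> local path

    public LocalSchemaResolver(Map<String, String> localCopies) {
        this.localCopies = localCopies;
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId)
            throws IOException {
        String path = localCopies.get(systemId);
        // Returning null tells the parser to resolve normally (over the network).
        return path == null ? null : new InputSource(new FileReader(path));
    }

    public static void main(String[] args) throws IOException {
        LocalSchemaResolver r = new LocalSchemaResolver(
            Map.of("http://example.org/demo.xsd", "/etc/schemas/demo.xsd"));
        // An unmapped schema falls through to normal resolution:
        System.out.println(r.resolveEntity(null, "http://unmapped.example/x.xsd"));
    }
}
```

The resolver would be installed on the parser with XMLReader.setEntityResolver(), and the map is where per-installation configuration comes in.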
It was thrilling to work with such an enthusiastic crowd from so many different places and, in a single 48-hour day, to see other people picking up my work and running with it. I think there are already two or three third-generation forks of OpenFITS, including a Debian-Ubuntu package.
The information on just how Friday’s CURATEcamp 24 hour worldwide file id hackathon will work has been tricky for me to find, so here’s a summary for participants who read this blog:
Twitter: Hashtag #fileidhack
IRC: Server is irc.oftc.net, channel is #openarchives
The information is on the main wiki page for the hackathon, but it’s a little hard to spot with everything else that’s there.
See some of you there!