Hackathon at Leeds

I’ve just gotten back from a “hackathon” at the University of Leeds, where about twenty specialists in digital preservation software got together and coded for two days. It was exciting to be with so many people in the field whom I’d previously known only through the Internet or hadn’t seen in years.

After an initial struggle with the university Wi-Fi, we coalesced into four groups to try to get demo-worthy projects done in the time available. There was a lot of interest in the Tika content analysis tool, with two of the projects being directly related to it. I was glad to learn that JHOVE2 is still alive, after a long period of seeming stagnation, and that a new release will be out soon.

It was evident from the discussions that once JHOVE2 becomes more widely used, there will be a lot of confusion about it and JHOVE, which are two entirely different products in spite of the similarity of names. Should JHOVE become “JHOVE Classic”? Should JHOVE2 get a new name? Any thoughts on this?

The bit that I was working on was extending FITS to add Tika to its collection of tools. Spencer McEwen, an ex-colleague from Harvard, nicely headed up the effort; Michael (last name?) from York also participated, and we got occasional help from several people outside our team. The messiest issue we ran into was getting Tika to give us the name of a file’s format (in addition to its MIME type, which is easy); also, we found Tika’s metadata vocabulary rather haphazard. We worked past these problems, though, and were able to get a demo that showed (if you were willing to read through piles of XML output) that Tika was being used along with the other tools and extracting some metadata about JPEG and PDF files.

We worked from Spencer’s fork of Harvard’s GitHub FITS project, which may replace the Google Code repository. This got us into issues of multiple users working on the same project at the same time and resolving code collisions. Git is supposed to have excellent facilities for this sort of thing, but they clearly take some learning. I could “stash” a repository but then couldn’t figure out how to get it back.

It was very energizing just to sit down with people and throw together code without meetings and managers to get in the way, as if I were a college student again. Hopefully some long-lasting results will come of this. I wouldn’t mind doing something like this again, though a trip to England is expensive.

I’ll add links to other posts on the event as I find them:

4 responses to “Hackathon at Leeds

  1. > git stash pop

    will pop the last stashed changeset from the stack and apply it. v handy.

  2. Hi Gary,

    independently of everything else we’ve developed a partial parser for Apache Tika see https://issues.apache.org/jira/browse/TIKA-874 for our comment and the links