Hackathon at Leeds

I’ve just gotten back from a “hackathon” at the University of Leeds, where about twenty specialists in digital preservation software got together and coded for two days. It was exciting to be with so many people in the field whom I’d previously known only through the Internet or hadn’t seen in years.

After an initial struggle with the university Wi-Fi, we coalesced into four groups to try to get demo-worthy projects done in the time available. There was a lot of interest in the Tika content analysis tool, with two of the projects being directly related to it. I was glad to learn that JHOVE2 is still alive, after a long period of seeming stagnation, and that a new release will be out soon.

It was evident from the discussions that once JHOVE2 becomes more widely used, there will be a lot of confusion between it and JHOVE, which are two entirely different products in spite of the similarity of their names. Should JHOVE become “JHOVE Classic”? Should JHOVE2 get a new name? Any thoughts on this?

The bit that I was working on was extending FITS to add Tika to its collection of tools. Spencer McEwen, an ex-colleague from Harvard, nicely headed up the effort; Michael (last name?) from York also participated, and we got occasional help from several people outside our team. The messiest issue we ran into was getting Tika to give us the name of a file’s format (in addition to its MIME type, which is easy); also, we found Tika’s metadata vocabulary rather haphazard. We worked past these problems, though, and were able to get a demo that showed (if you were willing to read through piles of XML output) that Tika was being used along with the other tools and extracting some metadata about JPEG and PDF files.

We worked from Spencer’s fork of Harvard’s GitHub FITS project, which may replace the Google Code repository. This got us into issues of multiple users working on the same project at the same time and resolving code collisions. Git is supposed to have excellent facilities for this sort of thing, but they clearly take some learning. I could “stash” a repository but then couldn’t figure out how to get it back.

It was very energizing just to sit down with people and throw together code without meetings and managers to get in the way, as if I were a college student again. Hopefully some long-lasting results will come of this. I wouldn’t mind doing something like this again, though a trip to England is expensive.

I’ll add links to other posts on the event as I find them:

Worldwide file ID hackathon

What happens when you get a bunch of developers from all over the world together on the Internet for one day of intensive work? A lot! For one thing, there’s the “Louis Wu’s birthday” effect; this “24-hour hackathon” was more like 48 hours. (In Larry Niven’s Ringworld, Wu makes his birthday party last 48 hours by hopping from time zone to time zone with teleporters.) We didn’t have teleporters, so we made do with Twitter, IRC, and Google Hangouts. People in Australia started, and things wound down on the US west coast or maybe Hawaii.

Several things were happening, but the two most notable from my perspective were the Format Corpus project and the fork of FITS.

I watched the Format Corpus project with interest, though I didn’t participate in it. This is an openly licensed set of small example files in a wide variety of formats, as well as signature information. It could have a lot of uses; I’ll need to incorporate it into JHOVE testing.

People had been talking in advance of the hackathon about the need to improve the efficiency of FITS, a meta-tool developed by Harvard’s OIS (now LTS) to run various validation tools together on files. Internal ingest was and is the main purpose of FITS, but it was put up as open source and has been used in other places. I’d never worked on FITS proper at OIS (though I wrote parts of OTS-Schemas, which was broken out of FITS), but I’m familiar with the OIS style of coding, so I forked it on GitHub and started looking at it. When Randy Stern at Harvard expressed concerns that the fork would create confusion (though I’d put a clear disclaimer on it from the beginning that it wasn’t the official version), I renamed it to OpenFITS.

The work is summarized on the hackathon wiki. The results are unclear at this point, but just opening the code up to more eyes could produce long-term benefits. The very first file I tested FITS on turned up a bug in JHOVE, and I wound up doing more work improving JHOVE than FITS. One potentially significant improvement that I added was the ability to specify local copies of any XML schema. Without this, if you’re validating a lot of XML files that use the same schema, JHOVE has to fetch it from the Web, which slows processing down. Taking advantage of the feature requires local configuration, since every installation could need different schemas. The code is checked in but not available in a build yet.
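
To give a rough idea of what “local copies of a schema” means, here is a minimal sketch in Python. It is not JHOVE’s code (JHOVE is written in Java), and the class name, the mapping format, and the example URL are all made up for illustration. The point is only the lookup order: check a locally configured map of schema URLs first, and go to the Web only when there is no local copy.

```python
from pathlib import Path
from urllib.request import urlopen


class LocalSchemaResolver:
    """Hypothetical helper: serve schemas from local files when configured."""

    def __init__(self, local_copies, fetch=None):
        # local_copies maps a schema URL to the path of a local copy.
        self.local_copies = {url: Path(p) for url, p in local_copies.items()}
        # fetch can be replaced (e.g. in tests) to avoid real network access.
        self.fetch = fetch or self._fetch_from_web
        # Counts how often we had to fall back to the remote fetch.
        self.remote_fetches = 0

    @staticmethod
    def _fetch_from_web(url):
        # Slow path: download the schema from its published location.
        with urlopen(url) as response:
            return response.read()

    def resolve(self, schema_url):
        """Return the schema bytes, preferring a configured local copy."""
        local_path = self.local_copies.get(schema_url)
        if local_path is not None and local_path.is_file():
            return local_path.read_bytes()
        self.remote_fetches += 1
        return self.fetch(schema_url)
```

Each installation would supply its own mapping, something like `LocalSchemaResolver({"http://example.org/example.xsd": "/opt/schemas/example.xsd"})`, which is why the feature only helps after some local configuration.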

It was thrilling to get to work with such an enthusiastic crowd from so many different places and, in a single 48-hour day, to see other people picking up my work and running it. I think there are already two or three third-generation forks of OpenFITS, including a Debian-Ubuntu package.

Notes on Friday’s Hackathon

The information on just how Friday’s CURATEcamp 24-hour worldwide file ID hackathon will work has been tricky for me to find, so here’s a summary for participants who read this blog:

Twitter: Hashtag #fileidhack
IRC: Server is irc.oftc.net, channel is #openarchives

The information is on the main wiki page for the hackathon, but it’s a little hard to spot with everything else that’s there.

See some of you there!

Online file ID hackathon

CURATEcamp and Open Planets Foundation will hold a 24-hour (possibly more, due to time zones) online hackathon on file identification on Friday, November 16. The announcement says:

24hour+ live hackathon event where multi-time zone teams work on common technical projects related to the CURATEcamp iPres 2012 file id discussions.

Project proposals can be made by anyone.

We will start the day with New Zealand (GMT +12:00) and end with North America West Coast wrapping up project(s), hopefully with one or two solid deliverables by 12 midnight-ish PST (GMT -8:00).

iPRES 2012

iPRES 2012 now has real information on its website.

iPRES proceedings

The iPRES proceedings for 2011 are now available.

iPRES 2012 will be in Toronto, making it the most convenient one for Americans in years. It will be September 30 to October 5 (which is when I was planning to be in Germany … just can’t win).

The future of file format identification

From the Digital Preservation Coalition website:

The National Archives is proposing to launch a new phase of development of its DROID tool, and is seeking to engage with various user groups and stakeholders from the digital preservation community, government and the wider archives sector communities to help inform and discuss potential developments and user needs. As part of this process, The National Archives, in conjunction with the Digital Preservation Coalition, invites interested parties to attend a one day workshop, hosted at Kew, to discuss their experiences of using DROID and PRONOM in their respective disciplines, discuss how the tools fit their use case, and describe both positive and negative experiences of the tools and their interaction with The National Archives.

The workshop will be at the National Archives in Kew, London, on November 28. Registration is free for DPC members and associates and cheap for everyone else.


W3C has announced an upcoming conference on “HTML5 and the Open Web Platform”. The total information currently available is:

W3C, the web standards organization, is holding its first conference.
If you are a developer or designer wanting to hear the latest news on HTML5 and the open web platform, and your place in it, save the date. This event will be held in Seattle and live streaming to the world on November 15-16.
More details soon…

This is very short notice for a conference, but the topic is interesting.

JPEG 2000 summit presentations

Presentations from May’s JPEG 2000 Summit are now available online.

JPEG 2000 summit

It’s a bit late to get there if you didn’t already know about it, but the Library of Congress is hosting a JPEG 2000 summit in Washington today and tomorrow. Hopefully some interesting materials will be made public.