Notes on Friday’s Hackathon

The information on just how Friday’s CURATEcamp 24-hour worldwide file ID hackathon will work has been tricky for me to find, so here’s a summary for participants who read this blog:

Twitter: Hashtag #fileidhack
IRC: Server is irc.oftc.net, channel is #openarchives

The information is on the main wiki page for the hackathon, but it’s a little hard to spot with everything else that’s there.

See some of you there!

Embracing the chaos of formats

We often think of formats in terms of specifications and standards, and this can be a useful thing. If you want to know exactly what the Latin-1 encoding is, you can look at the ISO-8859-1 standard and it will tell you. However, this isn’t always a reliable guide to what’s out there. Someone noticed that ISO-8859-1 sets aside the byte range 0x80-0x9F for control codes that are rarely used, and put additional printing characters there. This got codified as well, as Windows-1252 (which Microsoft falsely claims as an ANSI standard), but there are many ad hoc or obscure encodings which are hard or impossible to find references for.
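To make the Latin-1 versus Windows-1252 difference concrete, here’s a minimal Python sketch. It decodes the same two bytes both ways and adds a rough heuristic for spotting text that is probably Windows-1252 even though it’s labeled Latin-1. The function name is my own invention for illustration, and the heuristic is only a hint, not a reliable detector.

```python
# The same bytes mean different things under the two encodings.
# 0x93 and 0x94 are control codes in ISO-8859-1 but curly quotes in Windows-1252.
raw = b"\x93quoted\x94"

as_latin1 = raw.decode("latin-1")   # yields U+0093 ... U+0094 (invisible controls)
as_cp1252 = raw.decode("cp1252")    # yields U+201C ... U+201D (curly quotes)


def looks_like_cp1252(data: bytes) -> bool:
    """Hypothetical heuristic: bytes in 0x80-0x9F are rare in genuine
    Latin-1 text, so their presence suggests Windows-1252."""
    return any(0x80 <= b <= 0x9F for b in data)
```

A file that claims to be Latin-1 but trips this check is a good candidate for a second look before it goes into an archive.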

Earth’s official authorities refused to grant the Klingons a place in Unicode for their characters; nonetheless, there is an unofficial registry that uses part of the Unicode Private Use Area for Klingon and other constructed scripts. Is it official Unicode? No. If you use code points F8D0-F8FF, will others recognize them as Klingon characters? Sometimes.
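As a small illustration of that “sometimes,” here’s a Python sketch that flags characters in the F8D0-F8FF range. Python’s unicodedata module can confirm that a code point is private use, but it can’t say what anyone intended by it. The function names are hypothetical, and treating the range as Klingon is an assumption based on the unofficial registry, not anything Unicode guarantees.

```python
import unicodedata

# Code points F8D0-F8FF inclusive, the range the unofficial registry
# assigns to Klingon. This is a convention, not part of Unicode.
KLINGON_RANGE = range(0xF8D0, 0xF900)


def is_private_use(ch: str) -> bool:
    """True if Unicode classifies the character as private use (category Co)."""
    return unicodedata.category(ch) == "Co"


def maybe_klingon(ch: str) -> bool:
    """True if the character falls in the range conventionally used for Klingon."""
    return ord(ch) in KLINGON_RANGE


def find_possible_klingon(text: str):
    """Return (index, code point label) pairs for characters worth a closer look."""
    return [(i, f"U+{ord(c):04X}") for i, c in enumerate(text) if maybe_klingon(c)]
```

All the code can report is “private use, in the conventional range.” Whether the text really is Klingon depends on context the file itself doesn’t carry.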

I’ve written about the TIFF situation before. The TIFF 6.0 spec is an insufficient guide to today’s real-life TIFF. You have to go through scattered tech notes to understand how it’s really used.

Understanding situations like these requires understanding that formats don’t flow unchanged from the minds of their designers to their implementation in the world’s computers. People change things to meet their needs. This makes them more useful for some purposes; at the same time, it makes them more confusing. The only alternative would be to create a format police force with the power to arrest and punish innovators.

The situation is analogous to natural language. You can insist that anything that disagrees with the grammar books is wrong, but if everybody talks that way, there ain’t no stoppin’ it. At the same time, the grammar books put a brake on unnecessary change, keeping the language from breaking down into a thousand mutually unintelligible dialects.

Digital preservationists have to look at the actual usage of formats, not just their official specifications. This doesn’t mean that they should accept every deviation, but they need to acknowledge changes that have become de facto standards. Context matters; an archive of nineteenth-century literature doesn’t have to be concerned with Klingon characters, but an archive of science fiction fan literature had better take them into account. Even an occasional scholarly paper might have a word or two in the pIqaD script.

This proliferation of variants is a big part of why centralized registries of format information don’t work. Not only is there too much information, it keeps changing. The best we can hope for is a coordinated way of finding our way through a chaotic body of information.

JHOVE 1.8

I hadn’t heard any bug reports since 1.8 beta, which hopefully means it’s working smoothly for everyone, so I’ve now released JHOVE 1.8. Let me know ASAP if anything’s broken.

Release notes:

GENERAL

1. If JHOVE doesn’t find a configuration file, it creates a default one.

2. Generics widely added to clean up the code.

3. build.xml files fixed to force compilation to Java 1.5.

4. Shell script “jhove” no longer makes you figure out where JAVA_HOME is.

PDF MODULE

1. Several errors in checking for PDF/A compliance were corrected. Aside from fixing some outright bugs, the Contents key for non-text Annotations is no longer checked, as its presence is only recommended and not required.

2. Improved code by Hökan Svenson is now used for finding the trailer.

TIFF MODULE

1. TIFF tag 700 (XMP) now accepts field type 7 (UNDEFINED) as well as 1 (BYTE), on the basis of Adobe’s XMP spec, part 3.

2. If compression scheme 6 is used in a file, an InfoMessage will report that the file uses deprecated compression.
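For anyone curious what these two TIFF checks amount to, here’s a rough Python sketch of the idea. It is not JHOVE’s Java code. The function name is hypothetical, and it assumes a classic (non-BigTIFF) file, looks only at the first IFD, and assumes the Compression tag (259) is stored inline as a single SHORT.

```python
import struct

XMP_TAG = 700
COMPRESSION_TAG = 259
ACCEPTED_XMP_TYPES = {1, 7}      # 1 = BYTE, 7 = UNDEFINED
DEPRECATED_COMPRESSION = 6       # old-style JPEG


def check_first_ifd(data: bytes):
    """Return a list of messages about tag 700 and compression in the first IFD."""
    order = {b"II": "<", b"MM": ">"}.get(data[:2])
    if order is None:
        raise ValueError("missing TIFF byte-order mark")
    magic, ifd_offset = struct.unpack(order + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF file")

    (entry_count,) = struct.unpack_from(order + "H", data, ifd_offset)
    messages = []
    for i in range(entry_count):
        # Each IFD entry is 12 bytes: tag, field type, count, value/offset.
        entry = ifd_offset + 2 + 12 * i
        tag, field_type, _count = struct.unpack_from(order + "HHI", data, entry)
        if tag == XMP_TAG and field_type not in ACCEPTED_XMP_TYPES:
            messages.append(f"error: tag 700 has unexpected field type {field_type}")
        elif tag == COMPRESSION_TAG:
            # Assumes a single SHORT, left-justified in the value field.
            (scheme,) = struct.unpack_from(order + "H", data, entry + 8)
            if scheme == DEPRECATED_COMPRESSION:
                messages.append("info: file uses deprecated compression scheme 6")
    return messages
```

The point is that both changes are small decisions about what to accept and what merely to flag, which is where real-world usage and the written spec part ways.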

WAVE MODULE

1. The Originator Reference property, found in the Broadcast Wave Extension (BEXT) chunk, is now reported.

Et tu, WordPress?

Market researchers for websites don’t research the market any more. They research the competition in order to figure out how to imitate it. LiveJournal is making its site more Facebook-like in spite of nearly unanimous disapproval in the comments. WordPress has been infected with the same desire to imitate the “social media” sites. It’s implemented “infinite scrolling” of blog pages, making footers impossible and the question of whether something is on the current page ambiguous. If you try to get to the WordPress home page (www.wordpress.com), you can’t; you’re automatically redirected to your statistics page.

If I’m checking my spam list and happen to bring the mouse cursor over the icon of one of the spammers, a preview of the spammer’s page pops up. Because it’s “cool,” not because anyone in the world wants it to.

Is it only a matter of time before WordPress starts delivering RSS feeds out of sequence and arbitrarily leaving out some posts? I hope not.

“Just solve the problem” month begins

Today is the start of a month which some digital preservationists have declared “Just Solve the Problem” month. I’ve already expressed a mixture of skepticism and hope for this; throwing resources pell-mell at a computer problem rarely works, but some good is bound to come of the effort. We will not come out of November with “the problem” solved, but there will be new resources, such as this page of links to format information. (This blog is included in the list.)

I’m working on a list of plain text formats, expanding on my earlier post on the subject. This will appear on garymcgath.com, hopefully within the next week. Also, I’ve started a page on the wiki on tools, with a relevant subset of the list on my own site, restricted to locally runnable applications.

Between this and the CURATEcamp hackathon on November 16, lots of interesting stuff is happening in preservation this month.

Format registry browser on GitHub

I’ve put the format-reg-browser project up on GitHub, in case anyone wants to play with the code. This is the first time I’ve committed code to any kind of Git site, but it looks as if the code’s really there. Let me know if there are any problems.

JHOVE 1.8b2

Oops… The Java 7 compiler on Ubuntu won’t build backwards-compatible classes, so JHOVE 1.8b1 wouldn’t run on earlier versions of Java. JHOVE 1.8b2 should fix the problem.
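A quick way to spot this kind of problem is to look at the version number stamped into a compiled class file. Every Java class file starts with the magic number 0xCAFEBABE, followed by a minor and a major version; major version 49 corresponds to Java 5, 50 to Java 6, and 51 to Java 7. Here’s a minimal Python sketch of that check. The function name and lookup table are my own for illustration and aren’t part of JHOVE.

```python
import struct

# Class-file major versions and the Java release that introduced them.
MAJOR_TO_JAVA = {49: "5", 50: "6", 51: "7"}


def class_file_major_version(data: bytes) -> int:
    """Return the major version from the first 8 bytes of a .class file."""
    if len(data) < 8:
        raise ValueError("too short to be a class file")
    # Big-endian: 4-byte magic, 2-byte minor version, 2-byte major version.
    magic, _minor, major = struct.unpack(">IHH", data[:8])
    if magic != 0xCAFEBABE:
        raise ValueError("not a Java class file")
    return major
```

Reading the first eight bytes of any .class file in the build and passing them to this function shows which Java version the classes were actually compiled for.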

“Just solve the problem”

Running concurrently with National Novel Writing Month (aka NaNoWriMo) is “Just Solve the Problem,” an effort to get lots of people to attack the “formats problem” for 30 days.

Here’s “the problem,” slightly expurgated to avoid triggering nannyware:

In the last couple centuries, we’ve created a number of self-encapsulated data sets, or “files”. Be they letters, programs, tapes, stamped foil, piano rolls, you name it. And while many of those data sets are self-evident, a ****-ton are not. They’re obscure. They’re weird. And worst of all, many of them are the vital link to scores of historical information.

First thought: That’s not a statement of a solvable problem. It’s a statement of a situation which gives rise to many different problems. Still, throwing in some of my efforts can lead to professional contacts and maybe even a paying contract, and it’s the kind of thing I’d be doing anyway, so I’ve signed up for the wiki.

Extra points to anyone who can write a novel about the formats problem in 30 days.

Online file ID hackathon

CURATEcamp and Open Planets Foundation will hold a 24-hour (possibly more, due to time zones) online hackathon on file identification on Friday, November 16. The announcement says:

24hour+ live hackathon event where multi-time zone teams work on common technical projects related to the CURATEcamp iPres 2012 file id discussions.

Project proposals can be made by anyone.

We will start the day with New Zealand (GMT +12:00) and end with North America West Coast wrapping up project(s), hopefully with one or two solid deliverables by 12 midnight-ish PST (GMT -8:00).

JHOVE 1.8 beta

A beta version of JHOVE 1.8 is now available for testing. Please report any problems. New stuff:

  • If JHOVE doesn’t find a configuration file, it creates a default one.
  • Generics widely added to clean up the code.
  • Several errors in checking for PDF/A compliance were corrected. Aside from
    fixing some outright bugs, the Contents key for non-text Annotations is
    no longer checked, as its presence is only recommended and not required.
  • Improved code by Hökan Svenson is now used for finding the trailer.
  • TIFF tag 700 (XMP) now accepts field type 7 (UNDEFINED) as well as 1
    (BYTE), on the basis of Adobe’s XMP spec, part 3.
  • If compression scheme 6 is used in a file, an InfoMessage will report
    that the file uses deprecated compression.
  • In WAVE files the Originator Reference property, found in the Broadcast Wave Extension
    (BEXT) chunk, is now reported.