Monthly Archives: February 2013

Getting JHOVE2 to build

There’s a private beta, which should soon be public, of a digital preservation area on StackExchange.com. I took advantage of my invitation to ask about something that had stalled me a while ago when I tried to download and build JHOVE2. A quick reply told me that the needed change is simple, just one line in the pom.xml file. I can’t link to my question and the answer on Stack Exchange, since a login is required to view it, but it turns out this issue had already been brought up in a JHOVE2 ticket. The discussion indicates some confusion about whether the issue has been fixed in the main JHOVE2 repository, but Sheila Morrissey has a fork on Bitbucket with the fix.

The fix is to change the URL for “JBoss Repository” in pom.xml to the following:

<url>https://repository.jboss.org/nexus/content/repositories/thirdparty-releases/</url>

Kevin Clarke, who provided the answer, recommends building with the following command line to avoid error messages in the tests:

mvn -DskipTests=true install
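For orientation, the URL change goes inside the repository declaration in pom.xml. The fragment below is only a sketch: the `id` and `name` values are guesses and may differ in the actual file; the `<url>` line is the confirmed fix.

```xml
<!-- Sketch only: id and name are illustrative; the <url> line is the fix -->
<repository>
  <id>jboss</id>
  <name>JBoss Repository</name>
  <url>https://repository.jboss.org/nexus/content/repositories/thirdparty-releases/</url>
</repository>
```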

Reaching out from L-space, part 2

(This is a continuation of Reaching out from L-Space.)

Let’s look more specifically at digital preservation. This is something that should be of interest to everyone, since we all have files that we want to keep around for a long time, such as photographs. Even so, it doesn’t get wide notice as an area of study outside libraries and archives. All the existing books about it are expensive academic volumes for specialists.

Efforts are being made. The Library of Congress has digitalpreservation.gov, which has a lot of information for the ordinary user. There’s the Personal Digital Archiving Conference, which is coming up shortly.

At PDA 2012, Mike Ashenfelder said in the keynote speech:

Today in 2012, most of the world’s leading cultural institutions are engaged in digital preservation of some sort, and we’re doing quite well after a decade. We have any number of meetings throughout the year — the ECDL, the JCDL, iPres, this — but despite this decade of institutional progress, we’ve neglected the general public, and that’s everybody.

Why hasn’t there been more of an effect from these efforts? One reason may be that they’re pitched at the wrong level, either too high or too low. Technical resources often aren’t user-friendly and are useful only to specialists. The Library of Congress’s efforts are aimed largely at end users, and the material is sometimes very basic and repetitive. A big issue is picking the right level at which to talk. We need to engage non-library techies and not just stay inside L-space.

Let’s narrow the focus again and look at JHOVE. It’s a software tool that was developed at Harvard; the design was Stephen Abrams’, and I wrote most of the code. It identifies file formats, validates files, and extracts metadata. Its validation is strictly by the specification. Its error messages are often mysterious, and it doesn’t generally take into account the reality of what kinds of files are accepted. Postel’s law says, “Be conservative in what you do; be liberal in what you accept from others”; but JHOVE doesn’t follow this. As a validation tool, it does need to be on the conservative side, but it may go a bit too far.

JHOVE is useful for preservation specialists, but not so much for the general user. I haven’t tried to change its purpose; it has its user base and they know what to expect of it. There should also be tools, though, for a more general user base.

JHOVE leads to the issue of open source in general. As library software developers, we should be using and creating open-source code. We need to get input from users on what we’re doing. Bram van der Werf wrote on the Open Planets Foundation blog:

You will read in most digital preservation survey reports that these same tools are not meeting the needs of the community. At conferences, you will hear complaints about the performance of the tools. BUT, most strikingly, when visiting the sites where these tools are downloadable for free, you will see no signs of an active user community reporting bugs and submitting feature requests. The forums are silent. The open source code is sometimes absent and there are neither community building approaches nor procedures in place for committing code to the open source project.

Creating a community where communication happens is a challenge. Users are shy about making requests and reporting bugs. I don’t have a lot of good answers here. With JHOVE, I’ve had limited success. There was an active community for a while; users not only reported bugs but often submitted working code that I just had to test and incorporate into the release. Now there’s less of that, perhaps because JHOVE has been around for a long time. An open source community requires proactive engagement; you can’t just create a project and expect input. Large projects like Mozilla manage to get a community; for smaller niche projects it’s harder.

Actually, the term “project” is a mistake if you think of it as getting a grant, creating some software, and being done with it. Community involvement needs to be ongoing. Some projects have come out of the development process with functioning code and then immediately died for lack of a community.

Let’s consider format repositories now. An important issue in preservation is figuring out the formats of mysterious files. Repositories with information about lots of different formats are a valuable tool for doing this. The most successful of these is PRONOM, from the UK National Archives. It has a lot of valuable information but also significant holes; the job is too big for one institution to keep up with.

To address this difficulty, there was a project called GDFR — the Global Digital Format Repository. Its idea was that there would be mirrored peer repositories at multiple institutions. This was undertaken by Harvard and OCLC. It never came to a successful finish; it was a very complex design, and there were some communication issues between OCLC and Harvard developers (including me).

A subsequent effort was UDFR, the Unified Digital Format Repository. This eliminated the complications of the mirrored design and delivered a functional website. It’s not a very useful site, though, because there isn’t a lot of format information on it. It wasn’t able to develop the critically necessary community.

A different approach was a project called “Just Solve the Problem.” Rather than developing new software, it uses a wiki. It started with a one-month crowdsourced effort to put together information on as many formats as possible, with pointers to detailed technical information on other sites rather than trying to include it all in the repository. It’s hard to say for sure yet, but this may prove to be a more effective way to create a viable repository.

The basic point here is that preservation outreach needs to be at people’s own level. So what am I doing about it? Well, I have an e-book coming out in April, called Files that Last. It’s aimed at “everygeek”; it assumes more than casual computer knowledge, but not specialization on the reader’s part. It addresses the issues with a focus on practical use. So much for my book plug.

To recap: L-space is a subspace of “Worldspace,” and we need to reach out from it. We need to engage, and engage in, user communities. Library software developers need to reach a broad range of people. We need to start by understanding the knowledge they already have and address them at their level, in their language. We have to help them do things their way, but better.

Reaching out from L-Space

(This article is based on a presentation I made at Dartmouth’s Baker Library on February 7. I’m working from the outline rather than a transcript and have made some changes for the written medium. It’s split into two parts because of its length.)

Terry Pratchett wrote in Guards! Guards!:

It seemed quite logical to the Librarian that, since there were aisles where the shelves were on the outside then there should be other aisles in the space between the books themselves, created out of quantum ripples by the sheer weight of words. There were certainly some odd sounds coming from the other side of some shelving, and the Librarian knew that if he gently pulled out a book or two he would be peeking into different libraries under different skies.

All libraries everywhere are connected in L-space. All libraries. Everywhere.

Right now we’re in the L-space connection between developers and librarians, and the one between librarians and developers on the one hand and students and faculty on the other. L-Space can be a trap, though. If we stay inside it so much that we only talk to each other, we’re missing the whole point of the library’s existence. Pratchett’s Librarian falls a bit short on communication skills, since he’s an orangutan; then again, so do a lot of programmers. Maybe that’s why they call us code monkeys.

The issue of talking tech to non-techies isn’t just for programmers. Librarians are immersed in tech jargon these days: OPACs, MARC records, the OAIS model, etc. Communication levels aren’t just a binary issue. There’s a saying: “There are 10 kinds of people: those who understand binary and those who don’t.” It’s easy to split the world into “us” and “everyone else.” We all have our own sets of assumptions, which we may not realize are there. “Everyone knows” certain things, and those who don’t must be “hopelessly ignorant.” Everyone but the ignorant knows the difference between an application and a file format, Java and JavaScript, what happens in the browser and what happens in the server. It’s easy for any in-group to think of the rest of the world as just outsiders, and for programmers to think of everyone else as computer-illiterate.

However, all people have their own specialties and knowledge. Faculty clearly have their specialties. Students are more comfortable with some kinds of tech, like mobile devices, than many of us are. A good friend of mine is a grocery clerk, and she can teach me things about product codes and scanners. It’s a deadly error to assume that people are too dumb to grasp the benefits of something. This assumption can be harder to work past than actual user ignorance.

For example: I live in a condominium, which is very well-managed on the whole. At one owners’ meeting, though, I pointed out a problem with the PDF newsletters that were being sent by email. They’re sent as scanned images, not as text PDFs, which means they aren’t searchable and people with vision problems can’t take advantage of technologies such as text-to-speech. One of the board members told me I was entirely right, but the owners just weren’t capable of understanding such issues, so it wasn’t worth doing anything. He said this in front of the owners!

People are generally better at solving practical problems than at abstract reasoning. We evolved to survive, not to fit any specific paradigm of knowledge. People understand what they need to understand.

Successful communication happens when the message received equals the message sent. It requires that the parties have a common language, and it can happen only when they share an area of understanding.

Developers need to understand their audience. “Non-programmer” doesn’t mean “non-computer-literate.” Communication needs to be in terms which relate to the audience’s purpose. This comes in two levels for library developers: Talking to library people in library terms, and talking to library users in the terms in which they use the library. We need the help of library people when doing the second.

We’re dealing with a knowledgeable audience: students and faculty. They understand the Internet on a user level. They know how to look for books, even if they do it mostly on Amazon. Students in particular understand mobile devices. Talking below their level is as bad as going over their heads. We need to know what their world is, and we need to address its needs. We need to make the library fit the users’ world.

We have to offer something that’s worth trying out and make it easy to understand. It has to offer something they don’t already have. There’s a saying: “The Internet is the world’s largest library, with all the books on the floor.” The users should get the sense not just that the books are on shelves, but that they control the shelving, that they can organize information the way they need it.

On the whole and on average, users think less analytically than programmers. They don’t see all the consequences of a proposed fix. For instance: Users may complain about having to log back into a system too frequently. The obvious fix for them is to increase session length and time out less often, but they may not think of the loss of security that results, especially on public computers.

Users like DWIM systems — ones that “do what I mean.” These have to guess what the user means. When they guess right, it’s great, but it’s really annoying when they guess wrong. If you’ve ever had a search engine rewrite your search, you know what I mean. Try searching for “droid file tool,” looking for results about the UK National Archives’ file-identification tool called Droid. On Google, you’ll get a bunch of results for “Android.” That’s not the Droid you’re looking for.

Developers need to explain the consequences of a design choice, that getting X implies also getting Y. Figuring out what will really meet the users’ needs, as opposed to what they initially say they want, can be a challenge.

Again, two paths through L-space are needed here. Librarians need to talk the users’ language, and programmers need to talk the librarians’ and the users’ language. Librarians need to assist us in talking the users’ language.

(Continued in part 2)

Future paths for JHOVE

With the next SPRUCE Hackathon coming up, I’m thinking of possible ways to improve JHOVE that I might present there. The home page says, “This hackathon will therefore focus on unifying our community’s approach to characterisation by coordinating existing toolsets and improving their capabilities.” So aside from the general goal of improving JHOVE, coordination is a key point.

I’d posted earlier on some possible enhancements. These are all still possibilities. The focus on coordination brings up other things that could be done. In general, the API hasn’t been given as much thought as the command line interface, and it could be improved without a huge amount of effort. Here are a few thoughts:

  • The API currently requires creating an output stream, such as an XML or text file. It should be possible to call JHOVE and get back an in-memory object. The RepInfo object already serves this purpose; it’s mostly a matter of writing a new method that returns it instead of writing a stream.
  • The caller has the choice of running one module or all the modules in the configuration file and can’t change their order. It might improve efficiency if the caller could specify a list indicating the modules to try and the order in which they should be applied. For instance, a caller might use DROID to get the signature and use this information to pick the module that JHOVE should run first.
  • There’s currently no provision for selecting which output items to generate, except for a few ad hoc options. Would a way to suppress unwanted items be helpful?
  • Would any additional output handlers, such as JSON, be useful?
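To make the first two ideas concrete, here’s a rough sketch of the shape such an API might take. This is not JHOVE code: the `RepInfo` class exists in JHOVE, but the toy version here, along with the `Module` interface, the `characterize` method, and the `moduleOrder` parameter, are all illustrative assumptions, not the actual classes or signatures.

```java
import java.util.List;

// Hypothetical stand-ins only; JHOVE's real RepInfo and module
// classes have different fields and signatures.
class RepInfo {
    String format;
    boolean wellFormed;
}

interface Module {
    String name();
    // Returns a RepInfo if the module recognizes the file, null otherwise.
    RepInfo parse(String path);
}

class Characterizer {
    // Proposed API shape: run modules in a caller-supplied order and
    // return the first successful RepInfo as an in-memory object,
    // instead of writing an XML or text output stream.
    static RepInfo characterize(String path, List<Module> moduleOrder) {
        for (Module m : moduleOrder) {
            RepInfo info = m.parse(path);
            if (info != null) {
                return info;
            }
        }
        return null; // no module recognized the file
    }
}
```

A caller could then, for instance, use DROID’s signature match to put the most likely module first in `moduleOrder`, skipping the full trial-and-error pass.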

I’d welcome any thoughts on which of these, or what other changes, would help JHOVE to coordinate with other applications.