
The future of WebM

Yesterday I posted about the WebP still image format, expressing some skepticism about how easily it will catch on. Its companion format for video, WebM, may stand a better chance, though. Images aren’t exciting any more; JPEG delivers photographs well enough, PNG does the same for line art, and there isn’t a compelling reason to change. Video is still in flux, though, and the high bandwidth requirements mean there’s a payoff for any improvements in compression and throughput. The long-running battle among HTML5 stakeholders over video shows that it’s far from being a settled area. Patents are a big issue; if you implement H.264, you have to pay money. Alternatives are attractive from both a technological and an economic standpoint.

With Google pushing WebM and owning YouTube, browser developers have a clear reason to support it. YouTube plans to use the new WebM codec, VP9, once it’s complete. I haven’t seen details of the plan, but most likely YouTube will make the same video available in multiple formats and query the browser’s capabilities to determine whether it can accept VP9. If the advantage is real and users who can get it see fewer pauses in their videos, more browser makers will undoubtedly jump on the bandwagon.

An eye on WebP

Google has been promoting the WebP still image format for some time, and lately Facebook has started supporting it. It’s hard to displace the well-entrenched JPEG, but it could happen. WebP supports both lossy and lossless compression, and Google claims a significant compression advantage over both PNG and JPEG. Google says the format is free of patent restrictions; the container is the familiar RIFF. The VP8 lossy format is documented in an IETF RFC; a specification for the lossless format is also available.

The container spec supports XMP and Exif metadata. Canvas width and height can each be as much as 16,777,216 (2^24) pixels, though their product is limited to 4,294,967,296 (2^32) pixels. As far as I can tell it doesn’t support tiling, though, so partial rendering of huge images in the style of JPEG 2000 may not be practical.
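To make those limits concrete, here’s a minimal Python sketch of reading the canvas size from the extended (VP8X) container. It’s illustrative only: it skips simple lossy and lossless files, which keep their dimensions inside the VP8 or VP8L bitstream instead.

import struct

def webp_canvas_size(path):
    """Return (width, height) from a WebP file's VP8X chunk.

    Minimal sketch: handles only the extended container, not the
    simple VP8/VP8L case.
    """
    with open(path, "rb") as f:
        header = f.read(12)
        if header[:4] != b"RIFF" or header[8:12] != b"WEBP":
            raise ValueError("not a WebP file")
        fourcc = f.read(4)
        size = struct.unpack("<I", f.read(4))[0]
        if fourcc != b"VP8X":
            raise ValueError("simple WebP; no VP8X chunk present")
        payload = f.read(size)
        # Bytes 4-6 and 7-9 hold (width - 1) and (height - 1)
        # as 24-bit little-endian values.
        width = int.from_bytes(payload[4:7], "little") + 1
        height = int.from_bytes(payload[7:10], "little") + 1
        # Each dimension is capped at 2**24; the product at 2**32.
        if width * height > 2**32:
            raise ValueError("canvas exceeds the 2**32-pixel limit")
        return width, height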

Chrome, Opera, and Android’s Ice Cream Sandwich offer WebP support, but not many other browsers do. Facebook’s serving of WebP images has drawn complaints from users whose browsers can’t read the format. The Firefox development team is starting to warm to the format but hasn’t committed to anything yet. Internet Explorer hasn’t even reached that point.

It’s still too early to place bets, but WebP increasingly bears watching. I’ve started a page of updates and errata for Files that Last with some updated information on WebP. (When I wrote the book, I couldn’t find the lossless spec.)

Patent application strikes at digital archiving

Someone called Henry Gladney has filed a US patent application which could be used to troll digital archiving operations in an attempt to force them to pay money for what they’ve been doing all along. The patent is more readable than many I’ve seen, and it’s simply a composite of existing standard practices such as schema-based XML, digital authentication, public key authentication, and globally unique identifiers. The application openly states that its PIP (Preservation Information Package) “is also an Archival Information Package as described within the forthcoming ISO OAIS standard.”

I won’t say this is unpatentable; all kinds of absurd software patents have been granted. As far as I’m concerned, software patents are inherently absurd; every piece of software is a new invention, each one builds on techniques used in previously written software, and the pace at which this happens makes a patent’s lifetime of fourteen to twenty years an eternity. If the first person to use any software technique were consistently deemed to own it and others were required to get permission to reuse it, we’d never have ventured outside the caves of assembly language. That’s not the view Congress takes, though.

Patent law does say, though, that you can’t patent something that’s already been done; the term is “prior art.” I can’t see anything in the application that’s new beyond the specific implementation. If it’s only that implementation which is patented, then archivists can and will simply use a different structure and not have to pay patent fees. If the application is granted and is used to get money out of anyone who creates archiving packages, there will be some nasty legal battles ahead, further demonstrating how counterproductive the software patent system is.

Update: There’s discussion on LinkedIn. Registration is required to comment, but not to just read.

Getting JHOVE2 to build

There’s a private beta, which should soon be public, of a digital preservation area on StackExchange.com. I took advantage of my invitation to ask about something that had stalled me a while ago when I tried to download and build JHOVE2. A quick reply told me that the needed change is simple: just one line in the pom.xml file. I can’t link to my question and the answer on Stack Exchange, since a login is required to view them, but it turns out this issue had already been brought up in a JHOVE2 ticket. The discussion indicates some confusion about whether the issue has been fixed in the main JHOVE2 repository, but Sheila Morrissey has a fork on Bitbucket with the fix.

The fix is to change the URL for “JBoss Repository” in pom.xml to the following:

<url>https://repository.jboss.org/nexus/content/repositories/thirdparty-releases/</url>
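In context, the repository entry in pom.xml ends up looking roughly like this; the id shown here is a placeholder, and only the url element actually changes:

<repository>
  <id>jboss-thirdparty-releases</id>
  <name>JBoss Repository</name>
  <url>https://repository.jboss.org/nexus/content/repositories/thirdparty-releases/</url>
</repository>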

Kevin Clarke, who provided the answer, recommends building with the following command line to avoid error messages in the tests:

mvn -DskipTests=true install

Reaching out from L-space, part 2

(This is a continuation of Reaching out from L-Space.)

Let’s look more specifically at digital preservation. This is something that should be of interest to everyone, since we all have files that we want to keep around for a long time, such as photographs. Even so, it doesn’t get wide notice as an area of study outside libraries and archives. All the existing books about it are expensive academic volumes for specialists.

Efforts are being made. The Library of Congress has digitalpreservation.gov, which has a lot of information for the ordinary user. There’s the Personal Digital Archiving Conference, which is coming up shortly.

At PDA 2012, Mike Ashenfelder said in the keynote speech:

Today in 2012, most of the world’s leading cultural institutions are engaged in digital preservation of some sort, and we’re doing quite well after a decade. We have any number of meetings throughout the year — the ECDL, the JCDL, iPres, this — but despite this decade of institutional progress, we’ve neglected the general public, and that’s everybody.

Why hasn’t there been more of an effect from these efforts? One reason may be that they’re pitched at the wrong level, either too high or too low. Technical resources often aren’t user-friendly and are useful only to specialists. The Library of Congress’s efforts are aimed largely at end users, but the material is sometimes very basic and repetitive. A big issue is picking the right level to aim at. We need to engage non-library techies and not just stay inside L-space.

Let’s narrow the focus again and look at JHOVE. It’s a software tool that was developed at Harvard; the design was Stephen Abrams’, and I wrote most of the code. It identifies file formats, validates files, and extracts metadata. Its validation is strictly by the specification. Its error messages are often mysterious, and it doesn’t generally take into account the reality of what kinds of files are accepted. Postel’s law says, “Be conservative in what you do; be liberal in what you accept from others”; but JHOVE doesn’t follow this. As a validation tool, it does need to be on the conservative side, but it may go a bit too far.
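For anyone who hasn’t run it, a typical JHOVE 1.x command line looks something like this, assuming the standard configuration with the PDF module and the XML output handler; the file name is just an example:

jhove -m PDF-hul -h XML -o report.xml document.pdf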

JHOVE is useful for preservation specialists, but not so much for the general user. I haven’t tried to change its purpose; it has its user base, and they know what to expect of it. There should also be tools, though, for a more general audience.

JHOVE leads into the issue of open source in general. As library software developers, we should be using and creating open-source code. We need to get input from users on what we’re doing. Bram van der Werf wrote on the Open Planets Foundation blog:

You will read in most digital preservation survey reports that these same tools are not meeting the needs of the community. At conferences, you will hear complaints about the performance of the tools. BUT, most strikingly, when visiting the sites where these tools are downloadable for free, you will see no signs of an active user community reporting bugs and submitting feature requests. The forums are silent. The open source code is sometimes absent and there are neither community building approaches nor procedures in place for committing code to the open source project.

Creating a community where communication happens is a challenge. Users are shy about making requests and reporting bugs. I don’t have a lot of good answers here. With JHOVE, I’ve had limited success. There was an active community for a while; users not only reported bugs but often submitted working code that I just had to test and incorporate into the release. Now there’s less of that, perhaps because JHOVE has been around for a long time. An open source community requires proactive engagement; you can’t just create a project and expect input. Large projects like Mozilla manage to get a community; for smaller niche projects it’s harder.

Actually, the term “project” is a mistake if you think of it as getting a grant, creating some software, and being done with it. Community involvement needs to be ongoing. Some projects have come out of the development process with functioning code and then immediately died for lack of a community.

Let’s consider format repositories now. An important issue in preservation is figuring out the formats of mysterious files. Repositories with information about lots of different formats are a valuable tool for doing this. The most successful of these is PRONOM, from the UK National Archives. It has a lot of valuable information but also significant holes; the job is too big for one institution to keep up with.

To address this difficulty, there was a project called GDFR — the Global Digital Format Repository. Its idea was that there would be mirrored peer repositories at multiple institutions. This was undertaken by Harvard and OCLC. It never came to a successful finish; it was a very complex design, and there were some communication issues between OCLC and Harvard developers (including me).

A subsequent effort was UDFR, the Unified Digital Format Repository. This eliminated the complications of the mirrored design and delivered a functional website. It’s not a very useful site, though, because there isn’t a lot of format information on it. It wasn’t able to develop the critically necessary community.

A different approach was a project called “Just Solve the Problem.” Rather than developing new software, it uses a wiki. It started with a one-month crowdsourced effort to put together information on as many formats as possible, with pointers to detailed technical information on other sites rather than trying to include it all in the repository. It’s hard to say for sure yet, but this may prove to be a more effective way to create a viable repository.

The basic point here is that preservation outreach needs to meet people at their own level. So what am I doing about it? Well, I have an e-book coming out in April, called Files that Last. It’s aimed at “everygeek”; it assumes more than casual computer knowledge, but not specialization, on the reader’s part. It addresses the issues with a focus on practical use. But enough of the book plug.

To recap: L-space is a subspace of “Worldspace,” and we need to reach out to it. We need to engage, and engage in, user communities. Software developers for the library need to reach a broad range of people. We need to start by understanding the knowledge they already have and address them at their level, in their language. We have to help them do things their way, but better.

Reaching out from L-Space

(This article is based on a presentation I made at Dartmouth’s Baker Library on February 7. I’m working from the outline rather than a transcript and have made some changes for the written medium. It’s split into two parts because of its length.)

Terry Pratchett wrote in Guards! Guards!:

It seemed quite logical to the Librarian that, since there were aisles where the shelves were on the outside then there should be other aisles in the space between the books themselves, created out of quantum ripples by the sheer weight of words. There were certainly some odd sounds coming from the other side of some shelving, and the Librarian knew that if he gently pulled out a book or two he would be peeking into different libraries under different skies.

All libraries everywhere are connected in L-space. All libraries. Everywhere.

Right now we’re in the L-space connection between developers and librarians, and the one between librarians and developers on the one hand and students and faculty on the other. L-Space can be a trap, though. If we stay inside it so much that we only talk to each other, we’re missing the whole point of the library’s existence. Pratchett’s Librarian falls a bit short on communication skills, since he’s an orangutan; then again, so do a lot of programmers. Maybe that’s why they call us code monkeys.

The issue of talking tech to non-techies isn’t just for programmers. Librarians are immersed in tech jargon these days: OPACs, MARC records, the OAIS model, etc. Communication levels aren’t just a binary issue. There’s a saying: “There are 10 kinds of people: those who understand binary and those who don’t.” It’s easy to split the world into “us” and “everyone else.” We all have our own sets of assumptions, which we may not realize are there. “Everyone knows” certain things, and those who don’t must be “hopelessly ignorant.” Everyone but the ignorant knows the difference between an application and a file format, Java and JavaScript, what happens in the browser and what happens in the server. It’s easy for any in-group to think of the rest of the world as just outsiders, and for programmers to think of everyone else as computer-illiterate.

However, all people have their own specialties and knowledge. Faculty clearly have their specialties. Students are more comfortable with some kinds of tech, like mobile devices, than many of us are. A good friend of mine is a grocery clerk, and she can teach me things about product codes and scanners. It’s a deadly error to assume that people are too dumb to grasp the benefits of something. This assumption can be harder to work past than actual user ignorance.

For example: I live in a condominium, which is very well-managed on the whole. At one owners’ meeting, though, I pointed out a problem with the PDF newsletters that were being sent by email. They’re sent as scanned images, not as text PDFs, which means they aren’t searchable and people with vision problems can’t take advantage of technologies such as text-to-speech. One of the board members told me I was entirely right, but the owners just weren’t capable of understanding such issues, so it wasn’t worth doing anything. He said this in front of the owners!

People are generally better at solving practical problems than at abstract reasoning. We evolved to survive, not to fit any specific paradigm of knowledge. People understand what they need to understand.

Successful communication happens when the message received equals the message sent. It requires that the parties have a common language, and it can happen only when they share an area of understanding.

Developers need to understand their audience. “Non-programmer” doesn’t mean “non-computer-literate.” Communication needs to be in terms which relate to the audience’s purpose. This comes in two levels for library developers: Talking to library people in library terms, and talking to library users in the terms in which they use the library. We need the help of library people when doing the second.

We’re dealing with a knowledgeable audience: students and faculty. They understand the Internet on a user level. They know how to look for books, even if they do it mostly on Amazon. Students in particular understand mobile devices. Talking below their level is as bad as going over their heads. We need to know what their world is, and we need to address its needs. We need to make the library fit the users’ world.

We have to offer something that’s worth trying out and make it easy to understand. It has to offer something they don’t already have. There’s a saying: “The Internet is the world’s largest library, with all the books on the floor.” The users should get the sense not just that the books are on shelves, but that they control the shelving, that they can organize information the way they need it.

On the whole and on average, users think less analytically than programmers. They don’t see all the consequences of a proposed fix. For instance: Users may complain about having to log back into a system too frequently. The obvious fix for them is to increase session length and time out less often, but they may not think of the loss of security that results, especially on public computers.

Users like DWIM systems — ones that “do what I mean.” These have to guess what the user means. When they guess right, it’s great, but it’s really annoying when they guess wrong. If you’ve ever had a search engine rewrite your search, you know what I mean. Try searching for “droid file tool,” looking for results about the UK National Archives’ file-identification tool called Droid. On Google, you’ll get a bunch of results for “Android.” That’s not the Droid you’re looking for.

Developers need to explain the consequences of a design choice, that getting X implies also getting Y. Figuring out what will really meet the users’ needs, as opposed to what they initially say they want, can be a challenge.

Again, two paths through L-space are needed here. Librarians need to talk the users’ language, and programmers need to talk the librarians’ and the users’ language. Librarians need to assist us in talking the users’ language.

(Continued in part 2)

JHOVE statistics

Here are a few statistics on JHOVE, taken from SourceForge. The period I checked is from January 1, 2012, through January 29, 2013.

Total downloads, all files: 3,081
Downloads for Windows: 2,160
Linux: 350
Macintosh: 294

Top 5 countries:
United States: 831
Germany: 316
Spain: 235
France: 184
Canada: 129

Releases of JHOVE since I left Harvard: 2

Total income from JHOVE since I left Harvard: $12.70 (from sales of JHOVE Tips for Developers)

“Digital forensics”

Now and then I see talk about “digital forensics.” It’s never clear what it’s supposed to mean. “Forensic” means “belonging to, used in, or suitable to courts of judicature or to public discussion and debate.” In popular usage, it’s generally applied to criminal investigations, especially in the phrase “forensic medicine.”

Some activities could be called digital forensics, where digital methods help to resolve contentious issues. For instance, textual analysis might shed light on an author’s identity. Digital techniques can even solve crimes. Too often, though, the term is getting stretched beyond meaningfulness, to the point that routine curation practices are called “forensics.”

No doubt it feels glamorous to think of oneself as the CSI of libraries, but let’s not get carried away with buzzwords.

A preservation hazard in OpenOffice

While playing with OpenOffice in my research for Files that Last, I came across a preservation risk. I copied an image from a website and pasted it into a text document, then looked at the resulting XML. The image data wasn’t in content.xml or anywhere else in the ZIP package. Instead, I found this:


<draw:image
xlink:href="http://plan-b-for-openoffice.org/resources/images/x180x60_3_get.png.pagespeed.ic.fjV0teeVb_.png"
xlink:type="simple"
xlink:show="embed"
xlink:actuate="onLoad"/>

The source for the image is on the Web. This means that if the URL stops working, the document loses the image. That’s a poor plan for long-term storage.

The way to avoid this is to use Edit > Paste special and paste the image as a bitmap. It can be a pain to remember to do this. You may be able to catch images that are pasted by reference, since there can be a brief delay while just a box with the URL is displayed before the image comes up.
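By contrast, an image that is actually embedded gets stored inside the document’s ZIP package, under a Pictures/ directory, and is referenced with a package-relative href along these lines (the file name here is invented for illustration):

<draw:image
xlink:href="Pictures/10000000000000B4000000B4D41D8CD9.png"
xlink:type="simple"
xlink:show="embed"
xlink:actuate="onLoad"/>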

Sneaky little preservation hazards like this (and the earlier one mentioned with Adobe Illustrator files) are the kind of thing you’ll find when Files that Last comes out.

When is a PDF not a PDF?

Yesterday I was doing some experiments with Adobe Illustrator. According to some websites, the CS5 version saves its files as PDF, though with the extension .ai. When you save a file, though, the options dialog has a checkbox labeled “Create PDF Compatible File.” I unchecked it and saved the file, then opened it in JHOVE. JHOVE says it’s perfectly good PDF — indeed, PDF/A. Then I tried opening it in Preview, and this is what it looked like:

[Screenshot: the opened file displays, over and over, a notice that it was saved without PDF content.]

If you don’t actually look at the file but trust the mere fact that it’s a PDF, you might put it into a repository and find out later on that it’s worthless as a PDF. What’s happening is that PDF can embed any kind of content, and this one embeds its native PGF data. Any PDF reader can open the file, but only an application that understands PGF can use its actual content. Anyone putting PDF into a repository should be aware of this risk.

It’s outside the scope of JHOVE to check whether embedded content is acceptable to PDF/A, so the claim that it’s correct PDF/A is probably spurious. It is, however, definitely legal PDF.

This type of situation helps to show why PDF/A-3 is a bad idea.