(This is a continuation of Reaching out from L-Space.)
Let’s look more specifically at digital preservation. This is something that should be of interest to everyone, since we all have files that we want to keep around for a long time, such as photographs. Even so, it doesn’t get wide notice as an area of study outside libraries and archives. All the existing books about it are expensive academic volumes for specialists.
Efforts are being made. The Library of Congress has digitalpreservation.gov, which has a lot of information for the ordinary user. There’s the Personal Digital Archiving Conference, which is coming up shortly.
At PDA 2012, Mike Ashenfelder said in the keynote speech:
Today in 2012, most of the world’s leading cultural institutions are engaged in digital preservation of some sort, and we’re doing quite well after a decade. We have any number of meetings throughout the year — the ECDL, the JCDL, iPres, this — but despite this decade of institutional progress, we’ve neglected the general public, and that’s everybody.
Why hasn’t there been more of an effect from these efforts? One reason may be that they’re pitched at the wrong level, either too high or too low. Technical resources often aren’t user-friendly and are useful only to specialists. The Library of Congress’s efforts are aimed largely at end users, and they’re sometimes very basic and repetitive. A big issue is picking the right level to address. We need to engage non-library techies and not just stay inside L-space.
Let’s narrow the focus again and look at JHOVE. It’s a software tool that was developed at Harvard; the design was Stephen Abrams’, and I wrote most of the code. It identifies file formats, validates files, and extracts metadata. Its validation is strictly by the specification. Its error messages are often mysterious, and it doesn’t generally take into account the reality of what kinds of files are accepted. Postel’s law says, “Be conservative in what you do; be liberal in what you accept from others”; but JHOVE doesn’t follow this. As a validation tool, it does need to be on the conservative side, but it may go a bit too far.
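To make the strict-versus-liberal tradeoff concrete, here’s a minimal sketch in Python. It is not JHOVE’s code, and the function name and the constants are my own for illustration. It checks only the PDF header: the strict mode insists that “%PDF-” start at byte 0, as the file-structure rules describe, while the lenient mode assumes the commonly cited reader behavior of accepting the header anywhere in the first 1024 bytes.

```python
from typing import Tuple

PDF_MAGIC = b"%PDF-"
LENIENT_WINDOW = 1024  # assumption: how far a forgiving reader will search


def check_pdf_header(data: bytes, strict: bool = True) -> Tuple[bool, str]:
    """Return (accepted, message) for the PDF header check only."""
    # Look for the header inside the lenient window; -1 means not found.
    offset = data[:LENIENT_WINDOW].find(PDF_MAGIC)
    if offset == 0:
        return True, "header at offset 0"
    if offset > 0 and not strict:
        return True, f"header at offset {offset}; accepted leniently"
    if offset > 0:
        return False, f"header at offset {offset}; strict check requires offset 0"
    return False, "no PDF header in the first 1024 bytes"


if __name__ == "__main__":
    sample = b"stray bytes\n%PDF-1.4\n"
    print(check_pdf_header(sample, strict=True))
    print(check_pdf_header(sample, strict=False))
```

The same file fails one check and passes the other, and the message says why. A validator meant for general users could report both results instead of a bare “not well-formed.”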
JHOVE is useful for preservation specialists, but not so much for the general user. I haven’t tried to change its purpose; it has its user base and they know what to expect of it. There should also be tools, though, for a more general user base.
JHOVE leads to the issue of open source in general. As library software developers, we should be using and creating open-source code. We need to get input from users on what we’re doing. Bram de Werf wrote on the Open Planets Foundation blog:
You will read in most digital preservation survey reports that these same tools are not meeting the needs of the community. At conferences, you will hear complaints about the performance of the tools. BUT, most strikingly, when visiting the sites where these tools are downloadable for free, you will see no signs of an active user community reporting bugs and submitting feature requests. The forums are silent. The open source code is sometimes absent and there are neither community building approaches nor procedures in place for committing code to the open source project.
Creating a community where communication happens is a challenge. Users are shy about making requests and reporting bugs. I don’t have a lot of good answers here. With JHOVE, I’ve had limited success. There was an active community for a while; users not only reported bugs but often submitted working code that I just had to test and incorporate into the release. Now there’s less of that, perhaps because JHOVE has been around for a long time. An open source community requires proactive engagement; you can’t just create a project and expect input. Large projects like Mozilla manage to get a community; for smaller niche projects it’s harder.
Actually, the term “project” is a mistake if you think of it as getting a grant, creating some software, and being done with it. Community involvement needs to be ongoing. Some projects have come out of the development process with functioning code and then immediately died for lack of a community.
Let’s consider format repositories now. An important issue in preservation is figuring out the formats of mysterious files. Repositories with information about lots of different formats are a valuable tool for doing this. The most successful of these is PRONOM, from the UK National Archives. It has a lot of valuable information but also significant holes; the job is too big for one institution to keep up with.
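As a rough illustration of what format information buys you, here’s a minimal Python sketch that identifies a file by its leading bytes. The signature table is a tiny hand-written one of my own, not data from PRONOM or any other repository, and the function names are hypothetical; a real repository covers far more formats and far subtler cases.

```python
from typing import Optional

# A few well-known leading-byte signatures ("magic numbers").
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"\xff\xd8\xff", "JPEG image"),
    (b"GIF87a", "GIF image"),
    (b"GIF89a", "GIF image"),
    (b"%PDF-", "PDF document"),
    (b"PK\x03\x04", "ZIP container (also used by EPUB, DOCX, ODF)"),
]


def identify(data: bytes) -> Optional[str]:
    """Return a format name if the data starts with a known signature."""
    for magic, name in SIGNATURES:
        if data.startswith(magic):
            return name
    return None  # unknown: this is where a fuller repository is needed


def identify_file(path: str) -> Optional[str]:
    """Read only the first few bytes of a file and identify them."""
    with open(path, "rb") as f:
        return identify(f.read(16))
```

Every file this returns None for is a gap in the table, which is the same problem the repositories face on a much larger scale.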
To address this difficulty, there was a project called GDFR — the Global Digital Format Repository. Its idea was that there would be mirrored peer repositories at multiple institutions. This was undertaken by Harvard and OCLC. It never came to a successful finish; it was a very complex design, and there were some communication issues between OCLC and Harvard developers (including me).
A subsequent effort was UDFR, the Unified Digital Format Repository. This eliminated the complications of the mirrored design and delivered a functional website. It’s not a very useful site, though, because there isn’t a lot of format information on it. It wasn’t able to develop the critically necessary community.
A different approach was a project called “Just Solve the Problem.” Rather than developing new software, it uses a wiki. It started with a one-month crowdsourced effort to put together information on as many formats as possible, with pointers to detailed technical information on other sites rather than trying to include it all in the repository. It’s hard to say for sure yet, but this may prove to be a more effective way to create a viable repository.
The basic point here is that preservation outreach needs to be at people’s own level. So what am I doing about it? Well, I have an e-book coming out in April, called Files that Last. It’s aimed at “everygeek”; it assumes more than casual computer knowledge, but not specialization on the reader’s part. It addresses the issues with a focus on practical use. But so much for my book plug.
To recap: L-space is a subspace of “Worldspace,” and we need to reach out to it. We need to engage, and engage in, user communities. Software developers for the library need to reach a broad range of people. We need to start by understanding the knowledge they already have and address them at their level, in their language. We have to help them do things their way, but better.
“Digital forensics,” an overused term
Exciting terms get overused and worn down with time. I can remember when “awesome” meant magnificent, extraordinary, awe-inspiring. Today it’s barely stronger than “that’s nice.” Maybe it’s inevitable; people like to use words with a strong punch, even when they’re excessive.
“Digital forensics” is an example. Dictionaries say forensics is the study of issues in public discussion or debate. We usually think about it in connection with technical investigation of legal issues. Was a crime committed? If so, who did it and how? With so much of the world being computerized, people can legitimately use the term for a lot of digital activities, like identifying forgeries and attacks. I used the term for my own investigation of a defect in Honda’s MP3 players.
In the library and archiving world, though, some people are using it just because “data analysis” sounds awfully (there’s another word that’s been worn down) dull. In an interview on the Library of Congress’s digital preservation blog, Kam Woods says:
Occasionally that process does get involved with court cases and suspected misconduct, but he stretches its bounds:
When archivists do their jobs, it prevents controversies from arising in the first place. I’m not demeaning the work; it’s better to prevent uncertainty than to have to resolve it. But good record keeping isn’t forensics.
Sometimes the methods and aims of “digital forensics” and real forensics directly oppose each other. Woods points out that the former needs to avoid collecting sensitive personal information where it’s not appropriate. A real forensic investigation will often need personal data as a vital clue.
People will go on calling routine data analysis “forensics” regardless of anything I say here, but let’s not confuse it with the real thing.