Funding for preservation software development

The Open Preservation Foundation (formerly the Open Planets Foundation) is launching a new model for funding the development of preservation-related software. Quoting from the announcement:

‘Over the last year the OPF has established a solid foundation for ensuring the sustainability of digital preservation technology and knowledge,’ explains Dr. Ross King, Chair of the OPF Board. ‘Our new strategic plan was introduced in November 2014 along with community surveys to establish the current state of the art. We developed our annual plan in consultation with our members and added JHOVE to our growing software portfolio. The new membership and software supporter models are the next steps towards realising our vision and mission.’ …

The software supporter model allows organisations to support individual digital preservation software products and ensure their ongoing sustainability and maintenance. We are launching support for JHOVE based on its broad adoption and need for active stewardship. It is also a component in several leading commercial digital preservation solutions. While it remains fully open source, supporters can steer our stewardship and maintenance activities and receive varying levels of technical support and training.

I have a selfish personal interest in spreading the word. At the moment, I’m between contracts, and I wouldn’t mind getting some funding from OPF to resume development work on JHOVE. I know its code base better than anyone else, I worked on it without pay as a hobby for a year or so after leaving Harvard, and I’d enjoy working on it some more if I could just get some compensation. This is possible, but only if there’s support from outside.

US libraries have been rather insular in their approach to software development. They’ll use free software if it’s available, but they aren’t inclined to help fund it. If each of them set aside some money for this purpose, it would help ensure the continued creation and maintenance of the open-source software that’s important to their mission.

How about it, Harvard?

Dataliths vs. the digital dark age

Digital technology has allowed us to store more information at less cost than ever before. At the same time, it’s made this information very fragile in the long term. A book can sit in an abandoned building for centuries and still be readable. Writing carved in stone can last for thousands of years. The chances that your computer’s disk will be readable in a hundred years are poor, though. You’ll have to go to a museum for hardware and software to read it. Once you have all that, it probably won’t even spin up. If it does, the bits may be ruined. In five hundred years, its chance of readability will be essentially zero.

Archivists are aware of this, of course, and they emphasize the need for continual migration. Every couple of decades, at least, stored documents need to be moved to new media and perhaps updated to a new format. Digital copies, if made with reasonable precautions, are perfect. This approach means that documents can be preserved forever, provided the chain never breaks.
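Those “reasonable precautions” boil down to fixity checks: compute a digest of the original, make the copy, and confirm the copy’s digest matches before trusting the new medium. A minimal sketch in Java (the file paths are placeholders; a real archive records digests in its metadata rather than recomputing them ad hoc):

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;

// Sketch of a migration fixity check: the new copy is trusted only if its
// digest matches the original's. Paths here are placeholders.
public class FixityCheck {

    static String sha256(Path file) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        byte[] digest = md.digest(Files.readAllBytes(file));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        Path original = Paths.get("old-medium/document.pdf");
        Path copy = Paths.get("new-medium/document.pdf");
        if (sha256(original).equals(sha256(copy))) {
            System.out.println("Copy verified; safe to retire the old medium.");
        } else {
            System.out.println("Digest mismatch; re-copy before migrating.");
        }
    }
}
```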

Fortunately, there doesn’t have to be just one chain. The LOCKSS (Lots Of Copies Keep Stuff Safe) principle means that the same document can be stored in archives all over the world. As long as just one of them keeps propagating it, the document will survive.

Does this make us really safe from the prospect of a digital dark age? Will a substantial body of today’s knowledge and literature survive until humans evolve into something so different that it doesn’t matter any more? Not necessarily. To be really safe, information needs to be stored in a form that can survive long periods of neglect. We need dataliths.

Several scenarios could lead to neglect of electronic records for a generation or more. A global nuclear war could destroy major institutions, wreck electronic devices with EMPs, and force people to focus on staying alive. An asteroid hit or a supervolcano eruption could have a similar effect. Humanity might survive these things but take a century or more to return to a working technological society.

Less spectacularly, periods of intense international fear or attempts to manage the world economy might create an unfriendly climate for preserving records of the past. The world might go through a period of severe censorship. Lately religious barbarians have been sacking cities and destroying historical records that don’t fit with their doctrines. Barbarians generally burn themselves out quickly, but “enlightened” authorities can also decide that all “unenlightened” ideas should be banished for the good of us all. Prestigious institutions can be especially vulnerable to censorship because of their visibility and dependence on broad support. Even without legal prohibition, archival culture may shift to decide that some ideas aren’t worth preserving. Either way, it won’t be called censorship; it will be called “fair speech,” “fighting oppression,” “the right to be forgotten,” or some other euphemism that hasn’t yet lost credibility.

How great is the risk of these scenarios? Who can say? To calculate odds, you need repeatable causes, and the technological future will be a lot different from the comparatively low-tech past. But if we’re thinking on a span of thousands of years, we can’t dismiss it as negligible. Whatever may happen, the documents of the past are too valuable to be maintained only by their official guardians.

Hard copy will continue to be important. It’s also subject to most of the forms of loss I’ve mentioned, but some of it can survive for many years with no attention. As long as someone can understand the language it’s written in, or as long as its pictures remain recognizable, it has value. However, we can’t back away from digital storage and print everything we want to preserve. The advantages of bits are clear: easy reproduction and high storage density. This isn’t to say that archivists should abandon the strategy of storing documents with the best technology and migrating them regularly. In good times, that’s the most effective approach. But the bigger strategy should include insurance against the bad times, a form of storage that can survive neglect. Ideally it shouldn’t be in the hands of Harvard or the Library of Congress, but in those of many “guerrilla archivists” acting on their own.

This strategy requires a storage medium which is highly durable and relatively simple to read. It doesn’t have to push the highest edges of storage density. It should be the modern equivalent of the stone tablet, a datalith.

There are devices which tend in this direction. Millenniata claims to offer “forever storage” in its M-DISC. Allegedly it has been “proven to last 1,000 years,” though I wonder how they managed to start testing in the Middle Ages. A DVD uses a complicated format, though, so it may not be readable even if it physically lasts that long. Hitachi has been working on quartz glass data storage that could last for millions of years and be read with an optical microscope. This would be the real datalith. As long as some people still know today’s languages, pulling out ASCII data should be a very simple decoding task. Unfortunately, the medium isn’t commercially available yet. Others have worked on similar ideas, such as the Superman memory crystal. Ironically, that article, which proclaims “the first document which will likely survive the human race,” has a broken link to its primary source less than two years after its publication.

Hopefully datalith writers will be available before too long, and after a few years they won’t be outrageously expensive. The records they create will be an important part of the long-term preservation of knowledge.

New open-source file validation project

The veraPDF consortium has announced that it has begun the prototyping phase for a new open-source validator for PDF/A. This is a piece of the PREFORMA (PREservation FORMAts) project; other branches will cover TIFF and audio-visual formats. Participants in veraPDF are the Open Preservation Foundation, the PDF Association, the Digital Preservation Coalition, Dual Lab, and Keep Solutions.

Documents are available, including a functional and technical specification. The validator aims to be the “definitive” tool for determining whether a PDF document conforms to the ISO 19005 requirements. It will separate the PDF parser from the higher-level validation, so a different parser can be plugged in.
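I haven’t looked at the project’s code, but that separation presumably means the validation logic is written against an abstract parser interface rather than a concrete implementation. A purely hypothetical sketch of what that might look like (the names are mine, not veraPDF’s):

```java
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a pluggable-parser design. The interfaces and
// names are illustrative, not taken from the actual veraPDF code base.

/** Abstract view of a parsed PDF, produced by whichever parser is plugged in. */
interface PdfDocumentModel {
    boolean isEncrypted();
    boolean hasEmbeddedFiles();
}

/** Any PDF parser can sit behind this interface. */
interface PdfParser {
    PdfDocumentModel parse(InputStream pdf) throws Exception;
}

/** PDF/A checks written only against the abstract model. */
final class PdfAValidator {
    private final PdfParser parser;

    PdfAValidator(PdfParser parser) {
        this.parser = parser;
    }

    List<String> validate(InputStream pdf) throws Exception {
        PdfDocumentModel doc = parser.parse(pdf);
        List<String> errors = new ArrayList<>();
        // PDF/A forbids encryption at every conformance level.
        if (doc.isEncrypted()) {
            errors.add("Encryption is not permitted in PDF/A");
        }
        // Embedded files are forbidden in PDF/A-1 and only allowed with
        // restrictions in later parts.
        if (doc.hasEmbeddedFiles()) {
            errors.add("Embedded files are not permitted in PDF/A-1");
        }
        return errors;
    }
}
```

Swapping in a different parser would then just mean supplying another implementation of the parser interface, without touching the validation rules.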

Validating PDF is tough. In JHOVE, I designed PDF/A validation as an afterthought to the PDF module. PDF/A requirements affect every level of the implementation, so that approach led to problems that never entirely went away. Making PDF/A validation a primary goal should help greatly, but having it sit on top of, and independent of, the PDF parser may introduce another form of the same problem.

PDF files can include components which are outside the spec, and PDF/A-3 permits their inclusion. This means that really validating PDF/A-3 is an open-ended task. Even in the earlier version of PDF/A, not everything that can be put into a file is covered by the PDF specification per se. The specification addresses this by providing for extensibility; add-ons can address these aspects as desired. In particular, the core validator won’t attempt thorough validation of fonts.
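In practice that extensibility probably amounts to an extension point: the core validator applies the rules the PDF and PDF/A specifications define, and optional add-ons can dig into embedded components such as fonts. Again a hypothetical sketch, with names of my own invention rather than anything from the project:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical extension point for checking embedded components (fonts,
// ICC profiles, file attachments) that the core PDF spec doesn't fully
// define. Names are illustrative, not taken from the veraPDF project.

/** An embedded component extracted from the PDF, e.g. a font program. */
final class EmbeddedComponent {
    final String kind;   // e.g. "font", "icc-profile", "attachment"
    final byte[] data;

    EmbeddedComponent(String kind, byte[] data) {
        this.kind = kind;
        this.data = data;
    }
}

/** An add-on that knows how to check one kind of embedded component. */
interface ComponentValidator {
    boolean handles(EmbeddedComponent component);
    List<String> check(EmbeddedComponent component);
}

/** The core validator delegates embedded components to registered add-ons. */
final class ExtensibleValidator {
    private final List<ComponentValidator> addOns = new ArrayList<>();

    void register(ComponentValidator addOn) {
        addOns.add(addOn);
    }

    List<String> checkComponents(List<EmbeddedComponent> components) {
        List<String> errors = new ArrayList<>();
        for (EmbeddedComponent component : components) {
            for (ComponentValidator addOn : addOns) {
                if (addOn.handles(component)) {
                    errors.addAll(addOn.check(component));
                }
            }
            // A component with no registered add-on simply isn't examined,
            // which is why thorough font validation can remain optional.
        }
        return errors;
    }
}
```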

A Metadata Fixer will not just check documents for conformance, but in some cases will perform the necessary fixes to make a file PDF/A compliant.

JHOVE ignores the content streams, focusing only on the structure, so it could report a thoroughly broken file as well-formed and valid. JHOVE2 doesn’t list PDF in its modules. Analyzing the content stream data is a big task. In general, the project looks hugely ambitious, and not every ambitious digital preservation project has reached a successful end. If this one does, it will be a wonderful accomplishment.

Update on the JHOVE handover

There’s a brief piece by Becky McGuinness in D-Lib Magazine on the handover of JHOVE to the Open Preservation Foundation. It describes upcoming plans:

During March the OPF will be working with Portico and other members to complete the transfer of JHOVE to its new home. The latest code base will move to the OPF GitHub organisation page. All documentation, source code files, and full change history will be publicly available, alongside other OPF supported software projects, including JHOVE2, Fido, jpylyzer, and the SCAPE project tools.

Once the initial transfer is complete the next step will be to set up a continuous integration (CI) build on Travis, an online CI service that’s integrated with GitHub. This will ensure that all new code submissions are built and tested publicly and automatically, including all external pull requests. This will establish a firm foundation for future changes based on agile software development best practises.

With this foundation in place OPF will test and incorporate JHOVE fixes from the community into the new project. Several OPF members have already developed fixes based on their own automated processes, which they will be releasing to the community. Working as a group these fixes will be examined and tested methodically. At the same time the OPF’s priority will be to produce a Debian package that can be downloaded and installed from its apt repository.

Following the transfer OPF will gather requirements from its members and the wider digital preservation community. The OPF aims to establish and oversee a self-sustaining community around JHOVE that will take these requirements forward, carrying out roadmapping exercises for future development and maintenance. The OPF will also assess the need for specific training and support material for JHOVE such as documentation and online or virtual machine demonstrators.

It’s great to know that JHOVE still has a future a decade after its birth, but what boggles my mind is the next sentence:

The transfer of JHOVE is supported by its creators and developers: Harvard Library, Portico, the California Digital Library, and Gary McGath.

I never expected to see my name in a list like that!

A new home for JHOVE

Over a decade ago, the Harvard University Libraries took me on as a contractor to start work on JHOVE. Later I became an employee, and JHOVE formed an important part of my work. When I left Harvard, I asked for continued “custody” of JHOVE so I could keep maintaining it, and got it. Over time it became less of a priority for me; there’s only so much time you can devote to something when no one’s paying you to do it.

After a long period of discussion, the Open Preservation Foundation (formerly the Open Planets Foundation) has taken up support of JHOVE. In addition to picking up the open-source software, it’s resolved copyright issues in the documentation with Harvard, really over boilerplate that no one intended to enforce, but still an issue that had to be cleared.

Stephen Abrams, who was the real father of JHOVE, said, “We’re very pleased to see this transfer of stewardship responsibility for JHOVE to the OPF. It will ensure the continuity of maintenance, enhancement, and availability between the original JHOVE system and its successor JHOVE2, both key infrastructural components in wide use throughout the digital library community.”

JHOVE2 was originally supposed to be the successor to JHOVE, but it didn’t get enough funding to cover all the formats that JHOVE covers, so both are used, and the confusion of names is unfortunate. OPF has both in its portfolio. It doesn’t appear to have forked JHOVE to its GitHub repository yet, but I’m sure that’s coming soon.

My own GitHub repository for JHOVE should now be considered archival. Go forth and prosper, JHOVE.

Article on PDF/A validation with JHOVE

An article by Yvonne Friese does a good job of explaining the limitations of JHOVE in validating PDF/A. At the time that I wrote JHOVE, I wasn’t aware how few people had managed to write a PDF validator independent of Adobe’s code base; if I’d known, I might have been more intimidated. It’s a complex job, and adding PDF/A validation as an afterthought added to the problems. JHOVE validates only the file structure, not the content streams, so it can miss errors that make a file unusable. Finally, I’ve never updated JHOVE to PDF 1.7, so it doesn’t address PDF/A-2 or 3.

I do find the article flattering; it’s nice to know that even after all these years, “many memory institutions use JHOVE’s PDF module on a daily basis for digital long term archiving.” The Open Preservation Foundation is picking up JHOVE, and perhaps it will provide some badly needed updates.

Open Planets Foundation is now Open Preservation Foundation

The Open Planets Foundation is now the Open Preservation Foundation. This name change reflects its function; the old name grew out of the Planets project and never really made sense.

For the present, it’s still found on the Internet as openplanetsfoundation.org.

Library of Congress format recommendations

The Library of Congress has issued a set of recommendations for formats for both physical and digital documents. The LoC’s digital preservation blog has an interview with Ted Westervelt of the LoC on their development. They’re not just for the library’s own staff, he explains, but for “all stakeholders in the creative process.”

The guidelines repeatedly state: “Files must contain no measures that control access to or use of the digital work (such as digital rights management or encryption).” That’s pushback that can’t be ignored. In some cases, though, the message is mixed. For theatrically released films, standard or recordable Blu-ray is accepted, but the boilerplate against DRM is included. I don’t know where they expect to get DRM-free Blu-ray; DRM-free options are few when it comes to big-name movies.

It’s also interesting that software, specifically games and learning materials, is included. This has been a growing area of interest in recent years. Rather than relying on emulation, the recommendations call for source code, documentation, and a specification of the exact compiler used to build the application.

There’s material here to fuel constructive debate and expansion for years.