Today is International Digital Preservation Day.
In honor of the day, I’m offering Files that Last: Digital Preservation for Everygeek on Smashwords at its lowest price ever. Today only, you can get it for $0.99 with the coupon code AM26N. This is a one-day sale, so get it now if you don’t already have it!
There are new releases of VeraPDF and JHOVE today.
This XKCD cartoon showed up in my Twitter feed more times in one day than any previous one, for reasons that should be obvious.
Is PDF/A a good archival format? Many institutions use it, but it has problems which are inherent in PDF. With PDF/A-3, it has lost some of its focus. A format which can be a container for any kind of content isn’t great for digital preservation.
An article by Marco Klindt of the Zuse Institute Berlin takes a strong position against its suitability, with the title “PDF/A considered harmful for digital preservation.” Carl Wilson at the Open Preservation Foundation has added his own thoughts with “PDF/A and Long Term Preservation.”
The Library of Congress has reorganized its site on file format sustainability and given it a new URL. (The old one redirects there.) A blog entry discusses the change. Relationships among formats are a big part of the site. It’s significant, for instance, that the MP3 encoding and the de facto MP3 file format get separate entries.
My reactions are mixed. When you click “Format Descriptions” on the main page, you get a page titled “Format Description Categories.” The nesting description at the top says you’re in “Format Descriptions as XML.” Eight categories are listed, and two formats plus “All xxx format descriptions” are listed under each category. There’s no obvious reason why those two formats get special prominence, and it’s unclear what the page has to do with XML.
My brief post yesterday on the TI/A initiative provoked a lively discussion on Twitter, mostly on whether archival formats should allow compression. The case against compression rests on the premise that archives should be able to deal with files that have a few bit errors in them. This is a badly mistaken idea.
Today’s XKCD comic comments on digital preservation in Randall Munroe’s usual style.
How big a concern is physical degradation of files, aka “bit rot,” to digital preservation? Should archives eschew data compression in order to minimize the effect of lost bits? In most of my experience, no one’s raised that as a major concern, but some contributors to the TI/A initiative consider it important enough to affect their recommendations.
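To make the question concrete, here is a minimal sketch of what a single flipped bit does to the same content stored raw and stored compressed. It uses Python's standard zlib module only as a stand-in for whatever compression an image format might use. The sample content and the helper names (flip_bit, count_differing_bytes, try_decompress) are made up for illustration and aren't taken from the TI/A discussion.

```python
import zlib


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with exactly one bit inverted."""
    damaged = bytearray(data)
    damaged[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(damaged)


def count_differing_bytes(a: bytes, b: bytes) -> int:
    """Count positions where two equal-length byte strings differ."""
    return sum(x != y for x, y in zip(a, b))


def try_decompress(blob: bytes):
    """Return the decompressed bytes, or None if zlib rejects the stream."""
    try:
        return zlib.decompress(blob)
    except zlib.error:
        return None


# Made-up sample content standing in for an archived file.
original = b"Files that last need bits that last. " * 200
compressed = zlib.compress(original)

# Damage one bit near the middle of each representation.
raw_damaged = flip_bit(original, len(original) * 4)
packed_damaged = flip_bit(compressed, len(compressed) * 4)

print("uncompressed: bytes changed =", count_differing_bytes(original, raw_damaged))

recovered = try_decompress(packed_damaged)
if recovered is None:
    print("compressed: zlib rejected the damaged stream")
else:
    print("compressed: stream decoded; matches original =", recovered == original)
```

In the raw copy the damage stays confined to one byte. In the compressed copy, a flipped bit typically makes zlib reject the whole stream, since the format carries a checksum over the decompressed data. That is the behavior the anti-compression side is worried about; whether an archive should be designed around reading damaged files at all is a separate question.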
The Open Preservation Foundation has just announced JHOVE 1.14. The numbering is a bit odd. Version 1.12 never made it to release, and they seem to have skipped 1.13 entirely.
This release includes three new modules: the PNG module, which I wrote on a weekend whim, and GZIP and WARC modules adapted from JHOVE2. The UTF-8 module now supports Unicode 7.0.
The release isn’t showing up yet on the OPF website, but I expect that will happen momentarily.
It’s nice to see that the code which I started working on over a decade ago is still alive and useful. Congratulations and thanks to Carl Wilson, who’s now its principal maintainer!
The Unified Digital Format Registry (UDFR), created and maintained by the California Digital Library, will shut down on April 15, 2016. I don’t know whether the whole site will go away or just the ability to query the registry.
Information Standards Quarterly has an article on UDFR by Andrea Goethals. The source code repository is on GitHub.
The predecessor project, GDFR, never got to publicly usable status. The site gdfr.info still responds to pings, but apparently not to HTTP requests.
I’m quoting its description here so that it’s saved in at least one place if the site goes away completely:
The UDFR is a reliable, publicly accessible, and sustainable knowledge base of file format representation information for use by the digital preservation community.
A format is a set of semantic and syntactic rules governing the mapping between abstract information and its representation in digital form. While many worthwhile and necessary preservation activities can be performed on a digital asset without knowledge of its format, that is, merely as a sequence of bits, any higher-level preservation of the underlying information content must be performed in the context of the asset’s format.
The UDFR seeks to “unify” the function and holdings of two existing registries, PRONOM and GDFR (the Global Digital Format Registry), in an open source, semantically enabled, and community supported platform.
The UDFR was developed by the University of California Curation Center (UC3) at the California Digital Library (CDL), funded by the Library of Congress as part of its National Digital Information Infrastructure Preservation Program (NDIIPP). The service is implemented on top of the OntoWiki semantic wiki and Virtuoso triple store.