Is PDF/A a good archival format? Many institutions use it, but it has problems inherent in PDF, and with PDF/A-3 it has lost some of its focus. A format that can be a container for any kind of content isn't great for digital preservation.
An article by Marco Klindt of the Zuse Institute Berlin takes a strong position against its suitability, with the title “PDF/A considered harmful for digital preservation.” Carl Wilson at the Open Preservation Foundation has added his own thoughts with “PDF/A and Long Term Preservation.”
The Library of Congress has reorganized its site on file format sustainability and given it a new URL. (The old one redirects there.) A blog entry discusses the change. Relationships among formats are a big part of the site. It’s significant, for instance, that the MP3 encoding and the de facto MP3 file format get separate entries.
My reactions are mixed. When you click “Format Descriptions” on the main page, you get a page titled “Format Description Categories.” The nesting description at the top says you’re in “Format Descriptions as XML.” Eight categories are listed, each with two formats plus an “All xxx format descriptions” link. There’s no obvious reason why those two formats get special prominence, or what the page has to do with XML.
My brief post yesterday on the TI/A initiative provoked a lively discussion on Twitter, mostly about whether archival formats should allow compression. The case against compression rests on the claim that archives should be able to deal with files that have a few bit errors in them. This is a badly mistaken idea.
Today’s XKCD comic comments on digital preservation in Randall Munroe’s usual style.
How big a concern is physical degradation of files, aka “bit rot,” to digital preservation? Should archives eschew data compression in order to minimize the effect of lost bits? In most of my experience, no one’s raised that as a major concern, but some contributors to the TI/A initiative consider it important enough to affect their recommendations.
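For readers wondering what's at stake, here's a small Python sketch (my own illustration, not anything from the TI/A discussion) of the mechanism the anti-compression camp worries about: a single flipped bit in plain text garbles one character, while the same damage to a zlib-compressed copy almost always makes decompression fail outright.

```python
import zlib

# Hypothetical archival payload: repetitive text that compresses well.
text = b"Archival data, repeated so the compressor has something to chew on. " * 100
compressed = zlib.compress(text)

def flip_bit(data: bytes, byte_index: int) -> bytes:
    """Return a copy of data with the low bit of one byte flipped."""
    damaged = bytearray(data)
    damaged[byte_index] ^= 0x01
    return bytes(damaged)

# One flipped bit in the plain text garbles exactly one byte.
damaged_plain = flip_bit(text, 500)
bytes_changed = sum(a != b for a, b in zip(text, damaged_plain))

# The same damage mid-stream in the compressed copy almost always breaks
# decompression entirely: either deflate decoding derails, or the
# Adler-32 checksum at the end no longer matches.
damaged_comp = flip_bit(compressed, len(compressed) // 2)
try:
    zlib.decompress(damaged_comp)
    decompression_survived = True
except zlib.error:
    decompression_survived = False

print(bytes_changed)  # 1
print(decompression_survived)
```

Of course, archives typically address this with checksums and redundant copies rather than by forgoing compression; the sketch just shows why the question keeps coming up.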
The Open Preservation Foundation has just announced JHOVE 1.14. The numbering is a bit odd. Version 1.12 never made it to release, and they seem to have skipped 1.13 entirely.
This includes three new modules: the PNG module, which I wrote on a weekend whim, and GZIP and WARC modules adapted from JHOVE2. The UTF-8 module now supports Unicode 7.0.
The release isn’t showing up yet on the OPF website, but I expect that will happen momentarily.
It’s nice to see that the code which I started working on over a decade ago is still alive and useful. Congratulations and thanks to Carl Wilson, who’s now its principal maintainer!
The Unified Digital Format Registry (UDFR), created and maintained by the California Digital Library, will shut down on April 15, 2016. I don’t know whether the whole site will go away or just the ability to query the registry.
Information Standards Quarterly has an article on UDFR by Andrea Goethals. The source code repository is on GitHub.
The predecessor project, GDFR, never got to publicly usable status. The site gdfr.info still responds to pings, but apparently not to HTTP requests.
Quoting its description here, so it’s saved in at least one place if the site completely goes away:
The UDFR is a reliable, publicly accessible, and sustainable knowledge base of file format representation information for use by the digital preservation community.
A format is a set of semantic and syntactic rules governing the mapping between abstract information and its representation in digital form. While many worthwhile and necessary preservation activities can be performed on a digital asset without knowledge of its format, that is, merely as a sequence of bits, any higher-level preservation of the underlying information content must be performed in the context of the asset’s format.
The UDFR seeks to “unify” the function and holdings of two existing registries, PRONOM and GDFR (the Global Digital Format Registry), in an open source, semantically enabled, and community supported platform.
The UDFR was developed by the University of California Curation Center (UC3) at the California Digital Library (CDL), funded by the Library of Congress as part of its National Digital Information Infrastructure Preservation Program (NDIIPP). The service is implemented on top of the OntoWiki semantic wiki and Virtuoso triple store.
Almost all the published books on digital preservation are academic writing for a very limited audience. My own Files that Last wasn’t intended for a tiny audience but ended up that way. The chances look better for Abby Smith Rumsey’s upcoming When We Are No More: How Digital Memory Is Shaping Our Future.
What would you say about data storage with a lifetime of billions of years? I’d say that extraordinary claims require extraordinary support. The University of Southampton’s Optoelectronics Research Center says it’s developed digital storage that will last for 13.8 billion years at 190°C — or at least that’s how it came out in the report. Peter Kazansky says “we have created the first document which will likely survive the human race.” (And the death of the Sun?)