For years I wrote most of the code for JHOVE. With each format, I wrote tests for whether a file is “well-formed” and “valid.” With most formats, I never knew exactly what these terms meant. They come from XML, where they have clear meanings. A well-formed XML file has correct syntax. Angle brackets and quote marks match. Closing tags match opening tags. A valid file is well-formed and follows its schema. A file can be well-formed but not valid, but it can’t be valid without being well-formed.
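Here's a minimal sketch of that distinction in Python. It isn't how JHOVE does anything. The standard library parser only checks well-formedness, so the "schema" below is a made-up stand-in: a hypothetical rule that the root element must be `record` and must contain a `date` child. Real validation would use a DTD or XML Schema with a validating parser.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the text parses as XML, i.e. its syntax is correct."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


def is_valid(text: str) -> bool:
    """Return True if the text is well-formed AND follows the toy schema.

    Hypothetical schema (an assumption for illustration only):
    the root element is <record> and it has a <date> child.
    """
    if not is_well_formed(text):
        # A file can't be valid without being well-formed.
        return False
    root = ET.fromstring(text)
    return root.tag == "record" and root.find("date") is not None


good = "<record><date>2017-11-30</date></record>"
no_date = "<record><title>Untitled</title></record>"
broken = "<record><date></record>"  # closing tag doesn't match opening tag

for sample in (good, no_date, broken):
    print(is_well_formed(sample), is_valid(sample))
```

The second sample is the interesting case: its syntax is fine, so it's well-formed, but it breaks the (invented) schema rule, so it isn't valid.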
With most other formats, there’s no definition of these terms. JHOVE applies them anyway. (I wrote the code, but I didn’t design JHOVE’s architecture. Not my fault.) I approached them by treating “well-formed” as meaning syntactically correct, and “valid” as meaning semantically correct. Drawing the line wasn’t always easy. If a required date field is missing, is the file not well-formed or just not valid? What if the date is supposed to be in ISO 8601 format but isn’t? How much does it matter?
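To make the date question concrete, here's a sketch of one way to draw the line. Everything in it is an assumption for illustration: the `fields` dictionary stands in for whatever a parser extracted from a file, and the policy (missing field means not well-formed, bad date text means not valid) is just one defensible choice, not JHOVE's actual rule for any format.

```python
from datetime import date
from enum import Enum


class Status(Enum):
    NOT_WELL_FORMED = "not well-formed"
    WELL_FORMED_NOT_VALID = "well-formed, but not valid"
    WELL_FORMED_AND_VALID = "well-formed and valid"


def check_date_field(fields: dict) -> Status:
    """Classify a parsed record by its required 'date' field.

    Policy (an arbitrary choice for this sketch):
    - field missing or not text -> structural problem -> not well-formed
    - field present but not an ISO 8601 calendar date -> not valid
    """
    value = fields.get("date")
    if not isinstance(value, str):
        return Status.NOT_WELL_FORMED
    try:
        # Accepts YYYY-MM-DD; newer Python versions accept more ISO 8601 forms.
        date.fromisoformat(value)
    except ValueError:
        return Status.WELL_FORMED_NOT_VALID
    return Status.WELL_FORMED_AND_VALID


print(check_date_field({"date": "2017-11-30"}).value)
print(check_date_field({"date": "November 30, 2017"}).value)
print(check_date_field({"title": "No date here"}).value)
```

Swapping the two failure results would be just as easy to justify, which is exactly the problem with applying these terms to formats that don't define them.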
It’s been too long since I’ve had a special discount on FTL. For all of June, you can get Files that Last: Digital Preservation for Everygeek on Smashwords for just $4.00. That’s half off the regular price! The coupon code is KC49Z.
FTL is aimed at anyone with a moderate level of technical knowledge who’s concerned with keeping files from becoming useless over the years. It covers formats, metadata, media, file systems, and more.
The book is 100% DRM-free on Smashwords. I’ve done my best to keep it that way when it’s sold through other platforms but can’t always guarantee it.
What’s the format of a Google Docs file? The question may not even be meaningful. According to Jenny Mitcham at the University of York, there is no such thing as a Google Docs file. What you see when you open a document is an assembly of information from a database. You can export it in various file formats, but the exported file isn’t identical to the Google document.
This makes Google Docs documents risky from a preservation standpoint. You can’t save an exact local backup of a document, only an export in some other format. If you lose your Google account, or if censorship in your country cuts you off from it, you lose all your documents.
When you offer expert advice on something, such as digital preservation, you have to admit your own errors. I very nearly lost my 2016 tax return. When I tried to open it in TurboTax, the application just did nothing. I hadn’t exported it to a generally usable format. The TurboTax file format is proprietary and opaque.
Today is International Digital Preservation Day.
In honor of the day, I’m offering Files that Last: Digital Preservation for Everygeek on Smashwords at its lowest price ever. Today only, you can get it for $0.99 with the coupon code AM26N. This is a one-day sale, so get it now if you don’t already have it!
There are new releases of VeraPDF and JHOVE today.
This XKCD cartoon showed up in my Twitter feed more times in one day than any previous one, for reasons that should be obvious.
Is PDF/A a good archival format? Many institutions use it, but it has problems which are inherent in PDF. With PDF/A-3, it has lost some of its focus. A format which can be a container for any kind of content isn’t great for digital preservation.
An article by Marco Klindt of the Zuse Institute Berlin takes a strong position against its suitability, with the title “PDF/A considered harmful for digital preservation.” Carl Wilson at the Open Preservation Foundation has added his own thoughts with “PDF/A and Long Term Preservation.”
The Library of Congress has reorganized its site on file format sustainability and given it a new URL. (The old one redirects there.) A blog entry discusses the change. Relationships among formats are a big part of the site. It’s significant, for instance, that the MP3 encoding and the de facto MP3 file format get separate entries.
My reactions are mixed. When you click “Format Descriptions” on the main page, you get a page titled “Format Description Categories.” The nesting description at the top says you’re in “Format Descriptions as XML.” Eight categories are listed, and two formats plus “All xxx format descriptions” are listed under each category. There’s no obvious reason why those two formats get special prominence, or what the page has to do with XML.
My brief post yesterday on the TI/A initiative provoked a lively discussion on Twitter, mostly on whether archival formats should allow compression. The argument against compression rests on the premise that archives should be able to deal with files that have a few bit errors in them. This is a badly mistaken idea.