Monthly Archives: December 2012

A preservation hazard in OpenOffice

While playing with OpenOffice in my research for Files that Last, I came across a preservation risk. I copied an image from a website and pasted it into a text document, then looked at the resulting XML. The image data wasn’t anywhere in content.xml or anywhere else in the overall ZIP document. Instead, I found this:


<draw:image
xlink:href="http://plan-b-for-openoffice.org/resources/images/x180x60_3_get.png.pagespeed.ic.fjV0teeVb_.png"
xlink:type="simple"
xlink:show="embed"
xlink:actuate="onLoad"/>

The source for the image is on the Web. This means that if the URL stops working, the document loses the image. That’s a poor plan for long-term storage.

The way to avoid this is to use Edit > Paste special and paste the image as a bitmap. It can be a pain to remember to do this. You may be able to catch images that are pasted by reference, since there can be a brief delay while just a box with the URL is displayed before the image comes up.
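
Checking for this hazard can also be automated. The following is a minimal sketch, not part of OpenOffice or any tool mentioned here: it opens the document as a ZIP package, parses content.xml, and lists every draw:image whose xlink:href points outside the package. The function name `find_linked_images` is hypothetical. The rule for what counts as "outside" (a URI scheme or a `../` prefix) is an assumption.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace URIs used by OpenDocument for draw:image and xlink:href.
DRAW_NS = "urn:oasis:names:tc:opendocument:xmlns:drawing:1.0"
XLINK_NS = "http://www.w3.org/1999/xlink"


def find_linked_images(odf_path):
    """Return the hrefs of images that are linked rather than embedded.

    Assumption: an embedded image has a package-relative href such as
    "Pictures/abc.png", while a linked one has a URI scheme ("http://",
    "file://") or climbs out of the package with "../".
    """
    with zipfile.ZipFile(odf_path) as package:
        root = ET.fromstring(package.read("content.xml"))

    linked = []
    for image in root.iter("{%s}image" % DRAW_NS):
        href = image.get("{%s}href" % XLINK_NS, "")
        if "://" in href or href.startswith("../"):
            linked.append(href)
    return linked
```

Calling `find_linked_images("report.odt")` on a document like the one above should return a list containing the plan-b-for-openoffice.org URL. An empty list suggests that every image is stored inside the package.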

Sneaky little preservation hazards like this (and the earlier one mentioned with Adobe Illustrator files) are the kind of thing you’ll find when Files that Last comes out.

JHOVE Tips for Developers: Call for proofreaders

As a practice run for publishing Files that Last on Smashwords, I’ve put together a small but hopefully useful e-booklet, JHOVE Tips for Developers, which I’m planning to put up there on a “choose your own price” basis. This will help me work out the process of creating the book on a small scale, and maybe it will buy me a Whopper and fries.

For a book of this sort I obviously can’t afford paid proofreading, but I’m hoping one or two people might look it over before I submit it. You can get the draft as a PDF here.

I’d offer you a free copy in return, but you can get that anyway. What I can do is offer people who give useful feedback credit in the book, as well as my personal thanks.

When is a PDF not a PDF?

Yesterday I was doing some experiments with Adobe Illustrator. According to some web sites, the CS5 version saves its files as PDF, though with the extension .AI. When you save a file, though, the options dialog has a checkbox labeled “Create PDF Compatible File.” I unchecked it and saved the file, then opened it in JHOVE. JHOVE says it’s perfectly good PDF, and indeed PDF/A. Then I tried opening it in Preview, and this is what it looked like:

[Image: the file says over and over that it was saved without PDF content]

If you don’t actually look at the file but trust the mere fact that it’s a PDF, you might put it into a repository and find out later on that it’s worthless as a PDF. What’s happening is that PDF can embed any kind of content, and this one embeds its native PGF data. Any PDF reader can open the file, but only an application that understands PGF can use its actual content. Anyone putting PDF into a repository should be aware of this risk.

It’s outside the scope of JHOVE to check whether embedded content is acceptable to PDF/A, so the claim that it’s correct PDF/A is probably spurious. It is, however, definitely legal PDF.

This type of situation helps to show why PDF/A-3 is a bad idea.

JHOVE 1.9

I’ve put up JHOVE 1.9 on the SourceForge site today. I think it’s the
least buggy version ever. Please let me know if I’m wrong.

Release notes:

GENERAL

  1. Jhove.java and JhoveView.java now get their version information from
    JhoveBase.java. Before it was redundantly kept in three places, and
    sometimes they didn’t all get updated for a new release. Like in 1.8.
  2. ConfigWriter was in the package edu.harvard.hul.ois.jhove.viewer, which
    caused a NoClassDefFoundError if non-GUI configurations didn’t include
    JhoveViewer.jar in the classpath. It’s been moved to
    edu.harvard.hul.ois.jhove.
  3. Added script packagejhove.sh and made md5.pl part of the CVS repository
    to make packaging for delivery easier.
  4. jhove.bat now simply uses the Java command rather than requiring
    the user to set up the Java path.
  5. JhoveView.jar and jhove (the top level shell script) are now forced
    by ant to be executable so there are no mistakes.
  6. Warning message given on invalid buffer size string, and minimum
    buffer size is 1024.
  7. Configuration file code for adding handlers and giving init strings
    to modules was an awful mess that never could have worked. Major repairs done.

AIFF MODULE

  1. If an AIFF file was found to be little-endian, the module instance
    would stay in little-endian mode for all subsequent files. This
    has been fixed.
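
This kind of bug is easy to reproduce in any module object that is reused across files. The sketch below is illustrative only. It is written in Python rather than JHOVE's Java, and the class `AiffLikeModule` and its 8-byte input layout (a 4-byte tag followed by a 4-byte integer) are invented for the example. The "sowt" tag is borrowed from AIFF-C, where it marks little-endian sample data. The point is the reset at the top of `parse`: without it, a flag set by one file leaks into the next.

```python
import struct


class AiffLikeModule:
    """Hypothetical stand-in for a format module reused for many files."""

    def __init__(self):
        self.little_endian = False

    def parse(self, data):
        # Reset per-file state first. Omitting this line reproduces the bug:
        # once one file sets the flag, every later file is read little-endian.
        self.little_endian = False

        # Invented layout: a 4-byte tag, then a 4-byte unsigned integer.
        if data[:4] == b"sowt":
            self.little_endian = True

        fmt = "<I" if self.little_endian else ">I"
        return struct.unpack(fmt, data[4:8])[0]
```

With the reset in place, parsing a "sowt" input and then a big-endian input on the same instance decodes both values correctly.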

TIFF MODULE

  1. TIFF files that had strip or tile offsets but no corresponding byte
    counts were throwing an exception all the way to the top level. Now
    they’re correctly being reported as invalid.

XML MODULE

  1. Cleaned up reporting of schemas. Added some small classes to replace
    the use of string arrays for information structures. Made URI comparison
    for local schema parameter case-independent. Resolved conflict between
    “s” and “schema” parameters.

WAVE MODULE

  1. Some uncaught exceptions caused the module to throw all the way
    back to JhoveBase and not report any result for certain defective
    files. These now report the file as not well-formed.
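
The general pattern behind this fix is to catch parsing failures inside the module and turn them into a result, instead of letting them propagate to the caller. The following is a rough Python sketch of that pattern, not JHOVE's actual Java code. `Result`, `parse_chunk_header`, and `check_file` are hypothetical names, and the only check performed is on the 8-byte RIFF header.

```python
import struct
from dataclasses import dataclass, field


@dataclass
class Result:
    """Hypothetical report object: one per file examined."""
    well_formed: bool = True
    messages: list = field(default_factory=list)


def parse_chunk_header(data):
    # A RIFF chunk header is a 4-byte ID plus a little-endian 32-bit size.
    # struct.error is raised if fewer than 8 bytes are available.
    return struct.unpack("<4sI", data[:8])


def check_file(data):
    result = Result()
    try:
        chunk_id, _size = parse_chunk_header(data)
        if chunk_id != b"RIFF":
            raise ValueError("missing RIFF signature")
    except (struct.error, ValueError) as exc:
        # Report the defect rather than letting the exception escape.
        result.well_formed = False
        result.messages.append(str(exc))
    return result
```

A truncated or mislabeled input then yields a `Result` with `well_formed` set to `False` and a message, so the caller always has something to report.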

Digital preservation song

My daily update on the Files that Last blog includes a new song about digital preservation. It’s to promote my Kickstarter campaign for Files that Last and shares the book’s title, but you might find it fun in its own right. Naturally there’s a WAVE file in addition to the MP3. Links are appreciated.

Kickstarter launch: Files That Last

It’s started! Today I’m launching a Kickstarter campaign to help fund the completion and publication of my e-book, Files That Last. Rather than repeat everything I’ve said on the Kickstarter page and the homepage for the book, I’ll say just enough to convince you, as someone who cares about formats and digital preservation, that it’s worth looking at those pages and considering helping to fund the book and spread the word.

So far there isn’t, as far as I know, a book to promote and explain digital preservation to people who understand computers but aren’t part of the library and archiving world. That’s where I’m aiming this book. If you look at the Library of Congress’s personal archiving pages, that gives you some idea of what I’m aiming at, though I’m also addressing nonprofit organizations and businesses. It’s not a book for programmers, but it will have enough technical detail to give an understanding of how formats, metadata, and media affect the longevity of files and how to make best use of them.

If you pledge $10, you’ll get an electronic copy of the book when it’s done (DRM-free, naturally). For just $100, you can use it as a classroom text and distribute it to up to 50 students!

If you want brief, regular updates on the project, add this URL to your RSS reader.

I’m counting on your support to help make this happen, whether you pledge money, spread the word, or both. I’m excited about getting the book out, and I think you will be too when you see it.

And … JHOVE 1.9b3

Lately I’ve been writing a user guide for JHOVE as part of an upcoming
book. This means going through all the features to see how they really
work, and this has turned up a number of bugs. Among the problems fixed
most recently: (1) When the AIFF module encountered a little-endian
file, it treated all subsequent files as little-endian whether they
were or not. (2) Certain errors in WAVE files threw an exception from
the module instead of reporting that the file wasn’t well-formed.
(3) The XML module’s “s” and “schema” parameters conflicted, with
“schema” being treated as both, and there was a problem with schema
URIs containing upper-case characters.

Version 1.9b3 should fix all of these. Hopefully I won’t find anything
else that needs fixing soon, so we can finally have a 1.9 release. But
if there are any problems with this beta, please let me know!

JHOVE 1.9b2

JHOVE 1.9b2 is up, fixing issues with the configuration file. The code for editing the configuration file from the GUI was just completely broken, but I think it’s fixed now. I can’t imagine anyone was ever trying to add init strings to modules (none of the standard ones use one anyway) or add handlers using the GUI, or someone would already have noticed. But I couldn’t stand having it not fixed, so the new build is there.