Tag Archives: preservation

When is a PDF not a PDF?

Yesterday I was doing some experiments with Adobe Illustrator. According to some web sites, the CS5 version saves its files as PDF, though with the extension .AI. When you save a file, however, the options dialog has a checkbox labeled “Create PDF Compatible File.” I unchecked it and saved the file, then opened it in JHOVE. JHOVE says it’s perfectly good PDF, and indeed PDF/A. Then I tried opening it in Preview, and this is what it looked like:

[Screenshot: the file says over and over that it was saved without PDF content]

If you don’t actually look at the file but trust the mere fact that it’s a PDF, you might put it into a repository and find out later on that it’s worthless as a PDF. What’s happening is that PDF can embed any kind of content, and this one embeds its native PGF data. Any PDF reader can open the file, but only an application that understands PGF can use its actual content. Anyone putting PDF into a repository should be aware of this risk.
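A rough first check for this kind of file can be scripted. The Python sketch below only confirms the `%PDF-` header and then searches the raw bytes for a few marker strings. The marker list is an assumption chosen for illustration: `/EmbeddedFiles` is the PDF name used for attached files, and `AIPrivateData` is what Illustrator's private data keys appear to be called. Markers tucked inside compressed object streams will be missed, so a clean result is inconclusive, and `scan_pdf_markers` is a hypothetical helper, not part of JHOVE.

```python
def scan_pdf_markers(data, markers=(b"/EmbeddedFiles", b"AIPrivateData")):
    """Return (looks_like_pdf, found_markers) for the raw bytes of a file.

    A crude heuristic, not a validator: it never parses the PDF structure.
    """
    # The PDF header is expected at, or very near, the start of the file.
    looks_like_pdf = b"%PDF-" in data[:1024]
    # Report which of the assumed marker strings occur anywhere in the bytes.
    found = [m.decode("ascii") for m in markers if m in data]
    return looks_like_pdf, found
```

To try it on a real file, read the file in binary mode and pass the bytes in, for example `scan_pdf_markers(open(path, "rb").read())`. A hit is a reason to open the file and look at it, not a verdict.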

It’s outside the scope of JHOVE to check whether embedded content is acceptable to PDF/A, so the claim that it’s correct PDF/A is probably spurious. It is, however, definitely legal PDF.

This type of situation helps to show why PDF/A-3 is a bad idea.

JHOVE 1.9

I’ve put up JHOVE 1.9 on the SourceForge site today. I think it’s the
least buggy version ever. Please let me know if I’m wrong.

Release notes:

GENERAL

  1. Jhove.java and JhoveView.java now get their version information from
    JhoveBase.java. Before it was redundantly kept in three places, and
    sometimes they didn’t all get updated for a new release. Like in 1.8.
  2. ConfigWriter was in the package edu.harvard.hul.ois.jhove.viewer, which
    caused a NoClassDefFoundError if non-GUI configurations didn’t include
    JhoveViewer.jar in the classpath. It’s been moved to
    edu.harvard.hul.ois.jhove.
  3. Added script packagejhove.sh and made md5.pl part of the CVS repository
    to make packaging for delivery easier.
  4. jhove.bat now simply uses the Java command rather than requiring
    the user to set up the Java path.
  5. JhoveView.jar and jhove (the top level shell script) are now forced
    by ant to be executable so there are no mistakes.
  6. Warning message given on invalid buffer size string, and minimum
    buffer size is 1024.
  7. Configuration file code for adding handlers and giving init strings
    to modules was an awful mess that never could have worked. Major repairs done.
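
The buffer size behavior in item 6 can be sketched in a few lines. This is Python rather than JHOVE's Java, and it is not the actual JHOVE code: the function name, the default value, and the choice to fall back to the default on an invalid string are all assumptions made for the example. Only the warning and the 1024 minimum come from the notes above.

```python
import warnings

MIN_BUFFER_SIZE = 1024         # minimum stated in the release notes
DEFAULT_BUFFER_SIZE = 131072   # hypothetical default, chosen for this sketch


def parse_buffer_size(text, default=DEFAULT_BUFFER_SIZE):
    """Parse a buffer size string, warning on bad input and enforcing the minimum."""
    try:
        size = int(text.strip())
    except (ValueError, AttributeError):
        # Assumed behavior: warn and fall back to the default size.
        warnings.warn(f"Invalid buffer size {text!r}; using {default}")
        return default
    # Anything below the minimum is raised to the minimum.
    return max(size, MIN_BUFFER_SIZE)
```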

AIFF MODULE

  1. If an AIFF file was found to be little-endian, the module instance
    would stay in little-endian mode for all subsequent files. This
    has been fixed.

TIFF MODULE

  1. TIFF files that had strip or tile offsets but no corresponding byte
    counts were throwing an exception all the way to the top level. Now
    they’re correctly being reported as invalid.
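
As an illustration of the check involved, here is a minimal Python sketch that reports missing byte counts as errors instead of raising an exception. The tag numbers are the standard TIFF ones; the dictionary representation of an IFD and the function name are assumptions for the example, and JHOVE's own Java code is organized differently.

```python
# (offsets tag, byte-counts tag, label) triples using standard TIFF tag numbers.
OFFSET_COUNT_TAGS = [
    (273, 279, "strip"),   # StripOffsets / StripByteCounts
    (324, 325, "tile"),    # TileOffsets / TileByteCounts
]


def check_offsets_have_counts(ifd):
    """Return a list of error messages for an IFD given as {tag: values}."""
    errors = []
    for offsets_tag, counts_tag, kind in OFFSET_COUNT_TAGS:
        # Offsets without matching byte counts make the file invalid.
        if offsets_tag in ifd and counts_tag not in ifd:
            errors.append(
                f"{kind} offsets (tag {offsets_tag}) present "
                f"without byte counts (tag {counts_tag})"
            )
    return errors
```

An empty list means this particular check passed; a non-empty list is what a caller would turn into a "not valid" result.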

XML MODULE

  1. Cleaned up reporting of schemas. Added some small classes to replace
    the use of string arrays for information structures. Made URI comparison
    for local schema parameter case-independent. Resolved conflict between
    “s” and “schema” parameters.

WAVE MODULE

  1. Some uncaught exceptions caused the module to throw all the way
    back to JhoveBase and not report any result for certain defective
    files. These now report the file as not well-formed.

Digital preservation song

My daily update on the Files that Last blog includes a new song about digital preservation. It’s to promote my Kickstarter campaign for Files that Last and shares the book’s title, but you might find it fun in its own right. Naturally there’s a WAVE file in addition to the MP3. Links are appreciated.

Kickstarter launch: Files That Last

It’s started! Today I’m launching a Kickstarter campaign to help fund the completion and publication of my e-book, Files That Last. Rather than repeat everything I’ve said on the Kickstarter page and the homepage for the book, I’ll say just enough to convince you, as someone who cares about formats and digital preservation, that it’s worth looking at those pages and considering helping to fund the book and spread the word.

So far there isn’t, as far as I know, a book to promote and explain digital preservation to people who understand computers but aren’t part of the library and archiving world. That’s where I’m aiming this book. If you look at the Library of Congress’s personal archiving pages, that gives you some idea of what I’m aiming at, though I’m also addressing nonprofit organizations and businesses. It’s not a book for programmers, but it will have enough technical detail to give an understanding of how formats, metadata, and media affect the longevity of files and how to make best use of them.

If you pledge $10, you’ll get an electronic copy of the book when it’s done (DRM-free, naturally). For just $100, you can use it as a classroom text and distribute it to up to 50 students!

If you want brief, regular updates on the project, add this URL to your RSS feed.

I’m counting on your support to help make this happen, whether you pledge money, spread the word, or both. I’m excited about getting the book out, and I think you will be too when you see it.

The disappearing format blues

Old formats sometimes fade into obscurity and can no longer be supported, even if they come from a big company like Microsoft. Chris Rusbridge has noted that Microsoft’s Open Specifications page only goes as far back as Office 97, and that PowerPoint 4.0 files can’t be opened with today’s Microsoft Office. Tony Hey at Microsoft has replied. (Hey is vice president of Microsoft Research Connections.) The response was encouraging, particularly in suggesting that Microsoft might “participate in a ‘crowd source’ project working with archivists to create a public spec of these old file formats.”

There’s usually some kind of software around that can read old formats. In this case a search doesn’t turn up a lot; there’s something called PowerPressed, which will wrap old PowerPoint files in a .exe application. It looks as if it should run on current Windows systems, but all I know is what that page says.

The situation shows the risk of using a format that isn’t publicly documented. Today this is less of a problem. I think it’s been shown that publishing format specs doesn’t lead to cannibalization of sales by competing software; the company that created the spec is in a position to produce the best implementation. The description of PDF is fully public, and Adobe still dominates the market for PDF software. Publishing the spec has just made the pie bigger. There’s still quite a lot of software that uses unpublished proprietary specs, though, and it’s risky to count on the long-term usability of the files it produces.

Expanding JHOVE

There are some significant improvements I’d like to make to JHOVE, to bring it up to date and improve its availability. The most important of these is to bring the PDF module up to version 1.7 (ISO 32000). I’ve done two releases since leaving Harvard, and download figures and feedback show there’s still significant interest. I’ve done that much to enhance my reputation, but I need to earn a living, and the PDF upgrade would be two or three weeks of solid work, so it has to be contingent on my getting compensated.

Features which look most important for JHOVE’s usual purposes include enhancements to Tagged PDF, Unicode file name references, new markup features, and dictionaries which support 3D artwork. I’m guessing there’s also interest in supporting PDF/A-2 and 3.

There’s probably no one institution right now willing to pay for the effort, but if it were possible to get a few hundred dollars from each of several institutions, it could work. One thought, of course, is Kickstarter, but I don’t know if institutional money can be funneled that way. Maybe it can and I just don’t know it. Alternatively, I can write application letters to the appropriate places, saying that I’ll do it if the amount pledged exceeds a certain threshold. No doubt it would take months for this to happen, but it seems possible in principle.

The idea could even be generalized to a library consortium for funding useful open source projects in return for support. Yes, I’m obviously thinking of how I can make money and I’m not apologizing for it. But the idea really could be useful. The SQLite consortium is a similar approach, focused on a single product.

Does anyone know of similar funding models that have worked, or alternative approaches that would achieve the result? Does the idea make sense or am I just blowing hot air?

Embracing the chaos of formats

We often think of formats in terms of specifications and standards, and this can be a useful thing. If you want to know exactly what the Latin-1 encoding is, you can look at the ISO-8859-1 standard and it will tell you. However, this isn’t always a reliable guide to what’s out there. Someone noticed that ISO-8859 reserves lots of control codes that are rarely used and put additional printing characters there. This got codified as well, as Windows 1252 (which Microsoft falsely claims as an ANSI standard), but there are many ad hoc or obscure encodings which are hard or impossible to find references for.
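
The difference is easy to see by decoding the same bytes both ways. The Python snippet below is a small illustration; the sample bytes are made up for the example. Bytes 0x93 and 0x94 fall in the 0x80-0x9F range, which Latin-1 decodes to control characters and Windows 1252 decodes to curly quotation marks.

```python
# The word "Hi" wrapped in bytes 0x93 and 0x94.
raw = bytes([0x93, 0x48, 0x69, 0x94])

as_latin1 = raw.decode("latin-1")   # control characters around "Hi"
as_cp1252 = raw.decode("cp1252")    # curly quotes around "Hi"

print([hex(ord(c)) for c in as_latin1])   # ['0x93', '0x48', '0x69', '0x94']
print([hex(ord(c)) for c in as_cp1252])   # ['0x201c', '0x48', '0x69', '0x201d']
```

A file labeled Latin-1 that contains bytes in that range was quite possibly written as Windows 1252, whatever its label says.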

Earth’s official authorities refused to grant the Klingons a place in Unicode for their characters; nonetheless, there is an unofficial registry that uses part of the Unicode Private Use Area for Klingon and other constructed scripts. Is it official Unicode? No. If you use code points F8D0-F8FF, will others recognize them as Klingon characters? Sometimes.
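
Software can at least tell that such characters are private use, even though it can't know what they mean. A minimal Python sketch, assuming the F8D0-F8FF range mentioned above; the function name and the wording of the messages are hypothetical.

```python
import unicodedata

# Range quoted above for Klingon in the unofficial registry. Unicode itself
# assigns no meaning to these code points, so this is only a convention.
KLINGON_START, KLINGON_END = 0xF8D0, 0xF8FF


def describe(ch):
    """Describe a single character, flagging private-use code points."""
    cp = ord(ch)
    if unicodedata.category(ch) == "Co":   # "Co" is the private-use category
        if KLINGON_START <= cp <= KLINGON_END:
            return f"U+{cp:04X}: private use, possibly Klingon by convention"
        return f"U+{cp:04X}: private use, meaning depends on private agreement"
    return f"U+{cp:04X}: {unicodedata.name(ch, 'unnamed')}"
```

For example, `describe("\uf8d0")` reports a possible Klingon character, while `describe("A")` returns `U+0041: LATIN CAPITAL LETTER A`.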

I’ve written about the TIFF situation before. The TIFF 6.0 spec is an insufficient guide to today’s real-life TIFF. You have to go through scattered tech notes to understand how it’s really used.

Understanding situations like these requires understanding that formats don’t flow unchanged from the minds of their designers to their implementation in the world’s computers. People change things to meet their needs. This makes them more useful for some purposes; at the same time, it makes them more confusing. The only alternative would be to create a format police force with the power to arrest and punish innovators.

The situation is analogous to natural language. You can insist that anything that disagrees with the grammar books is wrong, but if everybody talks that way, there ain’t no stoppin’ it. At the same time, the grammar books put a brake on unnecessary change, keeping the language from breaking down into a thousand mutually unintelligible dialects.

Digital preservationists have to look at the actual usage of formats, not just their official specifications. This doesn’t mean that they should accept every deviation, but they need to acknowledge changes that have become de facto standards. Context matters; an archive of nineteenth-century literature doesn’t have to be concerned with Klingon characters, but an archive of science fiction fan literature had better take them into account. Even an occasional scholarly paper might have a word or two in the pIqaD script.

This proliferation of variants is a big part of why centralized registries of format information don’t work. Not only is there too much information, it keeps changing. The best we can hope for is a coordinated way of finding our way through a chaotic body of information.

JHOVE 1.8

I hadn’t heard any bug reports since 1.8 beta, which hopefully means it’s working smoothly for everyone, so I’ve now released JHOVE 1.8. Let me know ASAP if anything’s broken.

Release notes:

GENERAL

1. If JHOVE doesn’t find a configuration file, it creates a default one.

2. Generics widely added to clean up the code.

3. build.xml files fixed to force compilation to Java 1.5.

4. Shell script “jhove” no longer makes you figure out where JAVA_HOME is.

PDF MODULE

1. Several errors in checking for PDF/A compliance were corrected. Aside from fixing some outright bugs, the Contents key for non-text Annotations is no longer checked, as its presence is only recommended and not required.

2. Improved code by Hökan Svenson is now used for finding the trailer.

TIFF MODULE

1. TIFF tag 700 (XMP) now accepts field type 7 (UNDEFINED) as well as 1 (BYTE), on the basis of Adobe’s XMP spec, part 3.

2. If compression scheme 6 is used in a file, an InfoMessage will report that the file uses deprecated compression.
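
The two TIFF rules above amount to something like the following Python sketch. It is an illustration only: the function and constant names are made up, and JHOVE's Java implementation is structured differently. Tag 259 is the standard TIFF Compression tag.

```python
XMP_TAG = 700
COMPRESSION_TAG = 259
TYPE_BYTE, TYPE_UNDEFINED = 1, 7
OLD_JPEG = 6   # compression scheme 6, old-style JPEG


def check_entry(tag, field_type, values):
    """Return (errors, infos) for one IFD entry."""
    errors, infos = [], []
    # XMP data may be stored as either BYTE or UNDEFINED.
    if tag == XMP_TAG and field_type not in (TYPE_BYTE, TYPE_UNDEFINED):
        errors.append(f"XMP tag 700 has unexpected field type {field_type}")
    # Deprecated compression is reported as information, not as an error.
    if tag == COMPRESSION_TAG and OLD_JPEG in values:
        infos.append("File uses deprecated compression scheme 6")
    return errors, infos
```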

WAVE MODULE

1. The Originator Reference property, found in the Broadcast Wave Extension (BEXT) chunk, is now reported.
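
For anyone curious where that property lives, here is a minimal Python sketch that pulls it out of the data portion of a bext chunk. It assumes the field layout from the Broadcast Wave specification (EBU Tech 3285), in which the chunk data begins with a 256-byte Description, a 32-byte Originator, and a 32-byte OriginatorReference. The function name is hypothetical, and locating the chunk within the WAVE file is left out.

```python
def originator_reference(bext_payload):
    """Extract OriginatorReference from the data portion of a bext chunk."""
    # Assumed layout at the start of the chunk data:
    #   Description          256 bytes
    #   Originator            32 bytes
    #   OriginatorReference   32 bytes
    start = 256 + 32
    field = bext_payload[start:start + 32]
    if len(field) < 32:
        raise ValueError("bext chunk too short to hold OriginatorReference")
    # The field is ASCII text; keep everything before the first NUL byte.
    return field.split(b"\x00", 1)[0].decode("ascii", errors="replace")
```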

“Just solve the problem” month begins

Today is the start of a month which some digital preservationists have declared “Just Solve the Problem” month. I’ve already expressed a mixture of skepticism and hope for this; throwing resources pell-mell at a computer problem rarely works, but some good is bound to come of the effort. We will not come out of November with “the problem” solved, but there will be new resources, such as this page of links to format information. (This blog is included in the list.)

I’m working on a list of plain text formats, expanding on my earlier post on the subject. This will appear on garymcgath.com, hopefully within the next week. Also, I’ve started a page on the wiki on tools, with a relevant subset of the list on my own site, restricted to locally runnable applications.

Between this and the CURATEcamp hackathon on November 16, lots of interesting stuff is happening in preservation this month.

“Just solve the problem”

Running concurrently with National Novel Writing Month (aka NaNoWriMo) is “Just Solve the Problem,” an effort to get lots of people to attack the “formats problem” for 30 days.

Here’s “the problem,” slightly expurgated to avoid triggering nannyware:

In the last couple centuries, we’ve created a number of self-encapsulated data sets, or “files”. Be they letters, programs, tapes, stamped foil, piano rolls, you name it. And while many of those data sets are self-evident, a ****-ton are not. They’re obscure. They’re weird. And worst of all, many of them are the vital link to scores of historical information.

First thought: That’s not a statement of a solvable problem. It’s a statement of a situation which gives rise to many different problems. Still, throwing in some of my efforts can lead to professional contacts and maybe even a paying contract, and it’s the kind of thing I’d be doing anyway, so I’ve signed up for the wiki.

Extra points to anyone who can write a novel about the formats problem in 30 days.