
Reinventing the stone tablet

It’s a basic premise of the digital preservation community that preservation will require ongoing effort over the years. Let an archive lie neglected for twenty or thirty years, and you might as well throw it away. No one will know how to plug in that piece of hardware. If they do, it’ll have stopped working. If it still works, its files will be in some long-forgotten format.

The trouble is, this is an untenable requirement over the long run. Institutions disappear. Wars happen. Governments are replaced. Budgets get cut. Projects get dropped. Organizational interests change. The contents of an archive may be deemed heretical or politically inconvenient. The expectation that over a period of centuries, institutions will actively preserve any given archive is a shaky one.

Information from past centuries has survived not by active maintenance, but by luck and durability. Much of the oldest information we have was carved into stone walls and tablets. It lay forgotten for centuries, till someone went digging for it. There were issues with the data format, to be sure; people worked for decades to figure out hieroglyphics and cuneiform, and no one’s cracked Linear A yet. But at least we have the data.

Preservation of digital data over a comparable time span requires storage with similar longevity. This is a very difficult problem. If it’s hard to figure out writing from three thousand years ago, how will people three thousand years from now make any sense of a 21st century storage device? But we have advantages. Global communication means that information doesn’t stay hidden in one corner of the world, where it can be wiped out. Today’s major languages aren’t likely to be totally forgotten. As long as enough information is passed down through each generation to allow deciphering of our stone tablets, people in future centuries will be able to extract their information.

What we don’t have is the tablets. Our best digital media are intended to last for decades, not centuries. Archivists should be looking into technologies that can truly last, and that are standardized well enough that the knowledge of how to read them stands a good chance of surviving.

PDF and accessibility

PDF is both better and worse than its reputation for accessibility. That is, it’s worse than most people realize when it’s used with text-to-speech readers, but potentially much better than many visually impaired people suppose from their own experience. The reason for this paradox is that PDF was designed to present appearance rather than content, but modern versions have features which largely make up for this.

The worst case, of course, is the scanned document. Not only does it force machine readers to fall back on OCR, but the document isn’t searchable as text. It’s a cheap solution when working from hardcopy originals, but should be avoided if possible.
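As a rough illustration of what “stuck with OCR” means in practice, here’s a minimal sketch in Python, assuming the pytesseract and Pillow libraries; the filename is hypothetical:

    # Minimal OCR sketch, assuming pytesseract and Pillow are installed
    # and the Tesseract engine is available on the system.
    from PIL import Image
    import pytesseract

    # "scanned_page.png" is a hypothetical raster scan of one page.
    page_image = Image.open("scanned_page.png")

    # OCR yields only an approximation of the text; accuracy depends
    # heavily on scan quality, so this is no substitute for a PDF that
    # carries real text.
    print(pytesseract.image_to_string(page_image))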

Normal PDF has a number of problems. There’s no necessary relationship between the order of elements in a file and the expected reading order. If an article is in multiple columns, the text ordering in the document might go back and forth between columns. If an article is “continued on page 46,” it can be hard to find the continuation.
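You can see this for yourself by dumping a page’s text-showing operators in raw content-stream order. Here’s a sketch using the pikepdf library (the filename is hypothetical); on a multi-column page, the strings often come out in file order rather than reading order:

    # Sketch: print text-showing operators in raw content-stream order,
    # assuming the pikepdf library; "article.pdf" is a hypothetical file.
    import pikepdf

    with pikepdf.open("article.pdf") as pdf:
        page = pdf.pages[0]
        for operands, operator in pikepdf.parse_content_stream(page):
            # Tj and TJ are PDF's text-showing operators; the order in
            # which they appear need not match the visual reading order.
            if operator in (pikepdf.Operator("Tj"), pikepdf.Operator("TJ")):
                print(operands)

Note that the strings printed here are still in each font’s own encoding, which is exactly the next problem.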

Character encoding is based on the font, so there’s no general way to tell what character a numeric value represents. The same character may have different encodings within the same document. This means that reader software doesn’t know what to do with non-ASCII characters (and even ASCII isn’t guaranteed).
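Whether a reader can map those font-specific codes back to real characters depends on each font carrying a /ToUnicode table. Here’s a sketch, again assuming pikepdf and a hypothetical filename, that reports which fonts on each page lack one:

    # Sketch: report fonts that lack a /ToUnicode CMap, assuming pikepdf;
    # "article.pdf" is a hypothetical file. Without /ToUnicode, there is
    # no general way to map a font's character codes back to Unicode.
    import pikepdf

    with pikepdf.open("article.pdf") as pdf:
        for page_number, page in enumerate(pdf.pages, start=1):
            # Assumes the fonts sit directly in the page's /Resources
            # (they can also be inherited from the page tree).
            fonts = page.Resources.get("/Font", pikepdf.Dictionary())
            for name, font in fonts.items():
                status = "has" if "/ToUnicode" in font else "lacks"
                print(f"page {page_number}: font {name} {status} /ToUnicode")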

Adobe provided a fix for these problems with a new feature in PDF 1.4, known as Tagged PDF. All but seriously outdated PDF software supports at least version 1.4. This doesn’t mean using it is easy, though. Some software, such as Adobe’s InDesign, supports creation of Tagged PDF files, but you have to remember to turn the feature on, and you may need to edit the automatically created tags to reflect your document structure accurately. For some things, it can be a pain. I tried fixing up a songbook in InDesign with PDF tags, and realized I’d need to do a lot of work to get it right.

Tagging defines contiguous groups of text and ordering, offering a fix for the problem of multiple columns, sidebars, and footnotes. It allows language identification, so if you have a paragraph of German in the middle of an English text, the reader can switch languages if it supports them. Character codes in Tagged PDF are required to have an unambiguous mapping to Unicode.
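A quick way to check whether a document even claims to be tagged is to look for a /MarkInfo dictionary with /Marked set to true, plus a /StructTreeRoot, in the document catalog. A sketch along the same lines as above:

    # Sketch: check whether a PDF declares itself as Tagged PDF, assuming
    # pikepdf; "article.pdf" is a hypothetical file.
    import pikepdf

    with pikepdf.open("article.pdf") as pdf:
        catalog = pdf.Root
        mark_info = catalog.get("/MarkInfo")
        marked = bool(mark_info.get("/Marked", False)) if mark_info else False
        print("Marked as tagged:", marked)
        print("Structure tree present:", "/StructTreeRoot" in catalog)
        # /Lang, if present, gives the document-wide default language.
        print("Default language:", catalog.get("/Lang"))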

These features of Tagged PDF are obviously valuable to preservation as well as to visual access. PDF/A incorporates Tagged PDF; the stricter Level A conformance (as in PDF/A-1a) actually requires it.
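A file’s claimed PDF/A conformance lives in its XMP metadata, so it can be read the same way. A claim is only a claim, though; actual validation needs a dedicated tool such as veraPDF.

    # Sketch: read a file's claimed PDF/A conformance from XMP metadata,
    # assuming pikepdf; "article.pdf" is a hypothetical file.
    import pikepdf

    with pikepdf.open("article.pdf") as pdf:
        with pdf.open_metadata() as meta:
            part = meta.get("pdfaid:part")                # e.g. "1", "2", "3"
            conformance = meta.get("pdfaid:conformance")  # e.g. "A" or "B"
    if part:
        print(f"Claims PDF/A-{part}{(conformance or '').lower()}")
    else:
        print("No PDF/A claim in the metadata")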

It shouldn’t be assumed that because a document is in PDF, all problems with visual access are solved. But solutions are possible, with some extra effort.

Some useful links: