PDF/A-2 ratified

This time it’s from the PDF/A Competence Center, so I’m pretty sure it’s real: On November 30, the committee for ISO 19005 met in Ottawa and ratified Part 2 of ISO 19005, aka PDF/A. PDF/A is a restricted profile of PDF designed to guarantee the long-term usability of conforming files.

The previous version, PDF/A-1, was based on PDF 1.4. The new part is based on ISO 32000-1, which is equivalent to PDF 1.7. Valid PDF/A-1 files are also valid under PDF/A-2.

ISO 19005-1:2005, or PDF/A-1, is available for purchase from ISO, but as of this writing the new part, which presumably will be ISO 19005-2:2010, isn’t being offered online yet.

I can’t make any promises about when JHOVE will support PDF/A-2, if ever. Any work I do on it is on my own time. Of course, if someone else wants to run with it, the source is there and I can answer questions.

Misadventures in XML

Around 6 PM yesterday, our SMIL file delivery broke. At first I figured it for a database connection problem, but the log entries were atypical. I soon determined that retrieval of the SMIL DTD was regularly failing. Most requests would get an error, and those that did succeed took over a minute.

There’s a basic flaw in XML DTDs and schemas (collectively called grammars). They’re identified by a URL, and by default any parser that validates a document against its grammar retrieves the grammar from that URL. For popular grammars, that means a lot of traffic. We’ve run into that problem with the JHOVE configuration schema, and that’s nowhere near the traffic a really popular schema must generate.

Knowing this, and also knowing that depending on an outside website’s staying up is a bad idea, we’ve made our own local copy of the SMIL DTD to reference. So I was extremely puzzled about why access to it had become so terrible. After much headscratching, I discovered a bug in the code that kept the redirection to the local DTD from working; we had been going to the official URL, which lives on w3.org, all along.
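For anyone facing the same thing: in Java, the standard way to redirect a DTD lookup to a local copy is a SAX EntityResolver. Here’s a minimal sketch of the idea, not our actual code; the resource path and class name are made up for illustration.

```java
import java.io.InputStream;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

// Minimal sketch: answer requests for the SMIL DTD from a local copy
// instead of letting the parser fetch it from w3.org.
public class LocalSmilDtdResolver implements EntityResolver {
    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        if (systemId != null && systemId.endsWith("SMIL20.dtd")) {
            // "/dtds/SMIL20.dtd" is an illustrative classpath location.
            InputStream in = getClass().getResourceAsStream("/dtds/SMIL20.dtd");
            if (in != null) {
                InputSource src = new InputSource(in);
                src.setSystemId(systemId); // keep relative references resolvable
                return src;
            }
        }
        return null; // null means fall back to default resolution (the remote URL)
    }
}
```

You hand this to the parser with `reader.setEntityResolver(new LocalSmilDtdResolver())`; the bug in our case was in the code that was supposed to do that wiring.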

Presumably W3C is constantly hammered by requests for grammars which it originates, and presumably it’s fighting back by greatly lowering the priority of the worst offenders. Its server wasn’t blocking the requests altogether; that would have been easier to diagnose. The priority just got so low that most requests timed out.

Once I figured that out, I put in the fix to access the local DTD URL, and things are looking nicer now. Moving the fix to production will take a couple of days but should be routine.

The problem is inherent in XML: the definition of a grammar is tied to a specific Web location. Aside from the heavy traffic that location has to absorb, this means the longevity of the grammar is tied to the longevity of the URL. It takes extra effort to make a local copy, and anyone starting out isn’t likely to encounter throttling right away, so the law of least effort says most people won’t bother.

This got me wondering, as I started writing this post: why don’t parsers like Xerces cache grammars? It turns out that Xerces can cache grammars, though by default it doesn’t. As far as I can tell, this isn’t a well-known feature, and again the law of least effort implies that a lot of developers won’t take advantage of it. But it looks like a very useful thing. It really should be enabled by default, though I can understand why its implementers took the more cautious approach.
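From what I can tell from the Xerces documentation, caching is turned on by handing the parser a grammar-caching configuration. A rough sketch, assuming you’re using the Xerces SAXParser class directly rather than going through JAXP (the class and file names are just placeholders):

```java
import org.apache.xerces.parsers.SAXParser;
import org.apache.xerces.parsers.XMLGrammarCachingConfiguration;
import org.xml.sax.XMLReader;

public class CachingParseDemo {
    public static void main(String[] args) throws Exception {
        // A parser whose configuration caches grammars (DTDs and schemas)
        // after the first validating parse, so later parses in the same
        // process don't fetch them from their URLs again.
        XMLReader reader = new SAXParser(new XMLGrammarCachingConfiguration());
        reader.setFeature("http://xml.org/sax/features/validation", true);
        for (String file : args) {
            reader.parse(file); // only the first parse should hit the network
        }
    }
}
```

Of course this only helps within a single process; a local copy of the grammar is still the safer bet for anything long-running or long-lived.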

JHOVE2 goes to beta

The JHOVE2 team has announced a beta release:

This beta code release supports all the major technical objectives of the project, including a more sophisticated, modular architecture; signature-based file identification; policy-based assessment of objects; recursive characterization of objects comprising aggregate files and files arbitrarily nested in containers; and extensive configuration and reporting options. The release also continues to fill out the roster of supported formats, with modules for ICC color profiles, SGML, Shapefile, TIFF, UTF-8, WAVE, and XML.

The source code page provides the source as a Mercurial repository, or as a single download. The gzip download expands into a file called main-14e8a6102f63, and it isn’t at all obvious what to do with it. Chmoding it to be executable and running it doesn’t work. I’ve asked what this is supposed to be; I’ll update this post when I get a response.

Update: That’s a tarball. Adding the .tar extension and using tar -xvf works nicely.

ZIP standardization

The ZIP format is widely used, both by itself and as part of other widely used formats such as ODF, yet it’s never been standardized. Caroline Arms of the Library of Congress has informed the JHOVE2 list that there’s a new study group under ISO/IEC JTC1 SC34 WG1, which is looking into the standardization of ZIP. There is a Wiki for this study as well as a mailing list archive.

Membership in the group requires going through the appropriate national standards group.

Emoji

I was a little amazed and very amused to see that one of the new features of Unicode 6.0, released just last month, is the Emoji symbol set, which is reported to be widely used on Japanese cell phones. These whimsical symbols must open all kinds of possibilities for text messaging.

Unicode may not officially include Klingon characters, but it can still allow for fun.

PDF/A-2

PDF/A-2, according to a news item from Luratech, has been finalized and will be published as a standard in early 2011. (But see the comments.) Some more information (PDF) is available from the PDF/A Competence Center. PDF/A-1, which is based on PDF 1.4, will remain a valid standard. PDF/A-2 is based on ISO 32000-1, aka PDF 1.7.

CSS3: Threat or menace?

Lately I’ve been looking at CSS3 animations as a possible solution to a problem I’ve been dealing with. But after thinking about it, I’m getting more concerned: CSS animations? CSS is supposed to be about the layout of a page, not the creation of special effects. I’ve seen pages describing supposedly wonderful effects that can be created with CSS3. Fine, but what if you don’t want them?

JavaScript and Flash produce many annoying effects, introduced by designers who are effectively yelling “Hey, look how clever we are!” at you while you’re trying to concentrate on reading. You can turn off JavaScript and Flash and still get readable content, at least on many sites. But turn off CSS and most modern web pages turn into a messy jumble. CSS3 looks like a narcissistic web designer’s dream: a way to bombard you with special effects that you just can’t escape.

If you aren’t worried yet, consider this post on how to do Flash-like ads using only CSS3.

Addendum: The CSS3 working draft was recently updated.

Reinventing the stone tablet

It’s a basic premise of the digital preservation community that preservation will require ongoing effort over the years. Let an archive lie neglected for twenty or thirty years, and you might as well throw it away. No one will know how to plug in that piece of hardware. If they do, it’ll have stopped working. If it still works, its files will be in some long-forgotten format.

The trouble is, this is an untenable requirement over the long run. Institutions disappear. Wars happen. Governments are replaced. Budgets get cut. Projects get dropped. Organizational interests change. The contents of an archive may be deemed heretical or politically inconvenient. The expectation that over a period of centuries, institutions will actively preserve any given archive is a shaky one.

Information from past centuries has survived not by active maintenance, but by luck and durability. Much of the oldest information we have was carved into stone walls and tablets. It lay forgotten for centuries, till someone went digging for it. There were issues with the data format, to be sure; people worked for decades to figure out hieroglyphics and cuneiform, and no one’s cracked Linear A yet. But at least we have the data.

Preservation of digital data over a comparable time span requires storage with similar longevity. This is a very difficult problem. If it’s hard to figure out writing from three thousand years ago, how will people three thousand years from now make any sense of a 21st century storage device? But we have advantages. Global communication means that information doesn’t stay hidden in one corner of the world, where it can be wiped out. Today’s major languages aren’t likely to be totally forgotten. As long as enough information is passed down through each generation to allow deciphering of our stone tablets, people in future centuries will be able to extract their information.

What we don’t have is the tablets. Our best digital media are intended to last for decades, not centuries. Archivists should be looking into technologies that can really last and that are standardized, so that the knowledge of how to read them stands a good chance of surviving.

PDF and accessibility

PDF is both better and worse than its reputation for accessibility. That is, it’s worse than most people realize when it’s used with text-to-speech readers, but potentially much better than many visually impaired people suppose from their own experience. The reason for this paradox is that PDF was designed to reproduce appearance rather than to convey content, but modern versions have features which largely make up for this.

The worst case, of course, is the scanned document. Not only are you stuck with OCR for machine reading, but the text isn’t searchable. Scanning is a cheap solution when working from hardcopy originals, but it should be avoided if possible.

Normal PDF has a number of problems. There’s no necessary relationship between the order of elements in a file and the expected reading order. If an article is in multiple columns, the text ordering in the document might go back and forth between columns. If an article is “continued on page 46,” it can be hard to find the continuation.

Character encoding is based on the font, so there’s no general way to tell what character a numeric value represents. The same character may have different encodings within the same document. This means that reader software doesn’t know what to do with non-ASCII characters (and even ASCII isn’t guaranteed).

Adobe provided a fix for these problems with a feature introduced in PDF 1.4, known as Tagged PDF. All but seriously outdated PDF software supports at least PDF 1.4. That doesn’t mean using it is easy, though. Some software, such as Adobe’s InDesign, supports creation of Tagged PDF files, but you have to remember to turn the feature on, and you may need to edit the automatically created tags to reflect your document structure accurately. For some things, it can be a pain. I tried fixing up a songbook in InDesign with PDF tags, and realized I’d need to do a lot of work to get it right.

Tagging defines contiguous groups of text and ordering, offering a fix for the problem of multiple columns, sidebars, and footnotes. It allows language identification, so if you have a paragraph of German in the middle of an English text, the reader can switch languages if it supports them. Character codes in Tagged PDF are required to have an unambiguous mapping to Unicode.
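As a rough illustration of what “tagged” means at the file level, here’s a sketch of checking a document’s tagging claims with Apache PDFBox. This isn’t anything I rely on in the post; the calls are from PDFBox 2.x as best I recall them, and the class name is invented for the example.

```java
import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDDocumentCatalog;
import org.apache.pdfbox.pdmodel.documentinterchange.markedcontent.PDMarkInfo;

public class TaggedPdfCheck {
    public static void main(String[] args) throws Exception {
        // Report whether a PDF declares itself as Tagged PDF and whether
        // it actually carries a logical structure tree.
        try (PDDocument doc = PDDocument.load(new File(args[0]))) {
            PDDocumentCatalog catalog = doc.getDocumentCatalog();
            PDMarkInfo markInfo = catalog.getMarkInfo();
            boolean declaresTagged = markInfo != null && markInfo.isMarked();
            boolean hasStructureTree = catalog.getStructureTreeRoot() != null;
            System.out.println("Declares tagged: " + declaresTagged);
            System.out.println("Has structure tree: " + hasStructureTree);
        }
    }
}
```

A document that passes a check like this isn’t automatically accessible, of course; the tags still have to reflect the real reading order and structure.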

These features of Tagged PDF are obviously valuable to preservation as well as to visual access. PDF/A incorporates Tagged PDF.

It shouldn’t be assumed that because a document is in PDF, all problems with visual access are solved. But solutions are possible, with some extra effort.

Some useful links: