Tag Archives: PDF

PDF 1.7 and beyond

A paradox from Euan Cochrane: PDF 1.7 may follow the ISO standard, but not all PDF 1.7 files follow the same standard.

PDF/A post on FTL

Today on Files That Last I have a post on “PDF/A for the long haul.” It’s directed at the end user or administrator, not at the formats geek or preservation specialist, but might be useful to link to when you’re explaining what PDF/A is good for.

LOC irony

The Library of Congress Digital Preservation Newsletter (latest issue, subscription page) has some very nice content, but it’s ironic that the newsletter is delivered with the nondescript file name of 201101.pdf and that (if JHOVE is right) it doesn’t conform to PDF/A. A PDF/A document can’t have external links, so the nonconformance is excusable; it’s the meaningless file name that bugs me more from a preservation standpoint.

I can’t find an editorial contact address on the newsletter to mention this to.

PDF/A-2 ratified

This time it’s from the PDF/A Competence Center, so I’m pretty sure it’s real: On November 30, the committee for ISO 19005 met in Ottawa and ratified Part 2 of ISO 19005, aka PDF/A. PDF/A is a restricted profile for PDF which is designed to guarantee long-term usability of conforming files.

The previous version, PDF/A-1, was based on PDF 1.4. PDF/A-2 is based on ISO 32000-1, which is equivalent to PDF 1.7. Valid PDF/A-1 files are also valid under PDF/A-2.
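To see which PDF version and which PDF/A part a file claims, you can look at its header and its XMP metadata. The following is only a rough sketch: the function names are made up for illustration, and it assumes the XMP packet is stored uncompressed and in an ASCII-compatible encoding. It reports what the file claims about itself, which is not the same as validating conformance.

```python
import re

def pdf_header_version(data: bytes):
    """Return (major, minor) from the %PDF-x.y header, or None if absent."""
    m = re.search(rb"%PDF-(\d)\.(\d)", data[:1024])
    return (int(m.group(1)), int(m.group(2))) if m else None

def claimed_pdfa_part(data: bytes):
    """Return the PDF/A part number declared in XMP (pdfaid:part), or None.

    Handles both the element form <pdfaid:part>2</pdfaid:part>
    and the attribute form pdfaid:part="2".
    """
    m = re.search(rb"pdfaid:part\s*(?:=\s*[\"']|>)\s*(\d)", data)
    return int(m.group(1)) if m else None

# Tiny hand-made sample, not a real PDF file.
sample = b"%PDF-1.7\n<pdfaid:part>2</pdfaid:part><pdfaid:conformance>B</pdfaid:conformance>"
print(pdf_header_version(sample), claimed_pdfa_part(sample))
```

A file that declares a part number here may still fail validation, so a tool like JHOVE remains the place to check actual conformance.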

ISO 19005-1:2005, or PDF/A-1, is available for purchase from ISO, but as of this writing the new one, which presumably will be ISO 19005-2:2010, isn’t being offered online yet.

I can’t make any promises about when JHOVE will support PDF/A-2, if ever. Any work I do on it is on my own time. Of course, if someone else wants to run with it, the source is there and I can answer questions.

PDF/A-2

PDF/A-2, according to a news item from Luratech, has been finalized and will be published as a standard in early 2011. (But see the comments.) Some more information (PDF) is available from the PDF/A Competence Center. PDF/A-1, which is based on PDF 1.4, will remain a valid standard. PDF/A-2 is based on ISO 32000-1, aka PDF 1.7.

PDF and accessibility

PDF is both better and worse than its reputation for accessibility. That is, it’s worse than most people realize when it’s used with text-to-speech readers, but potentially much better than many visually impaired people suppose from their own experience. The reason for this paradox is that PDF was designed to present appearance rather than content, but modern versions have features which largely make up for this.

The worst case, of course, is the scanned document. Not only are you stuck with OCR for machine reading, but the document isn’t searchable. Scanning is a cheap solution when working from hardcopy originals, but should be avoided if possible.

Normal PDF has a number of problems. There’s no necessary relationship between the order of elements in a file and the expected reading order. If an article is in multiple columns, the text ordering in the document might go back and forth between columns. If an article is “continued on page 46,” it can be hard to find the continuation.

Character encoding is based on the font, so there’s no general way to tell what character a numeric value represents. The same character may have different encodings within the same document. This means that reader software doesn’t know what to do with non-ASCII characters (and even ASCII isn’t guaranteed).
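A toy illustration of the problem, as a sketch only: the two font tables below are invented, standing in for the per-font code-to-Unicode information that a ToUnicode CMap supplies in a real file. The same byte values come out as different text depending on which font is active, and with no table at all the reader can only guess.

```python
# Hypothetical per-font tables mapping character codes to Unicode.
# Font names and mappings are made up for illustration.
FONT_MAPS = {
    "F1": {0x01: "H", 0x02: "i"},
    "F2": {0x01: "é", 0x02: "H"},
}

def decode_run(font: str, codes: bytes) -> str:
    """Decode a run of character codes using the table for the given font.

    Codes with no known mapping become U+FFFD (the replacement character).
    """
    table = FONT_MAPS.get(font, {})
    return "".join(table.get(code, "\ufffd") for code in codes)

print(decode_run("F1", b"\x01\x02"))  # codes 1, 2 under font F1
print(decode_run("F2", b"\x01\x02"))  # the same codes under font F2
print(decode_run("F3", b"\x01\x02"))  # a font with no mapping at all
```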

Adobe provided a fix to this problem with a new feature in PDF 1.4, known as Tagged PDF. All except seriously outdated PDF software supports at least 1.4. This doesn’t mean using it is easy, though. Some software, such as Adobe’s InDesign, supports creation of Tagged PDF files, but you have to remember to turn on the feature, and you may need to edit automatically created tags to reflect your document structure accurately. For some things, it can be a pain. I tried fixing up a songbook in InDesign with PDF tags, and realized I’d need to do a lot of work to get it right.

Tagging defines contiguous groups of text and ordering, offering a fix for the problem of multiple columns, sidebars, and footnotes. It allows language identification, so if you have a paragraph of German in the middle of an English text, the reader can switch languages if it supports them. Character codes in Tagged PDF are required to have an unambiguous mapping to Unicode.
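Here is a much-simplified sketch of how a reader might use tags to recover reading order and language. The `StructElem` class and the sample data are hypothetical; a real structure tree interleaves child elements and marked-content references and carries far more detail. The idea is just that fragments are drawn on the page in one order, while the tag tree lists them in the logical order.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Optional, Tuple

@dataclass
class StructElem:
    """Simplified stand-in for a Tagged PDF structure element."""
    tag: str                                         # e.g. "Document", "P"
    lang: Optional[str] = None                       # language override, if any
    mcids: List[int] = field(default_factory=list)   # marked-content IDs
    kids: List["StructElem"] = field(default_factory=list)

def reading_order(elem: StructElem, fragments: Dict[int, str],
                  inherited_lang: str = "en") -> Iterator[Tuple[str, str]]:
    """Walk the tag tree depth-first, yielding (language, text) pairs."""
    lang = elem.lang or inherited_lang
    for mcid in elem.mcids:
        yield lang, fragments[mcid]
    for kid in elem.kids:
        yield from reading_order(kid, fragments, lang)

# Fragments keyed by marked-content ID, numbered in drawing order,
# which zigzags between the two columns.
fragments = {
    0: "Column one, top.",
    1: "Column two, top.",
    2: "Column one, bottom.",
    3: "Ein deutscher Absatz.",
}
tree = StructElem("Document", lang="en", kids=[
    StructElem("P", mcids=[0, 2]),
    StructElem("P", lang="de", mcids=[3]),
    StructElem("P", mcids=[1]),
])
for lang, text in reading_order(tree, fragments):
    print(lang, text)
```

A text-to-speech reader following the tree reads column one to the end before column two, and knows to switch voices for the German paragraph.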

These features of Tagged PDF are obviously valuable to preservation as well as to visual access. PDF/A incorporates Tagged PDF.

It shouldn’t be assumed that because a document is in PDF, all problems with visual access are solved. But solutions are possible, with some extra effort.

PDF exploit

A number of web sites are talking about a vulnerability in PDF. So far I haven’t found an exact description; anyone who explained it in detail would get the blame for everyone who uses it for malicious purposes. But the idea seems simple enough that anyone with the necessary technical knowledge (including me) could work it out given a little time. Apparently it’s a means by which the user can be presented with a legitimate-looking dialog and tricked into approving the launching of arbitrary executable code. The exploit can be added to an existing PDF without changing its appearance. JavaScript isn’t required. The vulnerability is in the format specification, not in a software bug. This is the really nasty kind of vulnerability that designers have nightmares about.

Here’s an article on CNET on the issue. There seems to be substantive discussion of the root of the problem here. I’ve got to get to work now. I’ll post something more later.

Update: OK, it’s not as bad as it sounded. Here’s the real account, which doesn’t say exactly how to do it, but gives enough clues that it’s not too hard to figure out the rest.

As you might have guessed if you know PDF, it uses the PDF Launch Action. The PDF specification actually doesn’t mandate any safety features in the Launch Action; if you implemented a PDF reader that automatically launched anything a PDF document told you to, you’d be within the spec. But Adobe Reader, exercising normal common sense, prompts the user for permission to launch. The trick is just that the text which describes the application to be launched can be modified. The user still gets a stern warning not to launch anything untrusted.
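If you want to check whether a file contains a Launch Action at all, a crude scan of the raw bytes is a starting point. This is a sketch under stated assumptions, not a reliable detector: the function name is invented, and it only sees action dictionaries stored in the clear, so anything inside a compressed object stream or an encrypted file will be missed. It does undo the `#xx` hex escapes that PDF allows in names, since `/L#61unch` means the same thing as `/Launch`.

```python
import re

# PDF names may spell any character as '#' plus two hex digits.
_HEX_ESCAPE = re.compile(rb"#([0-9A-Fa-f]{2})")
# An action dictionary declares its type as /S /Launch; the lookahead
# avoids matching longer names such as /Launcher.
_LAUNCH = re.compile(rb"/S\s*/Launch(?![A-Za-z0-9])")

def has_launch_action(data: bytes) -> bool:
    """Return True if the raw bytes appear to contain a Launch Action."""
    normalized = _HEX_ESCAPE.sub(
        lambda m: bytes([int(m.group(1), 16)]), data
    )
    return _LAUNCH.search(normalized) is not None

# Hand-written dictionary fragments, not complete PDF files.
print(has_launch_action(b"<< /Type /Action /S /Launch /F (cmd.exe) >>"))
print(has_launch_action(b"<< /S /URI /URI (http://example.com) >>"))
```

A hit only means the file asks to launch something; whether that is malicious, and what the accompanying dialog text says, still needs a human look.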

This trick will doubtless catch some people, as even simpler tricks do (just saying “don’t worry, it’s safe” in the document itself will trick a rather large number of fools). But it isn’t really anything to get hugely worried about.

PDF/A Seminar in Washington

A seminar on PDF/A will be held in Washington, DC, on March 26. The registration fee is $125. PDF/A is a restricted subset of PDF designed to promote long-term data viability for the purpose of preservation.

The press release contains a bizarre statement:

“At this time, the use of PDF/A is not mandatory in the United States,” said Betsy Fanning, Director, Standards and Member Services, AIIM, “however, that is changing.” “We are learning of draft legislation that is being debated that will make the use of PDF/A mandatory for preserving electronic documents.”

Congress has neither the right nor the technical competence to order us to use particular file formats. Hopefully this was an out-of-context quote about the government’s own use of PDF/A, though even there legislation requiring a specific subset of a specific format would be very strange.