Tag Archives: PDF

McCoy on the future of PDF

Bill McCoy’s article, “Takeaways on the Future of Documents: Report from the 2015 PDF Technical Conference,” offers some interesting thoughts on the future of PDF. I can’t find much to disagree with. PDF is in practice a format for reproducing a specific document appearance, and that’s becoming less important as the variety of computing devices increases. He makes a point I hadn’t thought of, that the “de facto interoperable PDF format” is well behind the latest specifications, which may explain why I haven’t seen complaints that JHOVE doesn’t know about ISO 32000 PDF!
Continue reading

PDF forever?

Distant galaxiesThe PDF Association has an article on its site titled “What’s unique about PDF? and why PDF will live forever.” The article claims PDF is “a format of such flexibility and power that it will define the essential ‘electronic document’ concept forever.”

Forever is a long time. No one will think they mean that the last object left as the universe succumbs to entropy will be a disk with a PDF file, but what scale of “forever” gives sense to their claim? In a tweet responding to my skepticism, they offered a clarification:

Continue reading

PDF 2.0

As most people who read this blog know, the development of PDF didn’t end with the ISO 32000 (aka PDF 1.7) specification. Adobe has published three extensions to the specification. These aren’t called PDF 1.8, but they amount to a post-ISO version.

The ISO TC 171/SC 2 technical committee is working on what will be called PDF 2.0. The jump in major revision number reflects the change in how releases are being managed but doesn’t seem to portend huge changes in the format. PDF is no longer just an Adobe product, though the company is still heavily involved in the spec’s continued development. According to the PDF Association, the biggest task right now is removing ambiguities. The specification’s language will shift from describing conforming readers and writers to describing a valid file. This certainly sounds like an improvement. The article mentions that several sections have been completely rewritten and reorganized. What’s interesting is that their chapter numbers have all been incremented by 4 over the PDF 1.7 specification. We can wonder what the four new chapters are.

Leonard Rosenthol gave a presentation on PDF 2.0 in 2013.

As with many complicated projects, PDF 2.0 has fallen behind its original schedule, which expected publication in 2013. The current target for publication is the middle of 2016.

veraPDF validator

The veraPDF Consortium has announced a public prototype of its PDF validation software.

It’s ultimately intended to be “the definitive open source, file-format validator for all parts and conformance levels of ISO 19005 (PDF/A)”; however, it’s “currently more a proof of concept than a usable file format validator.”

New open-source file validation project

The VeraPDF Consortium has announced that it has begun the prototyping phase for a new open-source validator of PDF/A. This is a piece of the PREFORMA (PREservation FORMAts) project; other branches will cover TIFF and audio-visual formats. Participants in VeraPDF are the Open Preservation Foundation, the PDF Association, the Digital Preservation Coalition, Dual Lab, and Keep Solutions.

Documents are available, including a functional and technical specification. It aims at being the “definitive” tool for determining if a PDF document conforms to the ISO 19005 requirements. It will separate the PDF parser from the higher-level validation, so a different parser can be plugged in.

Validating PDF is tough In JHOVE, I designed PDF/A validation as an afterthought to the PDF module. PDF/A requirements affect every level of the implementation, so that approach led to problems that never entirely went away. Making PDF/A validation a primary goal should help greatly, but having it sit on top of and independent from the PDF parser may introduce another form of the same problem.

PDF files can include components which are outside the spec, and PDF/A-3 permits their inclusion. This means that really validating PDF/A-3 is an open-ended task. Even in the earlier version of PDF/A, not everything that can be put into a file is covered by the PDF specification per se. The specification addresses this by providing for extensibility; add-ons can address these aspects as desired. In particular, the core validator won’t attempt thorough validation of fonts.

A Metadata Fixer will not just check documents for conformance, but in some cases will perform the necessary fixes to make a file PDF/A compliant.

JHOVE ignores the content streams, focusing only on the structure, so it could report a thoroughly broken file as well-formed and valid. JHOVE2 doesn’t list PDF in its modules. Analyzing the content stream data is a big task. In general, the project looks hugely ambitious, and not every ambitious digital preservation project has reached a successful end. If this one does, it will be a wonderful accomplishment.

Article on PDF/A validation with JHOVE

An article by Yvonne Friese does a good job of explaining the limitations of JHOVE in validating PDF/A. At the time that I wrote JHOVE, I wasn’t aware how few people had managed to write a PDF validator independent of Adobe’s code base; if I’d known, I might have been more intimidated. It’s a complex job, and adding PDF/A validation as an afterthought added to the problems. JHOVE validates only the file structure, not the content streams, so it can miss errors that make a file unusable. Finally, I’ve never updated JHOVE to PDF 1.7, so it doesn’t address PDF/A-2 or 3.

I do find the article flattering; it’s nice to know that even after all these years, “many memory institutions use JHOVE’s PDF module on a daily basis for digital long term archiving.” The Open Preservation Foundation is picking up JHOVE, and perhaps it will provide some badly needed updates.

The uses and abuses of PDF

PDF is a versatile format, but that doesn’t mean it should be used for everything. It’s a visual presentation format above all else. It lets you define a document with a specific appearance, with capabilities such as form filling and text searching. It’s not very good if you want a document that adapts to different device capabilities. If you need an editable format or a way to deliver structured data, there are much better alternatives.

When the Malaysian government released satellite data from the communications of Flight 370, which had disappeared, it delivered a PDF file. It looks very nice, but anyone who wants to extract and analyze the data has to do a lot of extra work. A spreadsheet or structured text (e.g., CSV) document is the right thing.

PDF can be used for e-books, but it’s not ideal. If you create normal sized pages, then on a phone they’ll either look tiny or require a lot of scrolling. Formats such as EPUB work better with a range of screen sizes.

Delivering text documents in PDF loses a lot of its value when the document is a scanned image rather than a text-based document. It can’t be searched, and people with visual disabilities can’t use text-to-speech. My condo association delivers its newsletters in scanned-image PDF. When I pointed out these problems at an owners’ meeting, I was told that the owners weren’t sophisticated enough to take advantage of those benefits. Our complex is a big one, and I’d be surprised if at least a few residents don’t use text-to-speech when they can. It’s not particularly hard to generate PDF files; scanning a finished document into a PDF seems like the hard way.

To maximize the usefulness of assistive technologies, you should create PDF/A if possible. It produces a slightly larger file, but it’s organized in a way that makes extraction of content easier and eliminates dependencies you might not have thought of.

Redacting PDFs is another tricky issue. If you simply black out an area, that’s the equivalent of gluing a piece of paper over it, and no harder to defeat. For advice on properly redacting documents, who better to turn to than the NSA? They may be a gang of criminals within the government, but they certainly know how to redact. It’s from 2006, though, so some of its advice could be dated.

There are lots of things you can do with PDF, but use it intelligently and where it’s appropriate.

A PDF question

A while back, I posted a question on superuser.com about a PDF issue that’s causing problems in JHOVE. So far it hasn’t gotten any answers, so I’m signal-boosting my own question here. Here’s what I asked:

The JHOVE parser for PDF, which I maintain, will sometimes find a non-dictionary object in a PDF’s Annots array. According to section 8.4.1 of the PDF spec, the Annots array holds “an array of annotation dictionaries.” In the case that I’m looking at right now, there’s a keyword of “Annot” instead of a dictionary. Is this an invalid PDF file, or is there a subtlety in the spec which I’ve overlooked?

Answering on stackoverflow.com is best, so other people can see the answer, but if you prefer to answer here, I’ll post or summarize any useful response, with attribution, as an answer over there.

When is a PDF not a PDF?

Yesterday I was doing some experiments with Adobe Illustrator. According to some web sites, The CS5 version saves its files as PDF, though with the extension .AI. When you save a file, though, the options dialog has a checkbox labeled “Create PDF Compatible File.” I unchecked it and saved the file, then opened it in JHOVE. JHOVE says it’s perfectly good PDF — indeed, PDF/A. Then I tried opening it in Preview, and this is what it looked like:

File says over and over that it was saved without PDF content

If you don’t actually look at the file but trust the mere fact that it’s a PDF, you might put it into a repository and find out later on that it’s worthless as a PDF. What’s happening is that PDF can embed any kind of content, and this one embeds its native PGF data. Any PDF reader can open the file, but only an application that understands PGF can use its actual content. Anyone putting PDF into a repository should be aware of this risk.

It’s outside the scope of JHOVE to check whether embedded content is acceptable to PDF/A, so the claim that it’s correct PDF/A is probably spurious. It is, however, definitely legal PDF.

This type of situation helps to show why PDF/A-3 is a bad idea.

PDF/A-3

The latest version of PDF/A, a subset of PDF suitable for long-term archiving, is now available as ISO standard 19005-3:2012. According to the PDF/A Association Newsletter, “there is only one new feature with PDF/A-3, namely that any source format can be embedded in a PDF/A file.”

This strikes me as a really bad idea. The whole point of PDF/A is to restrict content to a known, self-contained set of options. The new version provides a back door that allows literally anything. The intent, according to the article, is to let archivists save documents in their original format as well as their PDF representation. Certainly saving the originals is a good archiving practice, but it should be done in an archival package, not in a PDF format designed for archiving.

Mission creep afflicts projects of all kinds, and this is a case in point.