In a GitHub comment, Johan van der Knijff noted how messy it is to determine the version of a PDF file. He looked at a file with the header characters “%PDF-1.8”. DROID says this isn’t a PDF file at all.
By a strict reading of the PDF specification, it isn’t. The version number has to be in the range 1.0 through 1.7. Being this strict seems like a bad idea, since it would mean format recognition software will fail to recognize any future versions of the format. (JHOVE doesn’t care what character comes after the period.)
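To make the difference concrete, here is a minimal sketch of strict versus lenient header checking. It is not how DROID or JHOVE actually work; the function names are made up for illustration, and it assumes the header sits at the very start of the file. Note that the lenient check here still expects a digit after the period, which is stricter than what JHOVE is described as doing above.

```python
import re

# Matches "%PDF-<digit>.<digit>" at the very start of the data.
HEADER_RE = re.compile(rb"%PDF-(\d)\.(\d)")


def pdf_header_version(data: bytes):
    """Return (major, minor) from the header, or None if there is no header."""
    match = HEADER_RE.match(data)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def is_pdf_strict(data: bytes) -> bool:
    """Strict reading: only versions 1.0 through 1.7 count as PDF."""
    version = pdf_header_version(data)
    return version is not None and version[0] == 1 and 0 <= version[1] <= 7


def is_pdf_lenient(data: bytes) -> bool:
    """Lenient reading: any well-formed header counts, whatever the version."""
    return pdf_header_version(data) is not None
```

With a file starting with “%PDF-1.8”, `is_pdf_strict` rejects it and `is_pdf_lenient` accepts it. A recognition tool that wants to survive future versions would probably report the claimed version and flag it as unknown rather than deny that the file is a PDF.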
A lot of applications claim they can display PDF files, but not all of them fully support the format. They won’t necessarily display all valid files correctly. The PDF Association has an article discussing this problem, with the main focus on the Microsoft Edge browser.
Edge offers only partial support for the JBIG2Decode and JPXDecode filters, which means some objects might not display. It doesn’t support certain types of shadings, so other objects could render incorrectly.
The strength of PDF is supposed to be that it will render the same way everywhere. You can blame Microsoft for not putting enough work into it, or Adobe for making the format too complex. I have enough experience with it to know it’s a seriously difficult format just to analyze, to say nothing of rendering. Is a format which presents such difficulties really the ideal for a universal document rendering format that people will count on far into the future?
Update: It gets worse. Take a look at this discussion of what’s in PDF.
Posted in News
Tagged Microsoft, PDF
The next big jump in PDF may finally happen this year. The PDF Association tells us that the spec for PDF 2.0 is “feature-complete” and will be available to the ISO PDF committee and members of the PDF Association in July. When this will turn into a public release still isn’t clear. A year ago the target was “mid-2016”; that seems unlikely now.
The specification will be ISO 32000-2. The current version of PDF, 1.7, is ISO 32000-1. More precisely, Adobe has published several extension levels to PDF 1.7. They’re a way of getting around having a version 1.8, which would be an admission that the ISO standard is outdated. Version 2.0 will get Adobe and ISO back in sync. Hopefully Adobe will publish the PDF spec for free, as it has in the past, so that it won’t be available just to people who pay for the ISO version. Currently an electronic copy of ISO 32000-1 costs 198 Swiss francs, or a bit more than $200.
Posted in News
Tagged PDF, standards
The PDF Association reminds us that we can use PDF forms for electronic submissions. It’s a useful feature, and I’ve filled out PDF forms now and then. However, one point seems wrong to me:
PDF/A, the archival subset of PDF technology, provides a means of ensuring the quality and usability of conforming PDF pages (including PDF forms) without any external dependencies. PDF/A offers implementers the confidence of knowing that conforming documents and forms will be readable 10, 20 or 200 years from now.
The problem is that PDF/A doesn’t allow form actions. ISO 19005-1 says, “Interactive form fields shall not perform actions of any type.” You can have a form and you can print it, but without being able to perform the submit-form action, it isn’t useful for digital submissions.
You could have an archival version of the form and a way to convert it to an interactive version, but this seems clumsy. Please let me know if I’ve missed something.
Update: There’s some kind of irony in the fact that the same day that I posted this, I received a print-only PDF form which I’ll now have to take to Staples to fax to the originator.
It must be a surprise to most people, but you can represent three-dimensional objects in PDF, in spite of its strictly two-dimensional imaging model. It turns out there are two ways to do it: the older U3D and the more modern PRC. What makes them possible is PDF’s annotation feature, which allows capabilities to be added to PDF, and the Acrobat 3D API. Full support of these features requires implementation of at least PDF 1.7 Extension Level 1, or to put it in application terms, Acrobat 8.1.
The PDF/E standard for engineering documents, aka ISO 24517, includes U3D but not PRC. A PDF/E-2 standard is currently in development and is expected to include PRC. PDF/E, like the other slashes of PDF, is a subset of the PDF standard (version 1.6), so obviously it’s possible to do 3D work without reference to it. It’s intended for cases where long-term retention or archiving is important. This suggests some affinity with PDF/A, which is specifically aimed at archive-quality documents, and the PDF Association, which is heavily involved in PDF/A, has recently started a PDF/E Competence Center. Oddly, the competence center says that PDF/E-1 “does not address 3D,” though other sources say PDF/E does reference U3D. Perhaps this is a matter of what really constitutes “addressing” 3D as opposed to just acknowledging it.
An article from the PDF Association points out the pitfalls in searching PDF documents. Even if a document has actual text in it, rather than being a scanned image, it might not hold the text in the natural character ordering. PDF is a format for rendering a document’s visible appearance, and it isn’t so good at holding semantic content. Chunks of text can be stored out of sequence as long as they render in the right place.
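A toy illustration of the ordering problem, assuming a hypothetical `Chunk` record that holds a text fragment and the page position where it is drawn. Nothing here reads a real PDF; the data is invented, and sorting by position is only the simplest possible heuristic for recovering reading order.

```python
from typing import List, NamedTuple


class Chunk(NamedTuple):
    x: float   # left edge of the fragment
    y: float   # baseline; in default PDF coordinates y grows upward
    text: str


# Invented fragments, listed in the order a file might happen to store them.
chunks = [
    Chunk(150.0, 700.0, "world"),
    Chunk(72.0, 680.0, "second line"),
    Chunk(72.0, 700.0, "Hello"),
]


def storage_order_text(chunks: List[Chunk]) -> str:
    """What a naive extractor sees: fragments in the order they are stored."""
    return " ".join(c.text for c in chunks)


def reading_order_text(chunks: List[Chunk]) -> str:
    """Sort top to bottom, then left to right, to approximate reading order."""
    ordered = sorted(chunks, key=lambda c: (-c.y, c.x))
    return " ".join(c.text for c in ordered)
```

Here `storage_order_text(chunks)` gives “world second line Hello”, so a search for “Hello world” fails even though the page displays exactly that. The positional sort repairs this case, but it would break on multiple columns, rotated text, or baselines that differ slightly within a line, which is why searching PDFs reliably is hard.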
Posted in commentary
3D printing is an exciting new technology, but the formats to choose from are an alphabet soup.
A call for “PDF 2.0” or an “Analytical File Format.” The description is vague, but it sounds like something analogous to the Semantic Web for documents.
BW64, a new RIFF-based audio format. The article describes it as a “3D” format, but more significantly it’s a metadata-rich interchange format that supports really big files.
And just for bitter laughs: “I need a ‘file’ format.”
Posted in Links
Tagged 3D, audio, PDF
The PDF Association and TWAIN Working Group have announced a partnership to develop a specification called PDF/Raster or PDF/R. It’s described as “a component of TWG’s TWAIN Direct™ initiative, a language/protocol that eliminates the need for users to install vendor specific drivers as communication between scanning devices and image capture software applications.”
Bill McCoy’s article, “Takeaways on the Future of Documents: Report from the 2015 PDF Technical Conference,” offers some interesting thoughts on the future of PDF. I can’t find much to disagree with. PDF is in practice a format for reproducing a specific document appearance, and that’s becoming less important as the variety of computing devices increases. He makes a point I hadn’t thought of, that the “de facto interoperable PDF format” is well behind the latest specifications, which may explain why I haven’t seen complaints that JHOVE doesn’t know about ISO 32000 PDF!
The PDF Association has an article on its site titled “What’s unique about PDF? and why PDF will live forever.” The article claims PDF is “a format of such flexibility and power that it will define the essential ‘electronic document’ concept forever.”
Forever is a long time. No one will think they mean that the last object left as the universe succumbs to entropy will be a disk with a PDF file, but what scale of “forever” gives sense to their claim? In a tweet responding to my skepticism, they offered a clarification: