PDF forever?

The PDF Association has an article on its site titled “What’s unique about PDF? and why PDF will live forever.” The article claims PDF is “a format of such flexibility and power that it will define the essential ‘electronic document’ concept forever.”

Forever is a long time. No one will think they mean that the last object left as the universe succumbs to entropy will be a disk with a PDF file, but what scale of “forever” gives sense to their claim? In a tweet responding to my skepticism, they offered a clarification:

We’re talking about at least a few centuries, unless a disaster destroys civilization first. After that, something totally different from computers as we know them may replace them — maybe artificial organic brains. Let’s take the year 2500 as a reasonable approximation to forever. Some people can still read Ancient Greek and Cuneiform, so it’s likely that someone will be able to decipher PDF files then.

This won’t be because of the flexibility and power of PDF, though; it will be because someone’s still interested in old documents. No one will be using PDF for anything but historical purposes then. Electronic stored-program computers have been around since the late 1940s; their entire history so far is about one human lifetime. PDF has been around for about 22 years, or the time it takes a newborn baby to grow into an adult. That’s not a lot of time compared to even a myopic definition of forever, and it’s already seen huge changes.

People tend to assume the world will always be the way it is now. Andy Ihnatko once told me that Amazon will last forever. But things change, computers faster than anything else. Imagine going back in time to 1990 and explaining Netflix or SMS messaging to anyone. Imagine telling them that in 25 years, computers with gigabytes of storage will fit in people’s pockets, cost a couple of hundred dollars, and be used mostly as telephones. We can’t guess what computers will be like in 2050, let alone 2500.

PDF’s main focus is on the appearance of documents. It has features for describing structure, but they’re secondary and not easy to use. In the future, the idea of fixed-layout documents may become obsolete. New formats may treat documents as abstract information, with software laying it out as needed, speaking it, or even translating it into a foreign language. EPUB is headed in that direction. Computers may someday stop using 8-bit bytes or discrete files, undermining the whole present-day concept of what a file format is.

Digital preservation efforts usually focus on keeping documents alive for a few decades. Reaching any further out than that is really hard, requiring wild guesses about what will and won’t change. The safest assumption is that everything we have today will be obsolete before very long, and we should figure out how to ease the task of people peering back into our primitive times.

One response to “PDF forever?”

  1. There’s nothing unique about PDF — except perhaps that it’s an electronic representation of a printed (hardcopy) page. Which is, when you think about it, a kind of silly idea.

    The other claimed advantages (mixing pages produced in different tools, platform independence, forms, client-side scripting with JavaScript, etc.) are equally well provided by:
    and probably several other formats I’m not aware of. Heck, you could probably store documents in Maker Interchange Format (MIF), the XML-like format used to move documents between different versions of Adobe FrameMaker.

    Granted, few non-proprietary programs for handling MIF exist at the moment and there is no published spec, but I’ve eyeballed MIF docs, and it wouldn’t take a lot to extend an XML-aware document reader to handle MIF, translate it to other formats, etc. And, if you want to preserve pages, MIF does that just as well as PDF, with the added advantage that you can make sense of a raw MIF document with just your eyes and brain.

    But really, how important is it that page formatting be retained? Aren’t we more interested in the document’s internal structure, as it relates to the meaning of the document?

    That is, we want to preserve the following concepts:
    1. The content itself: text, images, video, sound, and other multimedia content.
    2. The sequence of content: this paragraph comes before that image, which in turn is followed by that other image.
    3. Proximity. Paging is one way of representing proximity: we want “this image” to be “on the same page as” “that text” (which references “this image”). But what we _really_ want is for “this image” to be near enough to “that text” that the viewer can see both at the same time.

    HTML doesn’t quite represent that, only sequence, but we can approximate it pretty well: allow an image (or Flash object or…) to float, but force it to be nearby using the “clear” CSS property. I believe that EPUB can do somewhat better, but I’m not clear on the inside details of EPUB.

    And of course HTML/CSS can _also_ represent page breaks, with the CSS properties page-break-before, page-break-after, and page-break-inside.

    The assertion is silly on its face.
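
The float-and-clear and page-break ideas in the comment above can be made concrete with a small sketch. The Python below generates an HTML page in which each image floats beside the text that refers to it, and each text-plus-image unit asks not to be split across printed pages. The helper names (`render_unit`, `render_page`), the class names, and the exact style rules are illustrative assumptions, not anything taken from the post or the comment, and how closely a browser or print engine honors these hints varies.

```python
from html import escape

# Illustrative stylesheet (an assumption, not from the original post).
# float + clear keeps each figure beside its own text instead of stacking
# next to an earlier float; the page-break-* rules are hints for printing.
STYLE = """\
figure.near { float: right; clear: right; margin: 0 0 1em 1em; }
section.unit { page-break-inside: avoid; }
h1.chapter { page-break-before: always; }
"""


def render_unit(text: str, image_src: str, alt: str) -> str:
    """Return one HTML section pairing a paragraph with the image it cites."""
    return (
        '<section class="unit">\n'
        f'  <figure class="near"><img src="{escape(image_src)}" '
        f'alt="{escape(alt)}"></figure>\n'
        f"  <p>{escape(text)}</p>\n"
        "</section>"
    )


def render_page(title: str, units: list[tuple[str, str, str]]) -> str:
    """Return a complete HTML document built from (text, image_src, alt) units."""
    body = "\n".join(render_unit(*unit) for unit in units)
    return (
        "<!DOCTYPE html>\n<html>\n<head>\n"
        f"<title>{escape(title)}</title>\n"
        f"<style>\n{STYLE}</style>\n"
        "</head>\n<body>\n"
        f'<h1 class="chapter">{escape(title)}</h1>\n'
        f"{body}\n"
        "</body>\n</html>"
    )


if __name__ == "__main__":
    print(render_page("Demo", [("See the chart at right.", "chart.png", "A chart")]))
```

The content and its sequence live in the markup, while proximity and paging are expressed only as style hints, which is the separation the comment argues for. Current CSS also defines break-before, break-after, and break-inside as the generalized successors to the page-break-* properties.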