The PDF search problem

An article from the PDF Association points out the pitfalls in searching PDF documents. Even if a document has actual text in it, rather than being a scanned image, it might not hold the text in the natural character ordering. PDF is a format for rendering a document’s visible appearance, and it isn’t so good at holding semantic content. Chunks of text can be stored out of sequence as long as they render in the right place.

The article notes that tagged PDF is more easily searchable, since tagging is supposed to reflect the logical structure of the document. It suggests that search software can do better by using heuristics but doesn’t go into details. A plausible strategy would be to determine the position of text chunks on the page and treat visually adjacent chunks as logically sequential.
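Since the article leaves the heuristics unspecified, here is a minimal sketch of the position-based strategy just described. Everything in it is an assumption for illustration: the `TextChunk` record, the `reading_order` function, and the `line_tolerance` parameter are hypothetical names, and the chunks are assumed to have already been extracted with their page coordinates by some PDF library. It handles only a single column of left-to-right text; multi-column layouts, tables, and rotated text would need more work.

```python
from dataclasses import dataclass


@dataclass
class TextChunk:
    """Hypothetical record for one positioned run of text on a page."""
    x: float   # left edge, in PDF points
    y: float   # baseline; PDF's y axis grows upward from the page bottom
    text: str


def reading_order(chunks, line_tolerance=2.0):
    """Group chunks into lines by baseline, then read top-down, left-to-right.

    line_tolerance is an assumed fudge factor (in points) for deciding
    that two chunks sit on the same line.
    """
    lines = []  # each entry is a list of chunks sharing roughly one baseline
    # Highest y first, because the top of the page has the largest y value.
    for chunk in sorted(chunks, key=lambda c: -c.y):
        if lines and abs(lines[-1][0].y - chunk.y) <= line_tolerance:
            lines[-1].append(chunk)
        else:
            lines.append([chunk])
    # Within each line, visually adjacent chunks are treated as sequential.
    return "\n".join(
        " ".join(c.text for c in sorted(line, key=lambda c: c.x))
        for line in lines
    )


# Chunks stored out of sequence, as a PDF is allowed to do.
stored = [
    TextChunk(150, 700, "search problem"),
    TextChunk(72, 686, "Second line"),
    TextChunk(72, 700, "The PDF"),
]
print(reading_order(stored))
```

With the sample data, the chunks come back as “The PDF search problem” on the first line and “Second line” on the second, even though they were stored in a different order.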

An amusing way to tell how well suited a PDF is for searching is to ask your computer to speak it. A lot of reader software has this option, even if you never have a use for it. I just tried it on a flyer and found some issues with ordering. Aside from that, blanks to fill in were entered as a series of underscores, so the software started going “underscore-underscore-underscore-underscore-…” And then its pitch started rising. If you’ve seen any talking computers do this on the original Star Trek, you know it’s a bad sign and you should head for cover immediately. But I digress.

Documents that have a search problem also have accessibility and preservation problems. If search software has trouble determining the correct textual order, content extraction software, such as readers and format converters, will have the same trouble.

The article says “Don’t blame the documents.” This may be good advice for end users, just because blaming the documents won’t help, but really the documents are to blame. There are ways to create PDF documents that avoid most of the problems, and they should be used for any documents intended for long-term retention.

