Google Docs: Not a File Format

What’s the format of a Google Docs file? The question may not even be meaningful. According to Jenny Mitcham at the University of York, there is no such thing as a Google Docs file. What you see when you open a document is an assembly of information from a database. You can export it in various file formats, but the exported file isn’t identical to the Google document.

This makes Google Docs risky from a preservation standpoint. You can’t save a local backup of a document. If you lose your Google account, or if censorship in your country cuts you off from it, you lose all your documents.

Preserving and losing tax records

When you offer expert advice on something, such as digital preservation, you have to admit your own errors. I very nearly lost my 2016 tax return. When I tried to open it in TurboTax, the application just did nothing. I hadn’t exported it to a generally usable format. The TurboTax file format is proprietary and opaque.

File corruption and political corruption

When people who don’t understand file formats manipulate files in order to cover their tracks, they generally fail miserably. Slate magazine gives an entertaining case in point from the Trump scandals. The article says:

There are two types of people in this world: those who know how to convert PDFs into Word documents and those who are indicted for money laundering. Former Trump campaign chairman Paul Manafort is the second kind of person.

The PDF Association chimes in with additional technical details.

The future of TIFF

Is TIFF a legacy format?

The most recent version of the TIFF specification, 6.0, dates from 1992. Adobe updated it with three technical notes, the latest coming out in 2002. Since then there has been nothing.

The format is solid, but the past quarter-century has seen reasons to enhance it. BigTIFF is a variant of the format to accommodate larger files. It isn’t backward-compatible with TIFF, but the changes mostly concern data lengths and are easy to add to a TIFF interpreter. TIFF itself sits in a kind of limbo, since Adobe owns the spec but is no longer updating it. New tags have achieved consensus acceptance but don’t have official status. AWare Systems has a list of known tags but has no reliable way to say which ones are private and which are generally accepted. There’s no way to add a new compression or encryption algorithm, or any other new feature, and give it official status.
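
To show how small the BigTIFF difference is at the header level, here's a minimal sketch in Python. It relies on the published header layouts: classic TIFF carries version number 42 followed by a 4-byte offset to the first IFD, while BigTIFF carries 43, a 2-byte offset size (8), 2 reserved bytes, and an 8-byte offset. The function name `read_tiff_header` is my own, and this is an illustration, not a full parser.

```python
import struct

def read_tiff_header(header: bytes):
    """Return (flavor, first_ifd_offset), or None if this isn't a TIFF header.

    Hypothetical helper for illustration; it inspects only the first bytes.
    """
    if len(header) < 8:
        return None
    # Bytes 0-1 give the byte order: "II" little-endian, "MM" big-endian.
    if header[:2] == b"II":
        endian = "<"
    elif header[:2] == b"MM":
        endian = ">"
    else:
        return None
    (version,) = struct.unpack(endian + "H", header[2:4])
    if version == 42:
        # Classic TIFF: 4-byte offset to the first IFD.
        (offset,) = struct.unpack(endian + "I", header[4:8])
        return ("TIFF", offset)
    if version == 43:
        # BigTIFF: offset size (must be 8), reserved zero, 8-byte offset.
        if len(header) < 16:
            return None
        size, reserved, offset = struct.unpack(endian + "HHQ", header[4:16])
        if size != 8 or reserved != 0:
            return None
        return ("BigTIFF", offset)
    return None
```

An interpreter that already handles the first branch needs mostly wider integers to handle the second, which is the sense in which the changes "mostly concern data lengths."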

Can a .txt file contain malware?

The Internet Crime Complaint Center reported that some email messages are impersonating it in an attempt to get malware onto target computers. That’s clearly worth knowing about, but this part of the report is odd:

The unknown actors also attached a text document (.txt) to download, complete, and return to the perpetrators. The text file contained malware which was designed to further victimize the recipient.

It really shouldn’t be possible to run malware by opening a .txt file. It should just open in a text editor, with no execution of code. There’s no further explanation.

The inventor of binary encoding

Francis Bacon may not have written Shakespeare’s plays, but he wrote the Novum Organum, a foundational work of scientific methodology. He did something else almost as impressive: He invented the binary encoding of text. In the early 17th century he wrote:

First let all the Letters of the Alphabet, by transposition, be resolved into two Letters onely; for the transposition of two Letters by five placeings will be sufficient for 32. Differences, much more for 24. which is the number of the Alphabet . The example of such an Alphabet is on this wise.

By “transposition” he meant the use of two letters, such as A and B, as units of an encoded message. They could just as well have been 1 and 0, or any other pair. Using five letters gives 2^5, or 32, possible encodings. AAAAA signifies A, AAAAB is B, AAABA is C, and so on. He said there were 24 letters in the alphabet because in his time I and J were considered the same letter, as were U and V. It’s a very short hop from this encoding to Baudot, and just an extension to seven letters (bits) to get ASCII.
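
Here's a minimal Python sketch of the encoding as described above. The names `BACON_ALPHABET` and `bacon_encode` are mine, and the choice to skip characters outside the 24-letter alphabet is an assumption made for the example.

```python
# Bacon's 24-letter alphabet: I/J share a code, as do U/V.
BACON_ALPHABET = "ABCDEFGHIKLMNOPQRSTUWXYZ"

def bacon_encode(text, zero="A", one="B"):
    """Encode text as five-symbol groups built from two symbols."""
    groups = []
    for ch in text.upper():
        # Fold the letters that Bacon's alphabet treats as identical.
        if ch == "J":
            ch = "I"
        elif ch == "V":
            ch = "U"
        if ch not in BACON_ALPHABET:
            continue  # assumption: silently skip spaces, digits, punctuation
        index = BACON_ALPHABET.index(ch)
        bits = format(index, "05b")  # five binary digits, e.g. 2 -> "00010"
        groups.append("".join(one if b == "1" else zero for b in bits))
    return " ".join(groups)

print(bacon_encode("ABC"))            # AAAAA AAAAB AAABA
print(bacon_encode("ABC", "0", "1"))  # 00000 00001 00010
```

Swapping A and B for 0 and 1 makes the point that the pair of symbols is arbitrary; the five-place binary code is the invention.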

How Twitter renders GIF

I’ve long wondered how Twitter renders animated GIF files. I have Firefox set to disable GIF animation, and it works everywhere except on Twitter. Apart from that, the interface indicates something is going on beyond normal GIF display. It doesn’t animate till you hit the “Play” button, and then there’s apparently no way to stop it.

SMS messages and GSM encoding

Today I learned from a science fiction discussion group that SMS messages don’t use UTF-8. In fact, they don’t even use ASCII or an extension of it. It’s a case of old technology which has survived beyond its time.

The usual encoding for SMS text messages is GSM-7. Most cell phones use it, regardless of whether they’re on the GSM network or not. They generally support Unicode as well, but in a strange way.
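
The "7" in GSM-7 means each character code is seven bits wide. As a rough illustration, the sketch below packs 7-bit codes into bytes, least significant bits first, which is the packing described in 3GPP TS 23.038. The function name `pack_septets` is my own, and the input is assumed to be GSM-7 code values already. The usage line passes ASCII bytes only because the unaccented Latin letters and digits happen to have the same codes in both alphabets; that shortcut does not hold for characters such as `@` or `$`.

```python
def pack_septets(codes):
    """Pack 7-bit code values into bytes, low-order bits first."""
    packed = bytearray()
    acc = 0    # bit accumulator
    nbits = 0  # number of valid bits currently in acc
    for code in codes:
        if not 0 <= code < 128:
            raise ValueError("not a 7-bit code: %r" % (code,))
        acc |= code << nbits
        nbits += 7
        while nbits >= 8:
            packed.append(acc & 0xFF)
            acc >>= 8
            nbits -= 8
    if nbits:
        packed.append(acc & 0xFF)  # final partial byte, zero-padded
    return bytes(packed)

print(pack_septets(b"hello").hex())  # e8329bfd06
```

Eight characters fit in seven bytes this way, which is how 160 characters fit in 140 bytes.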

What are “positives” in format validation?

Articles about JHOVE, such as Good GIF Hunting, grab my attention for obvious reasons. This article talks about false positive and false negative results, and it got me thinking: What constitutes a “positive” result in file format validation? There are two ways to look at it:

  1. The default assumption is that the file is of a certain format, perhaps based on its extension, MIME type, or other metadata. The software sets out to see if it violates the format’s requirements. In that case, a positive result is that the file doesn’t conform to the requirements.
  2. The default assumption is that the file is just a collection of bytes. The software matches it against one or more sets of criteria. A positive result is that the file matches one of them.
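
The two views can be sketched in Python as follows. This is only an illustration: a signature check on the first few bytes stands in for real validation, which tools like JHOVE do far more thoroughly, and the names `SIGNATURES`, `violates_claimed_format`, and `identify` are hypothetical.

```python
# Well-known leading byte signatures for a few formats.
SIGNATURES = {
    "GIF": (b"GIF87a", b"GIF89a"),
    "PNG": (b"\x89PNG\r\n\x1a\n",),
    "TIFF": (b"II*\x00", b"MM\x00*"),
}

def violates_claimed_format(data: bytes, claimed: str) -> bool:
    """View 1: assume the claimed format; a positive means a violation."""
    return not data.startswith(SIGNATURES[claimed])

def identify(data: bytes):
    """View 2: assume only bytes; a positive means some format matched."""
    for name, signatures in SIGNATURES.items():
        if data.startswith(signatures):
            return name
    return None

sample = b"GIF89a" + b"\x00" * 10
print(violates_claimed_format(sample, "GIF"))  # False
print(identify(sample))                        # GIF
```

The same file gives a negative under the first view and a positive under the second, so "false positive" only means something once you say which view you're using.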


The Libtiff source code repository is now on Gitlab. The old CVS repository will be maintained for historical purposes but won’t get any updates.

One reason for choosing Gitlab rather than Github is that there’s already a libtiff repository on Github. The reasons it’s there aren’t clear, but it’s definitely not an official Libtiff repository.

The Libtiff homepage continues to be on