
File identification tools, part 2: file

A widely available file identification tool is simply called file. It comes with nearly all Linux and Unix systems, including Macintosh computers running OS X. Detailed “man page” documentation is available. It requires using the command line shell, but its basic usage is simple:

file [filename]

file starts by checking for some special cases, such as directories, empty files, and “special files” that aren’t really files but ways of referring to devices. Next it looks for “magic numbers,” identifiers near the beginning of the file that are (one hopes) unique to the format. If it doesn’t find a magic match, it tests whether the file looks like a text file, trying a variety of character encodings, including the ancient and obscure EBCDIC. Finally, if the file does look like text, file attempts to determine whether it’s in a known computer language (such as Java) or natural language (such as English). The identification of file types is generally good, but the language identification is very erratic.
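
To make that staged approach concrete, here’s a heavily simplified sketch in Python. This is not how file itself is implemented; the handful of hard-coded signatures and the crude text test just illustrate the same sequence of tests.

import os
import sys

# A few well-known signatures, for illustration only.
MAGIC = [
    (b"%PDF-", "PDF document"),
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"GIF87a", "GIF image"),
    (b"GIF89a", "GIF image"),
    (b"PK\x03\x04", "Zip archive"),
]

def identify(path):
    # Stage 1: special cases.
    if os.path.isdir(path):
        return "directory"
    if os.path.getsize(path) == 0:
        return "empty"
    # Stage 2: magic numbers near the start of the file.
    with open(path, "rb") as f:
        head = f.read(1024)
    for signature, name in MAGIC:
        if head.startswith(signature):
            return name
    # Stage 3: does it look like text? (The real tool tries many more
    # encodings, including EBCDIC.)
    for encoding in ("ascii", "utf-8"):
        try:
            head.decode(encoding)
            return encoding.upper() + " text"
        except UnicodeDecodeError:
            continue
    return "data"

if __name__ == "__main__":
    for path in sys.argv[1:]:
        print("%s: %s" % (path, identify(path)))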

The identification of magic numbers uses a set of magic files, and these vary among installations, so running the same version of file on different computers may produce different results. You can specify a custom set of magic files with the -m flag. If you want MIME information, you can specify --mime (which reports both type and encoding), --mime-type, or --mime-encoding. For example:

file --mime xyz.pdf

will tell you the MIME type and character encoding of xyz.pdf. If it really is a PDF file, the output will be something like

xyz.pdf: application/pdf; charset=binary

If instead you enter

file --mime-type xyz.pdf

you’ll get

xyz.pdf: application/pdf

If some tests aren’t working reliably on your files, you can use the -e option to suppress them. If you don’t trust the magic files, you can enter

file -e soft xyz.pdf

But then you’ll get the uninformative

xyz.pdf: data

The -k option tells file not to stop at the first match but to keep applying additional tests. I haven’t found any cases where this is useful, but it might help to identify some weird files. It can also slow down processing if you’re running file on a large number of files.
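
If you want to run file over a whole directory tree and collect the results, it’s easy to wrap it in a short script. Here’s a minimal Python sketch, assuming file is on your PATH; it uses the --brief option (which suppresses the filename prefix in the output) along with --mime-type.

import os
import subprocess
import sys

def mime_types(root):
    # Map each file path under root to the MIME type reported by file.
    results = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            out = subprocess.run(
                ["file", "--brief", "--mime-type", path],
                capture_output=True, text=True, check=False)
            results[path] = out.stdout.strip()
    return results

if __name__ == "__main__":
    for path, mime in sorted(mime_types(sys.argv[1]).items()):
        print("%s\t%s" % (path, mime))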

As with many other shell commands, you can type file --help to see all the options.

file can easily be fooled and won’t tell you whether a file is defective, but it’s a quick, convenient way to query the type of a file.

Windows has a command line tool called FTYPE that serves a roughly similar purpose, but it works from registered file extensions rather than examining file contents, and its syntax is completely different.

Next: DROID and PRONOM. To read this series from the beginning, start here.

File identification tools, part 1

This is the start of a series on software for file identification. I’ll be exploring as broad a range as I reasonably can within the blog format, covering a variety of uses. I’m most familiar with the tools for preservation and archiving, but I’ll also look at tools for the end user and at digital forensics (in the proper sense of the word, the resolution of controversies).

We have to start with what constitutes “identifying” a file. For our purposes here, it means at least identifying its type. It can also include determining its subtype and telling you whether it’s a valid instance of the type. You can choose from many options. The simplest approach is to look at the file’s extension and hope it isn’t a lie. A little better is to use software that looks for a “magic number.” This gives a better clue but doesn’t tell you if the file is actually usable. Many tools are available that will look more rigorously at the file. Generally the more thorough a tool is, the narrower the range of files it can identify.
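
As a rough illustration of the gap between those first two approaches, here’s a small Python sketch that compares an extension-based guess (using the standard mimetypes module) with a peek at the magic number. The two signatures are only examples, not a complete table, and neither approach says anything about whether the file is actually usable.

import mimetypes
import sys

# A couple of common signatures, for illustration only.
MAGIC = {
    b"%PDF-": "application/pdf",
    b"\x89PNG\r\n\x1a\n": "image/png",
}

def by_extension(path):
    guess, _encoding = mimetypes.guess_type(path)
    return guess or "unknown"

def by_magic(path):
    with open(path, "rb") as f:
        head = f.read(16)
    for signature, mime in MAGIC.items():
        if head.startswith(signature):
            return mime
    return "unknown"

if __name__ == "__main__":
    for path in sys.argv[1:]:
        print("%s: extension says %s, magic says %s"
              % (path, by_extension(path), by_magic(path)))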

Identification software can be too lax or too strict. If it’s too lax, it can give broken files, perhaps even malicious ones, its stamp of approval. If it’s too severe, it can reject files that deviate from the spec in harmless and commonly accepted ways. Some specifications are ambiguous, and an excessively strict checker might rely on an interpretation which others don’t follow. A format can have “dialects” which aren’t part of the official definition but are widely used. TIFF, to name one example, is open to all of these problems.

Some files can be ambiguous, corresponding to more than one format. Here’s a video with some head-exploding examples. It’s long but worth watching if you’re a format junkie.

The examples in the video may seem far-fetched, but there’s at least one commonly used format that has a dual identity: Adobe Illustrator files. Illustrator knows how to open a .ai file and get the application-specific data, but most non-Adobe applications will see it as a PDF file. Ambiguity can be a real problem when file readers are intentionally lax and try to “repair” a file. Different applications may read entirely different file types and content from the same file, or the same file may have different content on the screen and when printed. So even if an identification tool tells you correctly what the format is, that may not be the whole story. I don’t know of any tool that tries to identify multiple formats for the same file.

Knowing the version and subtype of a file can be important. When an application reads a file written in a newer version of the format than the application was designed for, it may fail unpredictably, and it’s likely to lose some information. Some applications limit their backward compatibility and may be unable to read old versions of a format. Subtypes can indicate a file’s suitability for purposes such as archiving and prepress (PDF/A and PDF/X, for example).
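
As a small example of version sniffing, the Python sketch below reads the version that a PDF file claims in its header (%PDF-1.4, %PDF-1.7, and so on). A file’s real conformance can be more complicated than its header admits, so treat this as an illustration rather than a validator.

import re
import sys

def pdf_header_version(path):
    # The PDF header begins with "%PDF-" followed by the version number.
    with open(path, "rb") as f:
        head = f.read(1024)
    match = re.search(rb"%PDF-(\d+\.\d+)", head)
    return match.group(1).decode("ascii") if match else None

if __name__ == "__main__":
    for path in sys.argv[1:]:
        print("%s: PDF header version %s" % (path, pdf_header_version(path)))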

I’ll use the tag “fident” for all posts in this series, to make it easy to grab them together.

Next: The shell file command line tool.

Dataliths vs. the digital dark age

Digital technology has allowed us to store more information at less cost than ever before. At the same time, it’s made this information very fragile in the long term. A book can sit in an abandoned building for centuries and still be readable. Writing carved in stone can last for thousands of years. The chances that your computer’s disk will be readable in a hundred years are poor, though. You’ll have to go to a museum for hardware and software to read it. Once you have all that, it probably won’t even spin up. If it does, the bits may be ruined. In five hundred years, its chance of readability will be essentially zero.

Archivists are aware of this, of course, and they emphasize the need for continual migration. Every couple of decades, at least, stored documents need to be moved to new media and perhaps updated to a new format. Digital copies, if made with reasonable precautions, are perfect. This approach means that documents can be preserved forever, provided the chain never breaks.

Fortunately, there doesn’t have to be just one chain. The LOCKSS (Lots of Copies Keep Stuff Safe) principle means that the same document can be stored in archives all over the world. As long as just one of them keeps propagating it, the document will survive.

Does this make us really safe from the prospect of a digital dark age? Will a substantial body of today’s knowledge and literature survive until humans evolve into something so different that it doesn’t matter any more? Not necessarily. To be really safe, information needs to be stored in a form that can survive long periods of neglect. We need dataliths.

Several scenarios could lead to neglect of electronic records for a generation or more. A global nuclear war could destroy major institutions, wreck electronic devices with EMPs, and force people to focus on staying alive. An asteroid strike or a supervolcano eruption could have a similar effect. Humanity might survive these things but take a century or more to return to a working technological society.

Less spectacularly, periods of intense international fear or attempts to manage the world economy might create an unfriendly climate for preserving records of the past. The world might go through a period of severe censorship. Lately religious barbarians have been sacking cities and destroying historical records that don’t fit with their doctrines. Barbarians generally burn themselves out quickly, but “enlightened” authorities can also decide that all “unenlightened” ideas should be banished for the good of us all. Prestigious institutions can be especially vulnerable to censorship because of their visibility and dependence on broad support. Even without legal prohibition, archival culture may shift to decide that some ideas aren’t worth preserving. Either way, it won’t be called censorship; it will be called “fair speech,” “fighting oppression,” “the right to be forgotten,” or some other euphemism that hasn’t yet lost credibility.

How great is the risk of these scenarios? Who can say? To calculate odds, you need repeatable causes, and the technological future will be a lot different from the comparatively low-tech past. But if we’re thinking on a span of thousands of years, we can’t dismiss it as negligible. Whatever may happen, the documents of the past are too valuable to be maintained only by their official guardians.

Hard copy will continue to be important. It’s also subject to most of the forms of loss I’ve mentioned, but some of it can survive for many years with no attention. As long as someone can understand the language it’s written in, or as long as its pictures remain recognizable, it has value. However, we can’t back away from digital storage and print everything we want to preserve. The advantages of bits are clear: easy reproduction and high storage density. This isn’t to say that archivists should abandon the strategy of storing documents with the best technology and migrating them regularly. In good times, that’s the most effective approach. But the bigger strategy should include insurance against the bad times, a form of storage that can survive neglect. Ideally it shouldn’t rest only with Harvard or the Library of Congress, but with many “guerrilla archivists” acting on their own.

This strategy requires a storage medium which is highly durable and relatively simple to read. It doesn’t have to push the highest edges of storage density. It should be the modern equivalent of the stone tablet, a datalith.

There are devices which tend in this direction. Millenniata claims to offer “forever storage” in its M-Disc. Allegedly it has been “proven to last 1,000 years,” though I wonder how they managed to start testing in the Middle Ages. A DVD uses a complicated format, though, so it may not be readable even if it physically lasts that long. Hitachi has been working on quartz glass data storage that could last for millions of years and be read with an optical microscope. This would be the real datalith. As long as some people still know today’s languages, pulling ASCII data out of it should be a very simple decoding task. Unfortunately, the medium isn’t commercially available yet. Others have worked on similar ideas, such as the Superman memory crystal. Ironically, that article, which proclaims “the first document which will likely survive the human race,” has a broken link to its primary source less than two years after its publication.

Hopefully datalith writers will be available before too long, and after a few years they won’t be outrageously expensive. The records they create will be an important part of the long-term preservation of knowledge.