The email jungle

In researching tomorrow’s post on email preservation on Files That Last, I came to appreciate more thoroughly how messy email formats are. RFC 4155, which defines “the ‘default’ mbox database format” (their quotes around “default”) and application/mbox MIME type, tells us that “The mbox database format is not documented in an authoritative specification, but instead exists as a well-known output format that is anecdotally documented, or which is only authoritatively documented for a specific platform or tool.”

Some versions may have eight-bit character data with the character encoding not explicitly specified, and possibly varying from one file creator to another. The format of email addresses isn’t specified. A short page on qmail.org, referenced from RFC 4155, discusses some of the variants, including mboxo, mboxrd, mboxc1, and mboxc12. The differences may appear minor, but they’re sufficient that a parser that assumes one of the variants can fail when it encounters the others.

Then there’s the encoding issue. Most of the world has settled on MIME by now, but older archives (and perhaps some recent ones) may contain messages encoded with uuencode, BinHex, or Apple Single. The last two are found mostly with mail that was sent from Macintosh clients, but uuencode was once widely used — and poorly standardized.

An alternative email archiving format is the CERP XML schema. This looks at a glance as if it provides better structuring than MBOX, but it isn’t as widely supported.

Update: The FTL post is now available at “You HAD mail.”

Comments are closed.