In researching tomorrow’s post on email preservation on Files That Last, I came to appreciate more thoroughly how messy email formats are. RFC 4155, which defines “the ‘default’ mbox database format” (their quotes around “default”) and application/mbox MIME type, tells us that “The mbox database format is not documented in an authoritative specification, but instead exists as a well-known output format that is anecdotally documented, or which is only authoritatively documented for a specific platform or tool.”
Some versions may have eight-bit character data with the character encoding not explicitly specified, and possibly varying from one file creator to another. The format of email addresses isn’t specified. A short page on qmail.org, referenced from RFC 4155, discusses some of the variants, including mboxo
, mboxrd
, mboxc1
, and mboxc12
. The differences may appear minor, but they’re sufficient that a parser that assumes one of the variants can fail when it encounters the others.
Then there’s the encoding issue. Most of the world has settled on MIME by now, but older archives (and perhaps some recent ones) may contain messages encoded with uuencode, BinHex, or Apple Single. The last two are found mostly with mail that was sent from Macintosh clients, but uuencode was once widely used — and poorly standardized.
An alternative email archiving format is the CERP XML schema. This looks at a glance as if it provides better structuring than MBOX, but it isn’t as widely supported.
Update: The FTL post is now available at “You HAD mail.”
The curse of HTML mail
It’s been most of a year since I last posted here, but I wanted to rant about HTML mail, and this is the right blog for it. People complain about the intrusiveness of Web tracking, but email tracking is even worse. I’ve noticed this especially after subscribing to a couple of Substack newsletters. They’re sent as HTML, and whenever possible, I click the link to the equivalent Web page, which is less intrusive. Every link in a Substack newsletter is a tracking link, with the odd exception of the link to the Substack page.
The links in a Substack newsletter don’t go to the target page but to a Substack redirection URL. Their purpose is to let Substack know about everything you click on. There are no terms or privacy policy in the email telling you what Substack uses the information for.
Continue reading →Leave a comment
Posted in commentary
Tagged email, HTML