Monthly Archives: December 2011

The future of e-book formats

An article with some interesting thoughts: “Will There Ever Be A Universal, MP3-Like Standard For E-Books?”

Personally, I’d say PDF (not Epub) is to e-books what MP3 is to music files: A widely adopted, universally recognized format that no one’s entirely happy with but satisfies most people’s needs.

Undocumented “open” formats

Recently I learned that I can’t upgrade to a current version of Finale Allegro, a music entry program, except by getting the very expensive full version or taking a step downward to PrintMusic. Since I don’t want to lose all my files when some “upgrade” makes Allegro stop working, I’ve been looking for alternatives. MuseScore has its attractions; it’s open source, powerful, and generally well regarded. But I ran across this discussion on the MuseScore forum, which has me just a bit worried. According to “Thomas,” whose user ID is 1 and so probably speaks with authority, “As the MuseScore format is still being shaped on a daily basis, we haven’t put any effort yet to create a schema.”

This doesn’t encourage me to use MuseScore. Even though it’s an “open” application, its format isn’t open in any meaningful sense. You can download the code and reverse-engineer it, of course, but it’s going to change in the next version. While I’m sure the developers will try not to break files created with earlier versions, there’s no guarantee they’ll succeed, and they’re likely to be especially careless about compatibility with files that are more than a few versions old.

You can export files to MusicXML, which is standardized, but in trying this out I came upon a disturbing bug. If I edit the exported file and save the changes, they’re saved not to the .xml file but to an .mscz file, MuseScore’s native format. If there’s already an older file with that name, it gets overwritten without warning.

The dichotomy between “open” and “proprietary” formats is the wrong one. Many formats are trademarked by a business and have copyrighted documentation, but if the documentation is public and the format isn’t encumbered by patents, anyone can use them. Formats that are created by open-source code but are undocumented and subject to change are effectively closed formats.

This post grew, in part, from my thoughts on avoiding data loss due to format obsolescence, which is the topic of this week’s post on Files That Last.

The HTML5 “sarcasm” tag

In the November 5 Editor’s Draft of HTML5: A vocabulary and associated APIs for HTML and XHTML, there is a curious reference to the “sarcasm” tag:

The “in body” insertion mode

When the user agent is to apply the rules for the “in body” insertion mode, the user agent must handle the token as follows:

An end tag whose tag name is “sarcasm”

Take a deep breath, then act as described in the “any other end tag” entry below.

This is the only reference to the tag, so I guess only the closing </sarcasm> tag is allowed, not the opening <sarcasm> tag.

Perhaps this was a test to see if anyone’s actually reading?

The email jungle

In researching tomorrow’s post on email preservation on Files That Last, I came to appreciate more thoroughly how messy email formats are. RFC 4155, which defines “the ‘default’ mbox database format” (their quotes around “default”) and the application/mbox MIME type, tells us that “The mbox database format is not documented in an authoritative specification, but instead exists as a well-known output format that is anecdotally documented, or which is only authoritatively documented for a specific platform or tool.”

Some versions may have eight-bit character data with the character encoding not explicitly specified, and possibly varying from one file creator to another. The format of email addresses isn’t specified. A short page referenced from RFC 4155 discusses some of the variants, including mboxo, mboxrd, mboxcl, and mboxcl2. The differences may appear minor, but they’re sufficient that a parser that assumes one of the variants can fail when it encounters the others.
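To make the variant problem concrete, here is a minimal Python sketch of the “From ” quoting rules as I understand them for two of the variants, mboxo and mboxrd. The function names are my own, not from any library, and the sketch works on a list of body lines rather than on a real mbox file.

```python
import re

# Hypothetical helper names; these only illustrate the quoting rules.
_RD_ESCAPE = re.compile(r"^>*From ")
_RD_UNESCAPE = re.compile(r"^>+From ")


def escape_mboxo(lines):
    """mboxo: prefix '>' only to body lines that start with 'From '."""
    return [">" + ln if ln.startswith("From ") else ln for ln in lines]


def unescape_mboxo(lines):
    """Reverse mboxo quoting by stripping '>' from lines starting '>From '."""
    return [ln[1:] if ln.startswith(">From ") else ln for ln in lines]


def escape_mboxrd(lines):
    """mboxrd: prefix '>' to lines that are 'From ' after any run of '>'."""
    return [">" + ln if _RD_ESCAPE.match(ln) else ln for ln in lines]


def unescape_mboxrd(lines):
    """Reverse mboxrd quoting by stripping exactly one leading '>'."""
    return [ln[1:] if _RD_UNESCAPE.match(ln) else ln for ln in lines]


body = ["From here on, trouble.", ">From a quoted reply", "plain text"]

# mboxrd round-trips cleanly.
print(unescape_mboxrd(escape_mboxrd(body)) == body)

# mboxo cannot tell an escaped line from one that already began '>From '.
print(unescape_mboxo(escape_mboxo(body)) == body)

# A reader that assumes mboxo leaves a stray '>' in mboxrd data.
print(unescape_mboxo(escape_mboxrd(body)))
```

The second and third cases are the ones that matter for preservation: the text comes back slightly different from what was stored, and nothing in the file says which rule was used to write it.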

Then there’s the encoding issue. Most of the world has settled on MIME by now, but older archives (and perhaps some recent ones) may contain messages encoded with uuencode, BinHex, or AppleSingle. The last two are found mostly with mail that was sent from Macintosh clients, but uuencode was once widely used, and poorly standardized.
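As a rough illustration of what an archive tool has to look for, here is a small Python sketch that guesses whether a message body carries a uuencoded or BinHex attachment. It relies only on the conventional markers (a “begin” line with a file mode and name plus a closing “end” line for uuencode, and the usual banner line for BinHex). The function name is hypothetical, and this is a heuristic, not a decoder.

```python
import re

# Conventional uuencode framing: "begin <mode> <filename>" ... "end".
UU_BEGIN = re.compile(r"^begin [0-7]{3,4} \S.*$", re.MULTILINE)
UU_END = re.compile(r"^end\s*$", re.MULTILINE)

# BinHex 4.0 files conventionally start with this banner text.
BINHEX_BANNER = "(This file must be converted with BinHex"


def guess_legacy_encoding(body: str) -> str:
    """Return 'binhex', 'uuencode', or 'none' for a message body."""
    if BINHEX_BANNER in body:
        return "binhex"
    if UU_BEGIN.search(body) and UU_END.search(body):
        return "uuencode"
    return "none"


sample = "See attached.\nbegin 644 notes.txt\n#0V%T\n`\nend\n"
print(guess_legacy_encoding(sample))
print(guess_legacy_encoding("We begin at noon."))
```

Given how loosely uuencode was standardized, a check like this will miss some files and should only be treated as a first pass.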

An alternative email archiving format is the CERP XML schema. At a glance it appears to provide better structure than mbox, but it isn’t as widely supported.

Update: The FTL post is now available at “You HAD mail.”