The future of e-book formats

An article with some interesting thoughts: “Will There Ever Be A Universal, MP3-Like Standard For E-Books?”

Personally, I’d say PDF (not Epub) is to e-books what MP3 is to music files: a widely adopted, universally recognized format that no one’s entirely happy with but that satisfies most people’s needs.

Undocumented “open” formats

Recently I learned that I can’t upgrade to a current version of Finale Allegro, a music entry program, except by getting the very expensive full version or taking a step downward to PrintMusic. Since I don’t want to lose all my files when some “upgrade” makes Allegro stop working, I’ve been looking for alternatives. MuseScore has its attractions; it’s open source, powerful, and generally well regarded. But I ran across this discussion on the MuseScore forum, which has me just a bit worried. According to “Thomas,” whose user ID is 1 and who therefore probably speaks with authority, “As the MuseScore format is still being shaped on a daily basis, we haven’t put any effort yet to create a schema.”

This doesn’t encourage me to use MuseScore. Even though it’s an “open” application, its format isn’t open in any meaningful sense. You can download the code and reverse-engineer it, of course, but it’s going to change in the next version. While I’m sure the developers will try not to break files created with earlier versions, there’s no guarantee they’ll succeed, and they’re likely to be especially careless about compatibility with files that are more than a few versions old.

You can export files to MusicXML, which is standardized, but in trying this out I came upon a disturbing bug. If I edit the file and save the changes, they’re saved not to the .xml file but to a .mscz file, MuseScore’s native format. If there’s already an older file with that name, it gets overwritten without warning.

The dichotomy between “open” and “proprietary” formats is the wrong one. Many formats are trademarked by a business and have copyrighted documentation, but if the documentation is public and the format is not encumbered by patents, anyone can use it. Formats which are created by open-source code but are undocumented and subject to change are effectively closed formats.

This post grew, in part, from my thoughts on avoiding data loss due to format obsolescence, which is the topic of this week’s post on Files That Last.

The HTML5 “sarcasm” tag

In the November 5 Editor’s Draft of HTML5: A vocabulary and associated APIs for HTML and XHTML, there is a curious reference to the “sarcasm” tag.

8.2.5.4.7 The “in body” insertion mode

When the user agent is to apply the rules for the “in body” insertion mode, the user agent must handle the token as follows:

An end tag whose tag name is “sarcasm”

Take a deep breath, then act as described in the “any other end tag” entry below.

This is the only reference to the tag, so I guess only the closing </sarcasm> tag is allowed, not the opening <sarcasm> tag.

Perhaps this was a test to see if anyone’s actually reading?

The email jungle

In researching tomorrow’s post on email preservation on Files That Last, I came to appreciate more thoroughly how messy email formats are. RFC 4155, which defines “the ‘default’ mbox database format” (their quotes around “default”) and application/mbox MIME type, tells us that “The mbox database format is not documented in an authoritative specification, but instead exists as a well-known output format that is anecdotally documented, or which is only authoritatively documented for a specific platform or tool.”

Some versions may have eight-bit character data with the character encoding not explicitly specified, and possibly varying from one file creator to another. The format of email addresses isn’t specified. A short page on qmail.org, referenced from RFC 4155, discusses some of the variants, including mboxo, mboxrd, mboxcl, and mboxcl2. The differences may appear minor, but they’re sufficient that a parser that assumes one of the variants can fail when it encounters the others.
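To make the difference between two of these variants concrete, here is a minimal Python sketch of how body lines beginning with “From ” are commonly described as being quoted in mboxo and mboxrd. The function names are hypothetical, and the rules are the conventional ones as I understand them, not something taken from an authoritative specification (there isn’t one). It handles only the quoting step, not the rest of the mbox layout.

```python
import re

# Hypothetical helper names. The quoting rules are the commonly described
# mboxo and mboxrd conventions, not an authoritative specification.

def quote_mboxo(body: str) -> str:
    """mboxo: prefix '>' to body lines that start with 'From '."""
    return "\n".join(
        ">" + line if line.startswith("From ") else line
        for line in body.split("\n")
    )

def unquote_mboxo(body: str) -> str:
    """Strip one '>' from lines starting with '>From '. This is lossy:
    a line that was '>From ' in the original message is stripped too."""
    return "\n".join(
        line[1:] if line.startswith(">From ") else line
        for line in body.split("\n")
    )

_RD_QUOTE = re.compile(r"^>*From ")    # zero or more '>' and then 'From '
_RD_UNQUOTE = re.compile(r"^>+From ")  # one or more '>' and then 'From '

def quote_mboxrd(body: str) -> str:
    """mboxrd: also prefix '>' to lines that are already quoted."""
    return "\n".join(
        ">" + line if _RD_QUOTE.match(line) else line
        for line in body.split("\n")
    )

def unquote_mboxrd(body: str) -> str:
    """Strip exactly one '>' from quoted lines, reversing quote_mboxrd."""
    return "\n".join(
        line[1:] if _RD_UNQUOTE.match(line) else line
        for line in body.split("\n")
    )
```

With a body that contains both a line starting with “From ” and a line starting with “>From ”, the mboxrd functions round-trip the text exactly, while the mboxo pair returns a body in which the two lines can no longer be told apart. A reader that guesses the wrong variant silently alters message text.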

Then there’s the encoding issue. Most of the world has settled on MIME by now, but older archives (and perhaps some recent ones) may contain messages encoded with uuencode, BinHex, or AppleSingle. The last two are found mostly with mail that was sent from Macintosh clients, but uuencode was once widely used, and poorly standardized.
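One small illustration of the loose standardization of uuencode: encoders disagree on whether a zero value is written as a space or as a backtick. The sketch below uses Python’s standard binascii module, which can produce either style (the backtick keyword requires Python 3.7 or later). The two wrapper functions are hypothetical names of my own, and this covers only a single encoded line, not the “begin” and “end” framing around it.

```python
import binascii

def uu_line_variants(chunk: bytes) -> tuple[bytes, bytes]:
    """Encode one chunk (at most 45 bytes) in both zero-value conventions.

    Returns (space_style, backtick_style).
    """
    return binascii.b2a_uu(chunk), binascii.b2a_uu(chunk, backtick=True)

def uu_decode_line(line: bytes) -> bytes:
    """Decode a single uuencoded line; binascii accepts either convention."""
    return binascii.a2b_uu(line)
```

Encoding a few zero bytes both ways yields two different lines that decode to the same data. A decoder that accepts only one of the two styles would reject or corrupt files written in the other.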

An alternative email archiving format is the CERP XML schema. At a glance it appears to provide better structure than mbox, but it isn’t as widely supported.

Update: The FTL post is now available at “You HAD mail.”

New on “Files That Last”

Here’s the first post with “real content” on FTL: “Metadata: What’s it all about?”

Closed access at Harvard

Sorry about the off-topic post, but this is the best channel I have for reaching the academic world.

Whatever Robert Frost may have said, something there is that really loves a wall. Specifically, fear does. The fear that looks askance at every foreign-looking person, that puts fortifications on our borders, that sees only the danger in contact from others.

Locked gate at Harvard Yard

A small, non-violent (with perhaps an exception or two) mob assailed Harvard Yard last Thursday night, and Harvard gave in to fear. The gates were shut or put under guard for the night, which may well have been necessary. They’ve remained that way ever since. To get into Harvard Yard, you must show an ID or have an invitation. Today employees received an email giving the weekday and weekend schedules for the gates, suggesting this won’t go away quickly.

This is inconvenient for Harvard people and more so for others who have reason to visit. The tours of Harvard Yard are on hiatus. If you have an appointment or a conference, your host has to provide a list of the people attending so they can be allowed in. Lamont Library contains a repository of government records which is open to the public without an ID — but you can’t get to Lamont.

I don’t know how long this will go on. When vague fears drive a policy and no risk is considered small enough to ignore, there’s no reason ever to stop.

New blog: Files That Last

Today I’m launching a new tech blog, called “Files That Last.” As you might guess, its subject is digital preservation. Why do we need another preservation blog? Perhaps “we” don’t, if “we” means mostly people closely connected with libraries and archives, but it’s a topic that’s ripe for more attention from the general computer-tech community, as everyone relies increasingly on computer files for long-term memory. Its focus will be practical guidance. Since it’s a solo operation, I’ll be able to say things the Library of Congress really shouldn’t.

I’ll be running that blog on a more regular schedule than this one, with weekly posts. Please drop by, and if you like what you see please spread the word.

Adobe getting out of Flash for mobile

Steve Jobs gets a posthumous victory as Adobe will not be developing Flash for mobile devices past version 11. Adobe states that:

HTML5 is now universally supported on major mobile devices, in some cases exclusively. This makes HTML5 the best solution for creating and deploying content in the browser across mobile platforms. We are excited about this, and will continue our work with key players in the HTML community, including Google, Apple, Microsoft and RIM, to drive HTML5 innovation they can use to advance their mobile browsers.

Our future work with Flash on mobile devices will be focused on enabling Flash developers to package native apps with Adobe AIR for all the major app stores. We will no longer continue to develop Flash Player in the browser to work with new mobile device configurations (chipset, browser, OS version, etc.) following the upcoming release of Flash Player 11.1 for Android and BlackBerry PlayBook. We will of course continue to provide critical bug fixes and security updates for existing device configurations.