Tag Archives: standards

OOXML: The good and the bad

An article by Markus Feilner presents a very critical view of Microsoft’s Office Open XML as it currently stands. There are three versions of OOXML: ECMA, Transitional, and Strict. All of them use the same file extensions, and there’s no easy way for a casual user to tell which variant a given document is in. If a Word document is created on one computer in the Strict format and then edited on another machine running an older version of Word, it may be silently downgraded to Transitional, with a resulting loss of metadata or other features.
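
For what it’s worth, there is a rough way to check which variant you have: unzip the .docx and look at the root namespace of word/document.xml. As I understand it, Transitional documents (and the original ECMA-376 edition) use namespaces under schemas.openxmlformats.org, while Strict uses namespaces under purl.oclc.org. Here’s a minimal sketch in Python; the namespace prefixes and the single-part check are my own simplification, not a full conformance test.

    import sys
    import zipfile
    import xml.etree.ElementTree as ET

    # Namespace prefixes that (as far as I can tell) distinguish the variants.
    STRICT_PREFIX = "http://purl.oclc.org/ooxml/"
    TRANSITIONAL_PREFIX = "http://schemas.openxmlformats.org/"

    def guess_variant(docx_path):
        """Guess Strict vs. Transitional from the main document part's root namespace."""
        with zipfile.ZipFile(docx_path) as z:
            root = ET.fromstring(z.read("word/document.xml"))
        # A namespaced tag looks like "{namespace-uri}document".
        ns = root.tag[1:].split("}", 1)[0] if root.tag.startswith("{") else ""
        if ns.startswith(STRICT_PREFIX):
            return "probably Strict"
        if ns.startswith(TRANSITIONAL_PREFIX):
            return "probably Transitional (or ECMA)"
        return "unrecognized namespace: " + ns

    if __name__ == "__main__":
        print(guess_variant(sys.argv[1]))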

On the positive side, Microsoft has released the Open XML SDK as open source on GitHub. This is at least a partial answer to Feilner’s complaint that “there are no free and open source solutions that fully support OOXML.”

Incidentally, I continue to hate Microsoft’s use of the deliberately confusing term “Open XML” for OOXML.

Thanks to @willpdp for tweeting the links referenced here.

Library of Congress format recommendations

The Library of Congress has issued a set of format recommendations covering both physical and digital documents. The LoC’s digital preservation blog has an interview with Ted Westervelt on how the recommendations were developed. They’re not just for the library’s own staff, he explains, but for “all stakeholders in the creative process.”

The guidelines repeatedly state: “Files must contain no measures that control access to or use of the digital work (such as digital rights management or encryption).” That’s pushback that can’t be ignored. In some cases, though, the message is mixed. For theatrically released films, standard or recordable Blu-ray is accepted, yet the same anti-DRM boilerplate is attached. I don’t know where they expect to get DRM-free Blu-ray discs; when it comes to big-name movies, DRM-free options are few.

It’s also interesting that software, specifically games and learning materials, is included. This has been a growing area of interest in recent years. Rather than relying on emulation, the recommendations call for source code, documentation, and a specification of the exact compiler used to build the application.

There’s material here to fuel constructive debate and expansion for years.

New blocks in Unicode 7

Unicode 7.0.0 has been released, with 2,834 new character codes. It’s been fascinating looking into some of the blocks that have been added; here’s a sampling.

Bassa Vah is a really obscure script from what is now Liberia, possibly predating the country. Old Permic is supposed to be a close relative of Cyrillic, but any visual resemblance is lost on me.

Some of the writing systems came from a religious impulse. Mende Kikakui was devised by an Islamic scholar and was once widely used for the Mende language in Africa. It’s been mostly displaced by the Latin alphabet. Shong Lue Yang introduced the Pahawh Hmong writing system for the Hmong language in southeast Asia, claiming to have received it from God. Pau Cin Hau, named after its creator, was a 20th century system used for religious writings in Burma. Its original version had over a thousand characters, but the Unicode block is based on the 57-character alphabetic system. The Manichaean alphabet is fascinating just because of its name, recalling the conflicts in early Christianity. According to tradition, Mani, the founder of Manichaeanism, created the alphabet.

Finally, one of the oldest writing systems in the world, Linear A, is new in Unicode 7. It’s from ancient Crete, and no one knows how to read its texts. Now you can create computer documents in it, if you’re a scholar of old languages or just like confusing people.

Still no Klingon, though.

Now the JHOVE UTF-8 module needs to be updated for all these new blocks.
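
Out of curiosity, here’s a sketch of the kind of block-range table a module like that has to carry, using a few of the new blocks mentioned above. The ranges are the ones I read off the Unicode 7.0 code charts; double-check them against the standard before relying on them.

    # Code point ranges for a few Unicode 7.0 additions (taken from the published
    # charts; verify against the standard before using them in real validation).
    NEW_BLOCKS = [
        (0x10600, 0x1077F, "Linear A"),
        (0x16AD0, 0x16AFF, "Bassa Vah"),
        (0x16B00, 0x16B8F, "Pahawh Hmong"),
        (0x1E800, 0x1E8DF, "Mende Kikakui"),
    ]

    def new_block_name(code_point):
        """Return the name of the new block a code point falls in, or None."""
        for start, end, name in NEW_BLOCKS:
            if start <= code_point <= end:
                return name
        return None

    print(new_block_name(0x10600))  # Linear A
    print(new_block_name(0x0041))   # None; plain LATIN CAPITAL LETTER A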

TIFF/EP vs. Exif

I just discovered today that there are two different TIFF tags called “FocalPlaneResolutionUnit.” Tag 41488 goes by this name and is part of the Exif tag set. Accepted values for it are:

  • 1 = No absolute unit of measurement
  • 2 = Inch
  • 3 = Centimeter

Tag 37392 is a TIFF/EP (Electronic Photography) tag; only a working draft of that spec is available online, not the final version. The tag is also used in other raw formats, including DNG. Its accepted values are:

  • 1 = Inch
  • 2 = Metre
  • 3 = Centimetre
  • 4 = Millimetre
  • 5 = Micrometre

Recently I was sent a TIFF file, as a JHOVE issue, that had a tag 41488 with a value of 4. JHOVE correctly, but perhaps confusingly, reported that the FocalPlaneResolutionUnit tag had an invalid value.
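
To make the clash concrete, here’s a minimal sketch of the check a validator has to make: the legal value set depends on the tag number, not on the shared name. The tables just restate the lists above; the function is my own illustration, not JHOVE’s code.

    # Legal FocalPlaneResolutionUnit values, keyed by tag number.
    FOCAL_PLANE_UNITS = {
        41488: {1: "no absolute unit", 2: "inch", 3: "centimetre"},    # Exif
        37392: {1: "inch", 2: "metre", 3: "centimetre",
                4: "millimetre", 5: "micrometre"},                     # TIFF/EP
    }

    def check_focal_plane_unit(tag, value):
        """Report whether a FocalPlaneResolutionUnit value is legal for this tag."""
        units = FOCAL_PLANE_UNITS.get(tag)
        if units is None:
            return "tag %d is not a FocalPlaneResolutionUnit tag" % tag
        if value in units:
            return "ok: %s" % units[value]
        return "invalid value %d for tag %d" % (value, tag)

    print(check_focal_plane_unit(37392, 4))  # ok: millimetre
    print(check_focal_plane_unit(41488, 4))  # invalid; the case JHOVE flagged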

There are other tags in TIFF/EP that are equivalent, or nearly so, to Exif tags. In some cases their values are specified identically; in others they aren’t. The Exif SubjectLocation tag is numbered 41492 and always has two shorts for its value, giving X and Y coordinates. The TIFF/EP counterpart is tag 37396, which can also have three shorts (specifying a circle) or four (specifying a rectangle).

I don’t know how this came about, but it’s something to watch out for in software that deals with both Exif and TIFF/EP tags. Some software may accept the EP extensions for Exif tags, but there’s no guarantee this will work.

A PDF question

A while back, I posted a question on superuser.com about a PDF issue that’s causing problems in JHOVE. So far it hasn’t gotten any answers, so I’m signal-boosting my own question here. Here’s what I asked:

The JHOVE parser for PDF, which I maintain, will sometimes find a non-dictionary object in a PDF’s Annots array. According to section 8.4.1 of the PDF spec, the Annots array holds “an array of annotation dictionaries.” In the case that I’m looking at right now, there’s a keyword of “Annot” instead of a dictionary. Is this an invalid PDF file, or is there a subtlety in the spec which I’ve overlooked?

Answering on superuser.com is best, so other people can see the answer, but if you prefer to answer here, I’ll post or summarize any useful response, with attribution, as an answer over there.

The future of WebM

Yesterday I posted about the WebP still image format, expressing some skepticism about how easily it will catch on. Its companion video format, WebM, may stand a better chance. Still images aren’t exciting any more; JPEG delivers photographs well enough, PNG does the same for line art, and there isn’t a compelling reason to change. Video is still in flux, though, and its high bandwidth requirements mean there’s a payoff for any improvement in compression and throughput. The long-running battle among HTML5 stakeholders over video shows that it’s far from a settled area. Patents are a big issue; if you implement H.264, you have to pay licensing fees. Alternatives are attractive from both a technological and an economic standpoint.

With Google pushing WebM and owning YouTube, browser developers have a clear reason to support it. YouTube plans to adopt the new WebM codec, VP9, once it’s complete. I haven’t seen details of the plan, but most likely YouTube will offer the same video in multiple formats and query the browser’s capabilities to determine whether it can accept VP9. If the advantage is real and users who can get VP9 see fewer pauses in their videos, more browser makers will undoubtedly jump on the bandwagon.

The disappearing format blues

Old formats sometimes fade into obscurity and lose support, even when they come from a company as big as Microsoft. Chris Rusbridge has noted that Microsoft’s Open Specifications page only goes back as far as Office 97, and that PowerPoint 4.0 files can’t be opened with today’s Microsoft Office. Tony Hey, vice president of Microsoft Research Connections, has replied. The response was encouraging, particularly in suggesting that Microsoft might “participate in a ‘crowd source’ project working with archivists to create a public spec of these old file formats.”

There’s usually some kind of software around that can read old formats, but a search doesn’t turn up much in this case. There’s something called PowerPressed, which wraps old PowerPoint files in a .exe application. It looks as if it should run on current Windows systems, but all I know is what that page says.

The situation shows the risk of using a format that isn’t publicly documented. Today this is less of a problem. I think it’s been shown that publishing format specs doesn’t lead to cannibalization of sales by competing software; the company that created the spec is in a position to produce the best implementation. The PDF specification is fully public, and Adobe still dominates the market for PDF software. Publishing the spec has just made the pie bigger. There’s still quite a lot of software that uses unpublished proprietary formats, though, and it’s risky to count on the long-term usability of the files it produces.

Embracing the chaos of formats

We often think of formats in terms of specifications and standards, and this can be useful. If you want to know exactly what the Latin-1 encoding is, you can look at the ISO-8859-1 standard and it will tell you. The standard isn’t always a reliable guide to what’s actually out there, though. Someone noticed that ISO 8859 reserves a lot of rarely used control codes and put additional printing characters in their place. This got codified in turn as Windows-1252 (which Microsoft misleadingly calls an ANSI standard), but there are many ad hoc or obscure encodings for which references are hard or impossible to find.
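
You can see the divergence with a couple of lines of Python. The bytes from 0x80 to 0x9F are rarely used control characters in ISO-8859-1 but printable characters in Windows-1252:

    data = bytes([0x93, 0x48, 0x69, 0x94])  # a short string quoted the Windows way

    print(data.decode("cp1252"))         # prints "Hi" in curly quotes (U+201C, U+201D)
    print(repr(data.decode("latin-1")))  # '\x93Hi\x94': the same bytes as C1 controls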

Earth’s official authorities refused to grant the Klingons a place in Unicode for their characters; nonetheless, there is an unofficial registry that uses part of the Unicode Private Use Area for Klingon and other constructed scripts. Is it official Unicode? No. If you use code points F8D0-F8FF, will others recognize them as Klingon characters? Sometimes.

I’ve written about the TIFF situation before. The TIFF 6.0 spec is an insufficient guide to today’s real-life TIFF. You have to go through scattered tech notes to understand how it’s really used.

Making sense of situations like these requires recognizing that formats don’t flow unchanged from the minds of their designers to their implementation in the world’s computers. People change things to meet their needs. This makes formats more useful for some purposes; at the same time, it makes them more confusing. The only alternative would be to create a format police force with the power to arrest and punish innovators.

The situation is analogous to natural language. You can insist that anything that disagrees with the grammar books is wrong, but if everybody talks that way, there ain’t no stoppin’ it. At the same time, the grammar books put a brake on unnecessary change, keeping the language from breaking down into a thousand mutually unintelligible dialects.

Digital preservationists have to look at the actual usage of formats, not just their official specifications. This doesn’t mean they should accept every deviation, but they need to acknowledge changes that have become de facto standards. Context matters; an archive of nineteenth-century literature doesn’t have to be concerned with Klingon characters, but an archive of science fiction fan literature had better take them into account. Even an occasional scholarly paper might have a word or two in the pIqaD script.

This proliferation of variants is a big part of why centralized registries of format information don’t work. Not only is there too much information; it also keeps changing. The best we can hope for is a coordinated approach to navigating a chaotic body of information.

PDF/A-3

The latest version of PDF/A, a subset of PDF suitable for long-term archiving, is now available as ISO standard 19005-3:2012. According to the PDF/A Association Newsletter, “there is only one new feature with PDF/A-3, namely that any source format can be embedded in a PDF/A file.”

This strikes me as a really bad idea. The whole point of PDF/A is to restrict content to a known, self-contained set of options. The new version provides a back door that allows literally anything. The intent, according to the article, is to let archivists save documents in their original format as well as their PDF representation. Certainly saving the originals is a good archiving practice, but it should be done in an archival package, not in a PDF format designed for archiving.

Mission creep afflicts projects of all kinds, and this is a case in point.