Tag Archives: Unicode

Update on JHOVE

I’ve updated the UTF-8 module in the JHOVE source on GitHub to include the new code blocks for Unicode 7.0.0. Also, I’ve recently fixed the pom.xml file so it will put both the command line and the GUI JAR files into the local repository.

I need more input before I’m comfortable with creating a release 1.12 of JHOVE. I don’t have any prior experience with creating a public, open-source project that’s built with Maven, and I don’t know how much of the baggage of the SourceForge project really needs to be kept. There are some specialty JARs in the old project, but I don’t know if anyone uses them. Most importantly, there still needs to be a distribution in Zip and Tar formats. New features would be interesting, but the first thing is to make a JHOVE that is as useful as it was before.

Comments, suggestions, and code contributions are welcome, as always.

New blocks in Unicode 7

Unicode 7.0.0 has been released, with 2,834 new character codes. It’s been fascinating looking into some of the blocks that have been added; here’s a sampling.

Bassa Vah is a really obscure script from what is now Liberia, possibly predating the country. Old Permic is supposed to be a close relative of Cyrillic, but any visual resemblance is lost on me.

Some of the writing systems came from a religious impulse. Mende Kikakui was devised by an Islamic scholar and was once widely used for the Mende language in Africa. It’s been mostly displaced by the Latin alphabet. Shong Lue Yang introduced the Pahawh Hmong writing system for the Hmong language in Southeast Asia, claiming to have received it from God. Pau Cin Hau, named after its creator, was a 20th-century system used for religious writings in Burma. Its original version had over a thousand characters, but the Unicode block is based on the 57-character alphabetic system. The Manichaean alphabet is fascinating just because of its name, recalling the conflicts in early Christianity. According to tradition, Mani, the founder of Manichaeanism, created the alphabet.

Finally, one of the oldest writing systems in the world, Linear A, is new in Unicode 7. It’s from ancient Crete, and no one knows how to read its texts. Now you can create computer documents in it, if you’re a scholar of old languages or just like confusing people.

Still no Klingon, though.

Now the JHOVE UTF-8 module needs to be updated for all these new blocks.
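
To give a feel for what "updating for new blocks" involves, here is a minimal sketch of a block lookup by code point range. It is not JHOVE's actual code (JHOVE is written in Java, and its internal structure isn't shown here); the table name `UNICODE_7_BLOCKS` and the function `blocks_used` are made up for illustration. The ranges are the ones I understand the Unicode 7.0 block list to assign to the scripts mentioned above.

```python
# Hypothetical table of some blocks added in Unicode 7.0:
# (first code point, last code point, block name).
UNICODE_7_BLOCKS = [
    (0x10350, 0x1037F, "Old Permic"),
    (0x10600, 0x1077F, "Linear A"),
    (0x10AC0, 0x10AFF, "Manichaean"),
    (0x11AC0, 0x11AFF, "Pau Cin Hau"),
    (0x16AD0, 0x16AFF, "Bassa Vah"),
    (0x16B00, 0x16B8F, "Pahawh Hmong"),
    (0x1E800, 0x1E8DF, "Mende Kikakui"),
]


def blocks_used(text):
    """Return the set of block names (from the table above) found in text."""
    found = set()
    for ch in text:
        cp = ord(ch)
        for start, end, name in UNICODE_7_BLOCKS:
            if start <= cp <= end:
                found.add(name)
                break
    return found


# One ASCII letter, one Linear A code point, one Pahawh Hmong code point.
sample = "A" + chr(0x10600) + chr(0x16B00)
print(sorted(blocks_used(sample)))  # ['Linear A', 'Pahawh Hmong']
```

A characterization tool that lacks the newer ranges would simply fail to report these blocks, which is why the table has to grow with each Unicode release.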

In MS Word, the bullet bites back

There’s nothing new about Microsoft’s ignoring standards and ruining compatibility, but knowing the details is useful. One case I just learned about, from Mark Mandel, is the way it does bullet lists. This applies to the old Word DOC format on Mac OS X.

A 2008 OpenOffice Forum discussion explains the problem. If you create a bullet list in Word and import it into OpenOffice, the bullets are turned into something odd-looking. The file doesn’t use Unicode bullets, but instead uses the Microsoft Symbol font, which has its own nonstandard encoding. This applies only to bullets generated by list styles, not to ones you type in. On Windows, OpenOffice will display the files correctly, since it has access to the needed fonts and mapping.

Apparently the issue can also be manifested when creating a DOC file with OpenOffice and importing into Word, though I’m not clear on how that happens.

The problem is that Word 97/2000/2002 isn’t fully Unicode-compatible, mapping Unicode characters to the 8-bit encodings that its fonts need. This has presumably been fixed in the more recent versions that use DOCX (Office Open XML), but DOC is still widely used as an interchange format, so it’s an important issue. It’s also an illustration of the risks of using undocumented interchange formats.
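
As a rough illustration of the kind of repair an importing application has to make, here is a small sketch. It assumes the list bullet arrives as the private-use code point U+F0B7, which is where Symbol-font bullets commonly end up after import; that detail is my assumption, not something the forum discussion spells out. The names `SYMBOL_TO_UNICODE` and `repair_symbol_bullets` are invented for the example.

```python
# Assumption: the Symbol-font bullet shows up as private-use U+F0B7.
# Map it to the standard Unicode BULLET, U+2022.
SYMBOL_TO_UNICODE = {
    0xF0B7: 0x2022,
}


def repair_symbol_bullets(text):
    """Replace Symbol-font private-use bullets with Unicode bullets."""
    return text.translate(SYMBOL_TO_UNICODE)


imported = "\uf0b7 First item\n\uf0b7 Second item"
repaired = repair_symbol_bullets(imported)
print(repaired == "\u2022 First item\n\u2022 Second item")  # True
```

A real converter would need a full table for the font in question, since the bullet is only one of many characters with a nonstandard encoding.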

A history of character encodings

Here’s a nice little history of character encodings, from ASCII through UTF-8.

It doesn’t really “date back to the earliest days of computers”; before ASCII there was a jumble of incompatible character encodings, some using as few as 5 bits. Even afterward, a bizarre IBM encoding called EBCDIC hung on for many years. But the path from ASCII to its descendants is fascinating enough by itself.

Thanks to Andy Jackson for the link.

SourceForge security incident and doppelgänger characters

This morning I got an email from SourceForge saying that all passwords had been reset because of a password sniffing incident. Naturally, I’m suspicious of all email of this kind, but I do have a SourceForge account. So rather than follow any of the links in the mail, I tried to log in normally and found that passwords were in fact reset. I followed the procedure for resetting by email and my account’s working again.

I’m sure some of you reading this also have SourceForge accounts, so this bit of reassurance may be helpful, especially if your phishing filters (philters?) kept you from seeing the notice in the first place. It’s likely some fakers will set up scams to take advantage of this issue, so always go to the SourceForge website by typing in the URL or using a bookmark, rather than by following a link from email. It’s easy to mistake a near-lookalike URL on a quick glance.

Worse yet (yes, this post has something to do with formats), there are now exact lookalike URLs, thanks to the unfortunate policy of allowing Unicode in URLs. There are numerous cases where characters in non-English character sets normally look just like letters of the Roman alphabet. Someone could, in principle, register sourceforgе.net, which looks just like sourceforge.net. But do a local text search for “sourceforge” in your browser, and you’ll notice that the first “sourceforgе.net” (and this one) are skipped over. The sixth letter isn’t the ASCII letter “e” but the Cyrillic letter “e,” which usually looks the same or very nearly.

If your browser doesn’t have a Cyrillic font, you may be seeing a placeholder glyph instead. Or if it views the page in Latin-1 instead of UTF-8, you may see a capital eth (Ð, which looks like a D with a stroke) followed by a micro sign (µ, which looks like a Greek lower-case mu).
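
Both effects are easy to check for yourself. The sketch below builds a lookalike host name with an explicit escape for the Cyrillic letter (in the sixth position, purely for illustration), lists any non-ASCII characters in it, and then shows what the letter's UTF-8 bytes turn into when misread as Latin-1. The helper `non_ascii_chars` is a made-up name, not part of any library.

```python
import unicodedata

ascii_host = "sourceforge.net"
# Hypothetical lookalike: U+0435 (Cyrillic) in place of the sixth letter.
lookalike_host = "sourc\u0435forge.net"

# The two strings look alike on screen but are not equal.
print(ascii_host == lookalike_host)  # False


def non_ascii_chars(text):
    """List (index, code point, Unicode name) for each non-ASCII character."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(text)
        if ord(ch) > 127
    ]


print(non_ascii_chars(lookalike_host))
# [(5, 'U+0435', 'CYRILLIC SMALL LETTER IE')]

# The Cyrillic letter is two bytes in UTF-8 ...
utf8_bytes = "\u0435".encode("utf-8")
print(utf8_bytes)  # b'\xd0\xb5'

# ... and each byte becomes its own character when read as Latin-1.
misread = utf8_bytes.decode("latin-1")
print([unicodedata.name(ch) for ch in misread])
# ['LATIN CAPITAL LETTER ETH', 'MICRO SIGN']
```

Checking a link's host name for characters outside ASCII this way won't catch every trick, but it does expose this particular substitution.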

With any email that offers to correct a password issue, exercise extreme caution, even though some are legitimate.


I was a little amazed and very amused to see that one of the new features of Unicode 6.0, released just last month, is the Emoji symbol set, which is reported to be widely used on Japanese cell phones. These whimsical symbols must open all kinds of possibilities for text messaging.

Unicode may not officially include Klingon characters, but it can still allow for fun.

Unicode 5.2.0

Unicode 5.2.0 is now out. It adds 6,648 new characters but still doesn’t officially include Klingon.