Today I learned from a science fiction discussion group that SMS messages don’t use UTF-8. In fact, they don’t even use ASCII or an extension of it. It’s a case of old technology which has survived beyond its time.
The usual encoding for SMS text messages is GSM-7. Most cell phones use it, regardless of whether they’re on the GSM network or not. They generally support Unicode as well, but in a strange way.
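To make the "not even ASCII" point concrete, here is a minimal sketch that checks whether a message fits in GSM-7. The two tables are my own transcription of the GSM 03.38 default alphabet and its extension table, and the names `GSM7_BASIC`, `GSM7_EXTENSION`, and `fits_gsm7` are made up for this illustration, so treat it as an approximation and not a reference implementation.

```python
# Assumption: transcription of the GSM 03.38 default alphabet (128 positions).
# The position in the string is the 7-bit code, so "@" is code 0, not 64.
GSM7_BASIC = (
    "@£$¥èéùìòÇ\nØø\rÅåΔ_ΦΓΛΩΠΨΣΘΞ\x1bÆæßÉ"
    " !\"#¤%&'()*+,-./0123456789:;<=>?"
    "¡ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÑÜ§"
    "¿abcdefghijklmnopqrstuvwxyzäöñüà"
)

# Assumption: characters reachable through the escape code (0x1B);
# each of these costs two 7-bit units in a message.
GSM7_EXTENSION = "\f^{}\\[~]|€"


def fits_gsm7(text):
    """Return True if every character is in the basic or extension table."""
    return all(ch in GSM7_BASIC or ch in GSM7_EXTENSION for ch in text)


print(GSM7_BASIC.index("@"), ord("@"))   # GSM-7 code vs. ASCII code
print(fits_gsm7("Hello, world!"))
print(fits_gsm7("\U0001F600"))           # an emoji does not fit
```

A message that fails this check is the kind that has to be sent in a Unicode encoding instead.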
July 17 was World Emoji Day. Anyone can declare a World Anything Day, but my local library thought it was important enough to give it part of a sign, along with Cell Phone Courtesy Month.
They didn’t think it was important enough to give accurate information, though. It does tell us something about how non-tech people think of emoji. Here’s the content of the sign, with commentary.
In 2001, the Unicode Consortium rejected a proposal to encode the Klingon script. The reasons it gave were:
Lack of evidence of usage in published literature, lack of organized community interest in its standardization, no resolution of potential trademark and copyright issues, question about its status as a cipher rather than a script, and so on.
Fair enough, but don’t most of these objections apply equally to emoji?
In Orwell’s 1984, the Newspeak language followed the principle that if you can abolish certain words, you can abolish the thoughts that go with them.
It was intended that when Newspeak had been adopted once and for all and Oldspeak forgotten, a heretical thought — that is, a thought diverging from the principles of Ingsoc — should be literally unthinkable, at least so far as thought is dependent on words. … This was done partly by the invention of new words, but chiefly by eliminating undesirable words and by stripping such words as remained of unorthodox meanings, and so far as possible of all secondary meanings whatever.
Apple is doing something like this with Unicode codepoint U+1F52B (🔫), which the code chart defines as PISTOL, with the explanatory text of “handgun, revolver.” There’s nothing that suggests it’s supposed to represent a water gun or any other kind of toy. However, Apple has elected to represent this character as a water pistol in iOS 10.
The Unicode Consortium has announced the release of Unicode 9.0. It adds character sets for some little-known languages, including Osage, Nepal Bhasa, Fulani, the Bravanese dialect of Swahili, the Warsh orthography for Arabic, and Tangut. It updates the collation specification and security recommendations.
Most Unicode implementations will require just font upgrades, but full support of some of the more unusual scripts will require attention to the migration notes.
“Asymmetric case mapping” sounds interesting. I believe this means that the conversion between upper case and lower case isn’t one-to-one and reversible. The notes give the example of “the asymmetric case mapping of Greek final sigma to capital sigma.” Lowercase sigma has two forms; it’s σ except at the end of a word, where it’s ς. Both turn into Σ in uppercase.
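The sigma example is easy to try out. This is a small sketch using Python's built-in string methods; the variable names are mine, and the word-level line relies on CPython applying the final-sigma rule when lowercasing, which is how I understand current Python 3 to behave.

```python
lower_sigma = "\u03c3"    # σ, used everywhere except at the end of a word
final_sigma = "\u03c2"    # ς, used at the end of a word
capital_sigma = "\u03a3"  # Σ

# Both lowercase forms map to the same capital letter.
both_upper = (lower_sigma.upper(), final_sigma.upper())
print(both_upper == (capital_sigma, capital_sigma))

# Going back down from a lone capital sigma gives σ, so ς does not round-trip.
round_trip = final_sigma.upper().lower()
print(round_trip == final_sigma)

# Inside a word, lowercasing picks the final form by context.
word_lower = "\u039f\u0394\u039f\u03a3".lower()   # ΟΔΟΣ
print(word_lower)
```

The mapping loses information in one direction, which is why case conversion cannot be treated as a simple reversible character-by-character table.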
What really has people excited about Unicode 9, if a Startpage search is any indication, isn’t any of these things, but that about 1% of the new characters are emoji and that Apple and Microsoft lobbied against one candidate emoji. I wonder if the Unicode Consortium regrets having gotten involved in that mess in the first place. There are no possible criteria except whims for what the set should include. There’s no limit on how many could be added. OK, having a universal set of encodings promotes information interchange, but the tail is wagging the 🐕.
By the way, what’s the plural of “emoji”? I use “emoji” as both singular and plural, but I’m seeing “emojis” with increasing frequency. It just looks wrong to me. Does anyone say “kanjis” or “romajis” for the other Japanese character sets? I had to argue with the editor to keep the title of my article “The War on Emoji” that way.
Posted in News
This post may be illegal in Indonesia. It includes the code point sequence U+1F468 U+200D U+2764 U+FE0F U+200D U+1F48B U+200D U+1F468, which renders as the emoji 👨❤️💋👨 or “man kissing man.” According to a Time article, the Indonesian Ministry of Communication and Informatics is “asking” Facebook to block the use of “gay” emoji. Failure to comply could mean the Negative Content Management Panel (George Orwell would have been impressed!) will block Facebook in Indonesia.
Emoji have generated several controversies already, but this is the first I’ve heard of a government censoring code points. It’s couched in terms of “sensitivity,” “respect,” and protecting children.
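For anyone curious what is actually inside that "man kissing man" sequence, here is a short sketch that builds it from the eight code points listed above and prints each one's Unicode name using Python's standard `unicodedata` module. The variable names are mine.

```python
import unicodedata

# The eight code points from the post, in order.
CODE_POINTS = [0x1F468, 0x200D, 0x2764, 0xFE0F, 0x200D, 0x1F48B, 0x200D, 0x1F468]

sequence = "".join(chr(cp) for cp in CODE_POINTS)
names = [unicodedata.name(ch) for ch in sequence]

for cp, name in zip(CODE_POINTS, names):
    print(f"U+{cp:04X}  {name}")

# One visible glyph, but eight code points and many more bytes on the wire.
utf8_length = len(sequence.encode("utf-8"))
print(len(sequence), utf8_length)
```

The single picture is really two MAN characters, a heart, and a kiss mark glued together with zero width joiners, so blocking "the emoji" means filtering a sequence and not one code point.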
Unicode is a great thing, but sometimes its thoroughness poses problems. Different character sets often include characters that look exactly like common ASCII characters in most fonts, and these can be used to spoof domain names. Sometimes this is called a homograph attack or script spoofing. For instance, someone might register the domain gοοgle.com, which looks a lot like “google.com,” but actually uses the Greek letter omicron instead of the Roman letter o. (Search this page in your browser for “google” if you don’t believe me.) Such tricks could lure unwary users into a phishing site. A real-life example, which didn’t even require more than ASCII, was a site called paypaI.com — that’s a capital I instead of a lower-case L, and they look the same in some fonts. That was way back in 2000.
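Here is a rough sketch of how a program might flag the omicron trick. Python's standard library has no script property, so this uses the first word of each letter's Unicode name (LATIN, GREEK, and so on) as a stand-in for its script. That shortcut and the helper names `rough_scripts` and `looks_mixed` are my own assumptions for illustration, not how browsers or registrars actually do it.

```python
import unicodedata


def rough_scripts(label):
    """Collect a crude script tag for each letter: the first word of its name."""
    scripts = set()
    for ch in label:
        if ch.isalpha():
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def looks_mixed(label):
    """Flag a label whose letters come from more than one script."""
    return len(rough_scripts(label)) > 1


spoof = "g\u03bf\u03bfgle"   # two Greek omicrons in place of Latin o
print(rough_scripts(spoof), looks_mixed(spoof))
print(rough_scripts("google"), looks_mixed("google"))
```

A check like this would not have caught the paypaI.com case, since a capital I and a lower-case L are both plain Latin letters.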
What’s your favorite character? Luke Skywalker? Georgia Mason? Captain Ahab?
Oh, sorry, we’re not talking about that kind of character. We’re talking about characters like the Hungarian double-acute u (ű), the four-leaf clover emoji (🍀), or the Katakana “ka” (カ). The Unicode Consortium is looking for people to “adopt” their favorite characters with a tax-deductible donation. Each character can have one Gold ($5000) sponsor, five Silver ($1000) sponsors, and any number of Bronze ($100) sponsors. As I read the rules, only recognized Unicode characters are eligible, so you probably can’t support Klingon characters.
Encoding all the characters of all the world’s languages is an endless task. Unicode 8.0 improves the treatment of Cherokee, Tai Lue, Devanagari, and more. For a lot of people, the most interesting part will be the implementation of “diverse” emoji in a variety of colors. A Unicode Consortium report explains:
People all over the world want to have emoji that reflect more human diversity, especially for skin tone. The Unicode emoji characters for people and body parts are meant to be generic, yet following the precedents set by the original Japanese carrier images, they are often shown with a light skin tone instead of a more generic (nonhuman) appearance, such as a yellow/orange color or a silhouette.
Five symbol modifier characters that provide for a range of skin tones for human emoji are planned for Unicode Version 8.0 (scheduled for mid-2015). These characters are based on the six tones of the Fitzpatrick scale, a recognized standard for dermatology (there are many examples of this scale online, such as FitzpatrickSkinType.pdf). The exact shades may vary between implementations.
… When a human emoji is not immediately followed by a emoji modifier character, it should use a generic, non-realistic skin tone.
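The mechanism described in the report can be sketched in a few lines: a skin-toned emoji is just the base character followed immediately by one of the five modifier characters, which occupy U+1F3FB through U+1F3FF. The choice of the waving hand as the base and the variable names are mine, and the name lookup assumes a Python whose Unicode data is at version 8.0 or later.

```python
import unicodedata

# The five modifier characters; the lightest one covers Fitzpatrick types 1 and 2,
# which is how six skin types end up as five code points.
MODIFIERS = [chr(cp) for cp in range(0x1F3FB, 0x1F400)]

WAVING_HAND = "\U0001F44B"   # base emoji chosen for illustration

# A toned emoji is the base character followed directly by a modifier.
variants = [WAVING_HAND + modifier for modifier in MODIFIERS]

for modifier in MODIFIERS:
    print(f"U+{ord(modifier):04X}  {unicodedata.name(modifier)}")

print(WAVING_HAND, *variants)
```

Whether the last line shows one generic hand and five toned ones, or hands followed by colored squares, depends on the font and platform displaying it.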
I’ve updated the UTF-8 module in the JHOVE source on GitHub to include the new code blocks for Unicode 7.0.0. Also, I’ve recently fixed the pom.xml file so it will put both the command line and the GUI JAR files into the local repository.
I need more input before I’m comfortable with creating a release 1.12 of JHOVE. I don’t have any prior experience with creating a public, open-source project that’s built with Maven, and I don’t know how much of the baggage of the SourceForge project really needs to be kept. There are some specialty JARs in the old project, but I don’t know if anyone uses them. Most importantly, there still needs to be a distribution in Zip and Tar formats. New features would be interesting, but the first thing is to make a JHOVE that is as useful as it was before.
Comments, suggestions, and code contributions are welcome, as always.