SMS messages and GSM encoding

Today I learned from a science fiction discussion group that SMS messages don’t use UTF-8. In fact, they don’t even use ASCII or an extension of it. It’s a case of old technology which has survived beyond its time.

The usual encoding for SMS text messages is GSM-7. Most cell phones use it, regardless of whether they’re on the GSM network or not. They generally support Unicode as well, but in a strange way.

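It’s messy. GSM-7 isn’t compatible with ASCII. Take a look at this GSM-7 table, which includes Latin-1 code points for comparison. The upper- and lower-case letters of the Roman alphabet, the digits, and most common punctuation sit in the same places as in ASCII, but the rest doesn’t. The @ sign is at 0x00, the dollar sign is at 0x02, and the range ASCII reserves for control codes is mostly filled with accented and Greek letters.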

The encoding isn’t even one character per byte. The 7-bit characters are packed so that 8 of them fit in 7 bytes. The text of a single SMS message is limited to 140 bytes, but packing lets it hold 160 characters (140 × 8 / 7 = 160). The design is a remnant of a time when bandwidth was expensive.
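Here's a minimal sketch of that packing in Python. The function names are my own, and it assumes the usual layout where each 7-bit value fills the next free low-order bits of the byte stream. It leaves out message headers and padding entirely.

```python
def pack_septets(septets):
    """Pack 7-bit values into bytes, filling low-order bits first."""
    out = bytearray()
    acc = 0      # bit accumulator
    nbits = 0    # number of valid bits currently held in acc
    for s in septets:
        acc |= (s & 0x7F) << nbits
        nbits += 7
        while nbits >= 8:
            out.append(acc & 0xFF)
            acc >>= 8
            nbits -= 8
    if nbits:
        out.append(acc & 0xFF)  # final partial byte, zero-padded at the top
    return bytes(out)


def unpack_septets(data, count):
    """Recover `count` 7-bit values from packed bytes."""
    out = []
    acc = 0
    nbits = 0
    for b in data:
        acc |= b << nbits
        nbits += 8
        while nbits >= 7 and len(out) < count:
            out.append(acc & 0x7F)
            acc >>= 7
            nbits -= 7
    return out


# Lower-case letters have the same codes in GSM-7 as in ASCII,
# so ord() is good enough for this particular word.
packed = pack_septets([ord(c) for c in "hello"])
print(packed.hex())                       # e8329bfd06
print(len(pack_septets([0] * 160)))       # 140
```

The decoder needs the character count because 7 packed bytes could hold either 7 or 8 characters; the packed bytes alone don't say which.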

Just 128 code points aren’t enough for international use, so GSM-7 includes a shifting scheme to select an encoding. A single character can be shifted by putting an escape code in front of it, which makes that character cost two 7-bit slots, or the message header can specify an encoding table for the whole message. It’s like the old days of ISO 8859, except that you can mix different encoding tables in one message.
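To make the single-character shift concrete, here's a rough sketch that turns text into GSM-7 code values. It uses only the default alphabet and the default extension table, reached through the escape code 0x1B. The national-language tables selected in the header are left out, and the names (`GSM7_BASIC`, `to_septets`) are mine, not anything from the standard.

```python
# GSM 7-bit default alphabet; a character's index in the string is its code.
GSM7_BASIC = (
    "@£$¥èéùìòÇ\nØø\rÅåΔ_ΦΓΛΩΠΨΣΘΞ\x1bÆæßÉ"
    " !\"#¤%&'()*+,-./0123456789:;<=>?"
    "¡ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÑÜ§"
    "¿abcdefghijklmnopqrstuvwxyzäöñüà"
)

# Default extension table: these characters need the escape code first.
GSM7_EXTENSION = {
    "\f": 0x0A, "^": 0x14, "{": 0x28, "}": 0x29, "\\": 0x2F,
    "[": 0x3C, "~": 0x3D, "]": 0x3E, "|": 0x40, "€": 0x65,
}
ESCAPE = 0x1B


def to_septets(text):
    """Return the list of 7-bit codes for text, or raise ValueError."""
    septets = []
    for ch in text:
        if ch in GSM7_EXTENSION:
            # Shifted character: escape code, then the extension-table code.
            septets += [ESCAPE, GSM7_EXTENSION[ch]]
        elif ch != "\x1b" and ch in GSM7_BASIC:
            septets.append(GSM7_BASIC.index(ch))
        else:
            raise ValueError(f"{ch!r} is not in the GSM-7 default tables")
    return septets


print(to_septets("A@$"))     # [65, 0, 2]
print(to_septets("Hi {€}"))  # [72, 105, 32, 27, 40, 27, 101, 27, 41]
```

The letter A keeps its ASCII value, while @ and $ do not. The braces and the euro sign each take two slots, so a message full of them runs out of room faster than its character count suggests.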

GSM-7 and Unicode

Modern phones support Unicode for non-GSM characters. This gives a larger set of characters at a cost in message length. The encoding isn’t UTF-8 or UTF-16, but UCS-2. That encoding is considered obsolete for everything except text messaging. It uses a fixed two bytes per character and supports only the 65,536 code points of the Basic Multilingual Plane (BMP); unlike UTF-16, it has no surrogate pairs for reaching characters beyond the BMP.
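Python has no codec named for UCS-2, but for BMP characters it produces the same bytes as UTF-16. Here's a small sketch that leans on that. The helper name is made up, and the big-endian byte order is an assumption on my part about what goes over the air.

```python
def encode_ucs2(text):
    """Encode text as big-endian UCS-2, rejecting anything outside the BMP."""
    for ch in text:
        if ord(ch) > 0xFFFF:
            raise ValueError(f"U+{ord(ch):04X} is outside the BMP")
    # Inside the BMP, UTF-16 and UCS-2 produce identical bytes.
    return text.encode("utf-16-be")


print(encode_ucs2("Ж").hex())     # 0416
print(len(encode_ucs2("héllo")))  # 10
```

Every character costs two bytes here, including plain Latin letters, which is why a single non-GSM character in a message is so expensive.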

If you send a Unicode text message to a phone that doesn’t support UCS-2, the characters will look like gibberish. A message is still limited to the same number of bytes, so two-byte characters decrease the maximum number of characters it can hold. Fonts can be another limitation; if you send characters that the recipient doesn’t have a font for, they’ll appear as placeholders.

SMS message format

Let’s take a step back and look at the GSM SMS message format. ETSI TS 123 040 gives all the information about the SMS standard. (The Library of Congress cites RFC 5724 as the defining standard, but that only defines a URI for SMS over the Internet.)

Most of the standard deals with transport and command protocols, and encoding is barely mentioned (the character tables themselves live in a companion document, ETSI TS 123 038), but we can see from it that a message can be encoded as GSM-7 or UCS-2. At two bytes per character, a UCS-2 message can hold only 70 characters. SMS can transparently concatenate multiple messages, so you can send longer texts, but they’ll count as two or more messages against your allowance. GSM-7 shifting is more efficient than UCS-2 if you’re sending a message that’s all in one alphabetic language, and I assume most phones default to the encoding for the country they’re sold in.
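The arithmetic behind those limits, and behind how many messages a long text gets billed as, can be sketched like this. The 6-byte concatenation header is my assumption, not something stated above; it's the common case, and it is what brings each part of a long message down to 153 GSM-7 characters or 67 UCS-2 characters. The function names are mine.

```python
PAYLOAD_BYTES = 140        # text payload of one SMS
CONCAT_HEADER_BYTES = 6    # assumed size of the header on each concatenated part


def capacity(bits_per_unit, header_bytes=0):
    """How many 7-bit or 16-bit units fit in one message."""
    return (PAYLOAD_BYTES - header_bytes) * 8 // bits_per_unit


def segments_needed(units, bits_per_unit):
    """Number of SMS messages needed for a text of `units` code units."""
    if units <= capacity(bits_per_unit):
        return 1
    per_part = capacity(bits_per_unit, CONCAT_HEADER_BYTES)
    return -(-units // per_part)  # ceiling division


print(capacity(7), capacity(16))            # 160 70
print(capacity(7, 6), capacity(16, 6))      # 153 67
print(segments_needed(161, 7))              # 2
print(segments_needed(71, 16))              # 2
```

This counts code units, not visible characters, so a shifted GSM-7 character counts as two. It also ignores the rule that a two-unit character can't be split across parts, which can occasionally push a text into one more message than this predicts.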

Emoji

The complication increases when we look at emoji. Shigetaka Kurita is generally credited as the father of emoji on cell phones. In 1998 he devised a set of 176 black-and-white picture characters for NTT DoCoMo. They obviously weren’t listed in any Unicode tables. Even before that, the Pocket Bell pager in Japan had a heart symbol — perhaps the first emoji.

At first emoji were a Japanese phenomenon, and each provider had its own encoding for them. Naturally, people wanted standardization; it would be bad for friendships if one manufacturer’s “love” emoji translated into another one’s “dung pile.” In 2010, the Unicode Consortium devised a standard encoding for emoji which everyone could agree on. Some code points are outside the Basic Multilingual Plane, which strict UCS-2 can’t represent; in practice, phones seem to get around this by sending them as UTF-16 surrogate pairs, at a cost of two 16-bit units per emoji. Unicode doesn’t specify emoji appearance, only a short text description, so emoji don’t always look the same on different brands of equipment.
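It's easy to check which emoji fall outside the BMP and what they would cost in 16-bit units. This sketch assumes a handset that writes such characters as UTF-16 surrogate pairs, which strict UCS-2 does not allow, so treat the counts as illustrative. The helper names are mine.

```python
def outside_bmp(ch):
    """True if the character's code point is above U+FFFF."""
    return ord(ch) > 0xFFFF


def utf16_units(text):
    """Number of 16-bit units the text occupies when encoded as UTF-16."""
    return len(text.encode("utf-16-be")) // 2


heart = "\u2764"         # HEAVY BLACK HEART, inside the BMP
grin = "\U0001F600"      # GRINNING FACE, outside the BMP

print(outside_bmp(heart), utf16_units(heart))   # False 1
print(outside_bmp(grin), utf16_units(grin))     # True 2
print(grin.encode("utf-16-be").hex())           # d83dde00
```

Under that assumption, a 70-unit message holds at most 35 emoji like the second one.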

Somehow it all works, more or less. It’s an example of how “good enough” standards stay in place, rather than tearing everything up to adopt something better.
