2020-08-27

💩

I have been involved with SMS (i.e. text messaging) for a long time. I was even on the ETSI committees that designed GSM (not specifically SMS, sadly), and have been doing things with SMS for nearly 30 years in one way or another, including an SMS->fax/email gateway and even the ETSI landline SMS module for Asterisk. Now, at A&A, we have code to send and receive SMS via a variety of carriers, and even a SIP a-law based ETSI landline SMS system.

The specification for SMS is a typical telecoms specification, and very different to internet specifications: in a telecoms specification, single bits packed in some small data header can subtly change the interpretation of some or all that follows. These specifications are normally very precise but absolutely horrid, in my view.

But where does the pile of poo come in, and how does it relate to a 30 year old specification for SMS? Well, you may be surprised, but SMS allows for 💩.

SMS are actually coded in the signalling used for calls, and so had limited space. There were actually only 140 bytes (or, more correctly, octets) of data for the text itself. As you may know, SMS allow 160 characters; this is achieved by packing a 7 bit alphabet into the 140 bytes (160 x 7 = 1120 bits = 140 x 8).
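The packing itself is simple enough to sketch. The following is a minimal Python illustration, not the code we use: the function names are made up for this example, and it relies on lower case Latin letters having the same codes in the 7 bit alphabet as in ASCII.

```python
def pack_septets(septets):
    """Pack 7 bit values into octets, filling each octet from its low bits up."""
    acc = 0     # bit accumulator
    nbits = 0   # number of valid bits currently held in acc
    out = bytearray()
    for s in septets:
        acc |= (s & 0x7F) << nbits
        nbits += 7
        while nbits >= 8:
            out.append(acc & 0xFF)
            acc >>= 8
            nbits -= 8
    if nbits:
        out.append(acc & 0xFF)  # final partial octet, zero padded
    return bytes(out)


def unpack_septets(octets, count):
    """Reverse of pack_septets; count is the number of septets expected."""
    acc = 0
    nbits = 0
    out = []
    for o in octets:
        acc |= o << nbits
        nbits += 8
        while nbits >= 7 and len(out) < count:
            out.append(acc & 0x7F)
            acc >>= 7
            nbits -= 7
    return out


hello = [ord(c) for c in "hello"]       # same codes as ASCII for these letters
print(pack_septets(hello).hex())        # e8329bfd06
print(len(pack_septets([0x61] * 160)))  # 140
```

The septet count has to be carried separately from the octets, because a packed message can end with spare padding bits that would otherwise look like an extra character.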

In fact SMS allows the data to be coded in several ways: a 7 bit special alphabet, an 8 bit Latin-1 alphabet, and 16 bit unicode (allowing 70 characters). There are also ways to send one longer message in smaller parts. The SMS can also be raw data to be sent to a SIM rather than displayed. Had I written this I'd have used 2 bits to say which it is, but no, the specification uses a Data Coding Scheme which is complicated to say the least. Sometimes the coding is in 2 bits, but at other times it is implied. It is not fun.
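To give a flavour of the Data Coding Scheme, here is a simplified Python sketch of working out the alphabet from the DCS octet. It is based on my reading of the coding groups in GSM 03.38 / 3GPP TS 23.038 and should be checked against the specification before being relied on. It ignores the compression and message class bits, and it treats every group it does not handle as reserved. The function name and the returned labels are made up for this example.

```python
def dcs_alphabet(dcs):
    """Return '7bit', '8bit', 'ucs2' or 'reserved' for a DCS octet (simplified)."""
    group = dcs >> 4
    if group <= 0b0011:
        # General data coding: here the alphabet really is in 2 bits (bits 3 and 2).
        return ("7bit", "8bit", "ucs2", "reserved")[(dcs >> 2) & 0b11]
    if group in (0b1100, 0b1101):
        # Message waiting indication groups: 7 bit alphabet is implied.
        return "7bit"
    if group == 0b1110:
        # Message waiting indication, store message: UCS2 is implied.
        return "ucs2"
    if group == 0b1111:
        # Data coding / message class: a single bit picks 7 bit or 8 bit.
        return "8bit" if dcs & 0b0100 else "7bit"
    return "reserved"  # assumption: groups not handled above are treated as reserved


print(dcs_alphabet(0x00))  # 7bit
print(dcs_alphabet(0x08))  # ucs2
print(dcs_alphabet(0xF6))  # 8bit
```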

The 7 bit alphabet is sort of ASCII, but does allow some interesting characters - being a European spec it includes some accented characters and even some Greek letters.

Of course this also leaves out some key ASCII characters such as {, }, [ and ], and does not even have € (which was added later). These are coded as two character sequences using ESC.
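The practical effect of the ESC sequences is that these characters each cost two of the 160 characters. A minimal Python sketch follows, using the extension table values as I understand them from GSM 03.38. The names are made up for this example, and septet_length assumes every other character in the text is in the basic 7 bit table.

```python
ESC = 0x1B  # escape to the extension table

# Extension table: character -> code that follows ESC
GSM_EXTENSION = {
    "^": 0x14, "{": 0x28, "}": 0x29, "\\": 0x2F,
    "[": 0x3C, "~": 0x3D, "]": 0x3E, "|": 0x40, "€": 0x65,
}


def encode_extended(ch):
    """Return the two septets used for an extension table character."""
    return [ESC, GSM_EXTENSION[ch]]


def septet_length(text):
    """Count septets used, assuming all other characters are in the basic table."""
    return sum(2 if ch in GSM_EXTENSION else 1 for ch in text)


print([hex(s) for s in encode_extended("€")])  # ['0x1b', '0x65']
print(septet_length("a[b]"))                   # 6
```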

The 8 bit character set is just normal Latin-1, and the 16 bit is unicode. The 16 bit coding allows all unicode characters U+0000 to U+FFFF, but where is pile of poo? It is U+1F4A9, which is too big for 16 bits.

The way this is done is to use a little known trick called UTF-16. There are reserved 16 bit unicode codes U+D800 to U+DFFF, known as surrogates. Using a pair of such codes (one from U+D800 to U+DBFF followed by one from U+DC00 to U+DFFF) it is possible to encode U+10000 to U+10FFFF.

This means 💩 is actually coded as two 16 bit sequences, 0xD83D 0xDCA9 in SMS!
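The arithmetic behind that pair is short enough to show. This is a minimal Python sketch (the function name is made up for this example), cross-checked against Python's own UTF-16 codec.

```python
def to_surrogates(cp):
    """Split a code point in U+10000..U+10FFFF into a UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair")
    v = cp - 0x10000               # 20 bit value
    high = 0xD800 + (v >> 10)      # top 10 bits
    low = 0xDC00 + (v & 0x3FF)     # bottom 10 bits
    return high, low


high, low = to_surrogates(0x1F4A9)
print(f"{high:04X} {low:04X}")     # D83D DCA9

# Cross-check with the built-in codec (big endian, no byte order mark)
print("\U0001F4A9".encode("utf-16-be").hex())  # d83ddca9
```

Note that the pair takes up two 16 bit units, so one 💩 uses two of the 70 characters available in a unicode SMS.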

Why does this matter? I mean, who sends 💩 by SMS? As you can imagine, in the early 90s nobody had heard of 💩, and the best emojis we had were :-)

But we do care, honest, as we use it as a blue* M&M test for carriers we deal with. If they have enough attention to detail to handle a pile of poo they probably have the rest sewn up, technically. We are working with a new carrier for SMS messages, and I am pleased to say the unicode is working. They properly translate to/from UTF-8 coding in the messages we exchange (which is what we use internally), unlike our previous carrier, who could not cope. (* see comments)

We have seen a range of such failures, even the case where one carrier could not handle an @ symbol (presumably as it is coded as 0x00, which marks the end of a string in languages like C). Thankfully that carrier was happy for us to send a raw hex TPDU for SMS, allowing us to code any characters. Our SIP2SIM service has handled pile of poo since we launched it...
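The presumed failure with @ is easy to reproduce. The Python sketch below illustrates the C string problem only; it is not that carrier's code, which I have not seen. It holds unpacked 7 bit codes in a C style buffer using the standard ctypes module.

```python
import ctypes

# 'H', '@', 'A' in the 7 bit alphabet: '@' is code 0x00
septets = bytes([0x48, 0x00, 0x41])

buf = ctypes.create_string_buffer(septets)  # C char array, NUL terminated
print(buf.raw)    # b'H\x00A\x00'  (everything that is actually stored)
print(buf.value)  # b'H'           (what C string handling sees)
```

Anything that treats the message as a NUL terminated string loses the @ and everything after it, which is why a length has to be carried with the data.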

The end result is that, shortly, we will be handling a lot more SMS with unicode characters correctly, in most cases, both incoming and outgoing. Watch this space.

7 comments:

  1. GSM puts a printable character in code 0x00? Well that was a really dumb move, C was popular even back when GSM was specified.

  2. Posting comments doesn't appear to work on iPad these days. Oh well.

    Replies
    1. "Comments are moderated purely to filter out obvious spam, but it means they may not show immediately." ☺

  3. No, nothing was appearing in my browser. Normally something pops up saying the comment has been posted. This was just hanging.

  4. And it did it again, no hang this time but nothing obviously saying the comment had been posted either.

  5. Blue M&M test? I know of the brown one: https://www.snopes.com/fact-check/brown-out/

    Replies
    1. Interesting. The story must have morphed before it got to me.
