I have managed to get the pile of poo to correctly display on my iPhone as an incoming SMS text (i.e. using normal GSM SMS not iMessage or some such).
This is actually quite a milestone. There are various gateways to send texts but they all seem to have limitations or ways in which they translate to/from the GSM SMS protocol.
We can usually manage to handle multi-part (i.e. very long) texts, just about. Most of the time we can even handle something called a User Data Header (UDH) which is extra binary data sent with the message.
Getting UDH right is actually crucial for iMessage registrations to work at all. Otherwise your iPhone would not believe you had the number you have (it sends a text and expects a response that has a UDH).
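To make the UDH idea concrete, here is a minimal sketch of the most common header, the one that ties the parts of a multi-part text together (8-bit reference form). The function name `concat_udh` is made up for illustration, and this is not the header iMessage registration uses, which this post does not describe.

```python
MAX_OCTETS = 140  # payload of a single SMS, UDH included


def concat_udh(ref, total, seq):
    """Build the 6-byte concatenation UDH for part `seq` of `total`."""
    if not (0 <= ref <= 255 and 1 <= seq <= total <= 255):
        raise ValueError("ref must be 0-255 and 1 <= seq <= total <= 255")
    return bytes([
        0x05,   # UDHL: five header bytes follow
        0x00,   # IEI: concatenated message, 8-bit reference
        0x03,   # IEDL: three bytes of data follow
        ref,    # same reference number on every part
        total,  # how many parts there are
        seq,    # which part this is, counting from 1
    ])


# The header eats into the 140 octets, so each part carries less text.
UDH_LEN = len(concat_udh(0, 1, 1))
GSM7_PER_PART = (MAX_OCTETS - UDH_LEN) * 8 // 7  # 7-bit characters per part
UCS2_PER_PART = (MAX_OCTETS - UDH_LEN) // 2      # 16-bit characters per part
```

With this header in place a long text is split into parts of 153 characters (7-bit) or 67 characters (16-bit) rather than 160 or 70.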
Getting those key things to work is hard enough, but character set coding is a nightmare. This is because texts can be sent in one of three character sets.
- GSM 7 bit character set. This has 128 characters, which include the normal letters (A-Z,a-z) numbers, punctuation, and a load of accented characters as well as upper case Greek. A text can have 160 characters using this coding. There are then extra characters using ESC (escape) as a prefix to get things like a Euro symbol (using two characters). Even just getting the @ character to work can be a challenge as it is coded on character 00 and not its usual place which breaks some things.
- UCS 8 bit characters - the first 256 Unicode characters. You can have 140 of these in a text.
- UCS 16 bit characters - the first 65536 Unicode characters. You can have 70 of these in a text.
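The sums behind those limits can be sketched in a few lines. This is only an illustration, assuming the standard GSM 03.38 basic alphabet and the usual ESC-prefixed extras; the names `gsm7_septets` and `choose_coding` are made up, and the 8 bit option is ignored for simplicity.

```python
# Basic GSM 7 bit alphabet; a character's index in this string is its code.
GSM_BASIC = (
    "@£$¥èéùìòÇ\nØø\rÅåΔ_ΦΓΛΩΠΨΣΘΞ\x1bÆæßÉ"
    " !\"#¤%&'()*+,-./0123456789:;<=>?"
    "¡ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÑÜ§"
    "¿abcdefghijklmnopqrstuvwxyzäöñüà"
)
GSM_EXTENDED = "^{}\\[~]|€"  # each of these costs ESC plus one more septet

MAX_OCTETS = 140  # payload of a single SMS


def gsm7_septets(text):
    """Return the septets needed in GSM 7 bit, or None if not representable."""
    total = 0
    for ch in text:
        if ch in GSM_EXTENDED:
            total += 2
        elif ch in GSM_BASIC and ch != "\x1b":
            total += 1
        else:
            return None
    return total


def choose_coding(text):
    """Return (coding name, units used, units that fit in one message)."""
    septets = gsm7_septets(text)
    if septets is not None:
        return ("GSM 7-bit", septets, MAX_OCTETS * 8 // 7)
    # Fall back to 16 bit units; characters above U+FFFF take two units.
    units = len(text.encode("utf-16-be")) // 2
    return ("16-bit", units, MAX_OCTETS // 2)
```

Note that `GSM_BASIC.index("@")` is 0, which is exactly the awkward coding mentioned above, and that a Euro symbol counts as two characters.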
The big issue is most text gateways are ASCII or some such, and do not map to/from these character sets. Even when XML is used that handles UTF-8, the systems rarely give enough attention to detail to translate characters correctly. We have taken the view that the only right way to do things is to use UTF-8 coding for our interfaces with customers for texts and for us to do the translations right! For this reason we have been nagging the mobile operator, and they have finally come through for us.
The good news today is that the low level raw interface has been opened up, allowing texts to and from our voice SIMs to use any of these character codings and UDH.
But even with all of that, the Pile of poo is extra special. It is 1F4A9 which is too big even for UCS16 coding. The trick is to use UTF-16 to use two of the UCS16 codes (total 32 bits) to code it. To my utter surprise this actually works and iPhones handle it!
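The UTF-16 trick is plain arithmetic. A minimal sketch (the function name `to_surrogates` is made up for illustration) of how one code point above FFFF becomes two 16 bit codes:

```python
def to_surrogates(code_point):
    """Split a code point above U+FFFF into a UTF-16 (high, low) pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a pair")
    offset = code_point - 0x10000        # 20 bits left to carry
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


high, low = to_surrogates(0x1F4A9)
# Python's own UTF-16 encoder produces the same four bytes.
same_as_codec = "\U0001F4A9".encode("utf-16-be") == bytes(
    [high >> 8, high & 0xFF, low >> 8, low & 0xFF]
)
```

For 1F4A9 the pair comes out as D83D and DCA9, so the pile of poo uses up two of the 70 characters in a 16 bit text.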
We are gradually integrating various aspects of our new texting system now. The clean interface to and from our mobile SIMs is a really good start. If we can get other mobiles and even land line numbers all integrated more seamlessly, that will be even better.
Are those surrogate characters in the title, because I see two ORCs (object replacement characters, but I love the unintentional acronym!) instead of one?
On a side note, if 09F9 was an illegal number, is 1F4A9 a (mildly) profane number now?
Those aren't ORCs. The Object Replacement Character looks like a dotted box, its code is U+FFFC, and it means roughly "A thing was supposed to be here, but it couldn't be represented as text". What you're seeing (or at least, should be seeing) is U+FFFD Replacement Character, which appears as a diamond with an inverse question mark symbol and means roughly "A character was supposed to be here, but some sort of error occurred". Unicode specifies that when something goes wrong while processing Unicode data and real error handling (e.g. throwing a Java Exception) is not possible, each code unit causing an error should be replaced by U+FFFD instead. This prevents many text processing bugs from becoming security bugs instead.
Yes, the post title seems to be two surrogate characters (which are invalid characters in UTF-8, the page encoding).
If I put a pile of poo into this comment and hit 'Preview' and then 'Edit' then Blogger gives me two surrogate characters so I suspect it might simply be broken, but let me try just posting without editing again.. 💩
The email sent to me to approve the post had proper utf-8 pile-of-poo characters, but looks like blogger is being rather odd on this on the web page even for comments. Strange.
Upon downloading the page with wget and then looking at a hexdump, it's actually serving up & # 55357; & # 56489; without the spaces, so there's no browser funkiness going on here.
You would think if Blogger's going to go to the effort of replacing Unicode characters with escapes, it would be smart enough to recognise surrogates too!
Well yes, but it is Blogger that generated the surrogates - I posted a UTF-8 character. Annoying.
Aha! Well this post spectacularly killed my RSS reader. ttrss failed to insert the record into MySQL so that gives me something to look into!
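For anyone wanting to check the two numbers seen in the hexdump: 55357 and 56489 are decimal for D83D and DCA9. A minimal sketch (the function name `from_surrogates` is made up for illustration) of gluing such a pair back into one code point:

```python
def from_surrogates(high, low):
    """Combine a UTF-16 (high, low) surrogate pair into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# The two decimal values from the numeric character references.
code_point = from_surrogates(55357, 56489)
character = chr(code_point)
```

The result is 1F4A9, so the two references are the pile of poo split in half rather than two unrelated characters.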
I don't know about TTRSS, but I can help with MySQL... Unicode is composed of 17 planes. The first plane contains most of the characters for existing languages, so MySQL (and I suspect quite a few other programs) has implemented only the first plane and calls that UTF-8, which is wrong wrong wrong. OK, the 16 remaining planes contain mostly dead language stuff, but that's also where Pile of Poo is, and who can live without that?
What you want to use is what MySQL calls 4 byte UTF-8 (utf8mb4), which is really the bona fide UTF-8 with a fancy name. To activate that across the board when I first installed MariaDB (it should be the same with MySQL), I created a .cnf file in my mysql/conf.d/ folder, containing the following:
[mysqld]
character-set-client-handshake = FALSE
character-set-server = utf8mb4
collation-server = utf8mb4_unicode_ci
I suspect you'll have to look into converting your current tables before using the above, but at least that should give you a good starting point.
It arrived in my feed reader correctly encoded...
I suspect what has happened is somehow you've gotten two surrogates UTF-8 encoded in your post. Somewhere along Blogger's E-Mail chain, and somewhere along the route to my feed reader, some software has converted these to UTF-16 using a non-validating parser. At this point, the surrogates have correctly gotten shoved together in UTF-16. When they came back out, well, they came back out as valid UTF-8.
This could quite easily happen if there was, say, some Python or Java in between the two.
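A minimal sketch of how that could play out in Python. This is a guess at the mechanism, not a trace of what Blogger actually does; the variable names are made up, and the `surrogatepass` error handler stands in for a non-validating parser.

```python
# Two lone surrogates that were (wrongly) encoded one at a time as UTF-8.
bad_utf8 = "\ud83d\udca9".encode("utf-8", "surrogatepass")

# A strict UTF-8 decoder refuses this input outright.
try:
    bad_utf8.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# A lenient decoder lets the two surrogates through unchanged...
lenient = bad_utf8.decode("utf-8", "surrogatepass")

# ...and once they sit side by side in UTF-16 they form a valid pair.
utf16 = lenient.encode("utf-16-le", "surrogatepass")
repaired = utf16.decode("utf-16-le")

# Coming back out, the result is ordinary four byte UTF-8.
good_utf8 = repaired.encode("utf-8")
```

Here `bad_utf8` is six bytes (ED A0 BD ED B2 A9) while `good_utf8` is the proper four (F0 9F 92 A9), which would explain a reader downstream seeing a correct character.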