RevK®'s ramblings: punycode

2013-08-17

punycode

Proper rant this time.

I have been tinkering with EPP and domain registration this week - as Nominet had changed their EPP schemas and broken our tools a little while back and so I had to update things. It is not that bad a system, using XML, but heavily in to use of namespaces (which is always fun with XML). For my next trick I have to try and use EPP for the com/net/org and other TLDs - which is different yet again.

But this did get me, once again, annoyed at punycode.

So, a bit of basic background here first.

For a long time computers used very basic western characters. There were a number of character sets and encoding systems. In early computers there were 5 bit bytes, and 6 bit bytes, and all sorts. It is interesting to see some of the stuff in the National Museum of Computing where they have 5 hole paper tape running through the first computers even created. The basic trick is to map letters and numbers and symbols on to binary codes (which are usually written as numbers in decimal or octal or hexadecimal).

A standard did emerge, ASCII, the American Standard Code for Information Interchange. This was 7 bits (so 128 combinations) and mixed up the concept of simple coding for characters with controls for things like a teletype - so ASCII includes carriage return and line feed and even characters like bell which rings the bell of the teletype. It also included symbols for start and end of blocks and records which was used on mag tape and the like. But the main body of ASCII is the coding for letters and numbers and symbols commonly used in America.

This, of course, ran in to problems quite quickly as we are not all American. Some characters caused constant confusion including the UK currency £ symbol. We would call it a pound symbol, but the word pound in American was used for the # symbol, what we call hash. That alone caused confusion when simply discussing characters by name.

We also have the problem with all these EU countries which use a mostly American character set but have accents on their characters. And then you get the Greeks thrown in to the mix.

The fact that computers were, by then, commonly using 8 bit bytes meant that one could use ASCII for half of the character space, and something else for the rest. This led to a whole range of ISO character sets and many non standard character sets, which used the top 128 codes for different purposes, including various EU accented characters, and, of course, symbols so you can draw boxes and basic graphics.

Thankfully we now have a more universal system for numbering the symbols and letters and characters we us, UNICODE. It does not try and fit characters in to 8 bits. In fact it allows a lot more, but most characters fit in under 16 bits. UNICODE does not just do the accented EU characters, and Greek, but Chinese, Japanese, lots of graphics and symbols, and even Klingon, though these do not all fit in 16 bits. The problem then is how you represent these on a computer. Some went for 16 bit bytes, or wide characters which was common on Windows. Another approach, which is much more common on unix based systems, and standard for coding things like XML, is UTF-8.

UTF-8 uses a very simple trick which makes it compatible with a lot of systems which are not expecting anything special in terms of characters. The first 128 characters are normal ASCII, but the next 128 codes are used as a sequence of bytes to represent UNICODE characters. The lower numbered codes fit in two bytes, and higher fit in three bytes, and so on. There are some nice properties of UTF-8, for example, a normal byte by byte comparison will still compare two strings correctly as higher or lower alphabetically based on the UNICODE characters (i.e. the same order as using 16 or 32 bit bytes for the same UNICODE characters). Also, by ignoring a specific block of 64 bytes one can count how many characters are in a string. There is no use of NULL in the special coding, so strings can still end with a NULL (as used in C and some other languages). Indeed, the special coding never clashes with ASCII, so searching for an ASCII character will always find the character you are looking for and not part of a special character coding. UTF-8 is nice. I like UTF-8. It should be the standard for all character coding. It is the default for many systems now (like XML).

Then we get to punycode. It just annoys the hell out of me, and I have yet to see a good reason for it.

Basically the idea is to allow international domain names, i.e. using some of these nice UNICODE characters in domains. This is, in principle, a great idea, as domain names are even more restricted than ASCII, only using letters, numbers and selected use of a hyphen.

But think how most systems use a domain name - they may possibly parse out the domain name, e.g. what is between an @ and end of line, or > character in an email address, or what is between http:// and / on a URL. Typically the parsing is looking for a standard ASCII delimiter and not really taking much notice of the characters in the domain part. When used as a command line argument the delimiter may be simply a space. The application then passes this to the machines DNS resolution libraries.

It is really useful that the applications are not looking for anything within the domain/host name as application writers are notoriously bad at making their own syntax checks match the RFCs or keeping them up to date. The number of times my valid email addresses are rejected by some system is just crazy.

So the DNS library may do some checks on the domain passed - for a start, telling if it is an IP address literal or a domain name. But it will then, typically, just look for the dot delimiters and code a request using the DNS protocol. The protocol has no problem with any characters at all within the parts of a domain name, and even allows NULL, and even a dot, within the parts of a domain! It certainly has no issue whatsoever with UTF-8 coding.

So, most applications would parse out a domain name, pass to a library, which sends to a name server. There are very few name servers in use - bind is perhaps one of the most common - and these could easily be made to handle UTF-8 coding if necessary (by simplifying or removing sanity checks they have in place now). In practice, old versions of these resolvers were quite happy with unexpected characters, and already have to cope with characters outside the normal domain set such as underscore used for SRV records.

So with very little tweaking, and in fact no tweaking at all in many cases, most applications, libraries and resolvers could handle UTF-8.

I actually tested some older browsers and applications and they did just this - parsing out the domain with unicode in it, passing to the library which passed to the caching resolver which passed to the authoritative resolver. It just worked.

But no, this was not to be. Instead, someone, decided this was a bad idea. They decided that we should force UNICODE in to the letters/number/hyphen format for domains. Resolvers actually got updated to add extra checks. And we have this crazy system where special characters in domains are coded as a string starting xn-- and using only letters/numbers and hyphens as per normal domain names.

This means every application has to be updated to handle punycode. It is not just done at the resolver library, it is done at the application layer, both parsing and coding of strings in protocol messages, and displaying these domains to the end user. It is horrid and messy and there really is no good reason for it.

At the end of the day the decision was (a) minor change to a very limited number of libraries and resolvers, or (b) change every single application that uses domains, as well as the libraries and resolvers. It seems we went for the latter. Arrrg!

Just to tackle the obvious comment I will get - there is good reason for registries to limit the combinations of characters they allow to be registered for UNICODE based domain names. There are lots of symbols which look identical to normal western characters and can create domains that look identical to trusted companies domains and can be used for fraud and scams. But that make sense whether the system used UTF-8 or punycode to do it.

Anyway, I should have http://☺.aa.net.uk/ working now... But your browser has to convert ☺ in to xn--74h

15 comments:

Tony HoyleSaturday, 17 August 2013 at 18:06:00 BST
That's just silly, expecting it to be done by the applications.. So I can't just do a gethostbyname on ☺.aa.net.uk and expect it to work?

/headdesk
ReplyDelete
Replies
Ray BellisSunday, 18 August 2013 at 07:41:00 BST
There are plenty of good reasons for Punycode. Whilst DNS is technically capable of using 8-bit "names", DNS names are case-insensitive, so special handling would have been required at the DNS layer to prevent confusion with any of the bytes representing ASCII letters if they appeared as subsequent bytes in a UTF-8 code point. Also, some scripts support "right to left" instead of "left to right" display, but this is a presentation issue best resolved at the application layer.
ReplyDelete
Replies
Cecil WardSunday, 18 August 2013 at 07:55:00 BST
http://www.w3.org/International/articles/idn-and-iri/#why
ReplyDelete
Replies
AnonymousMonday, 19 August 2013 at 09:45:00 BST
As far as I remember, the excuse for Punycode was "UTF-8 is haaaaaaard". This was after all 2003, when some people were still trying to pretend that UTF-8 wasn't the winner of the Unicode encoding fights, and still using UTF-16 and other horrors.
ReplyDelete
Replies
BWMonday, 19 August 2013 at 09:56:00 BST
"Anyway, I should have http://☺.aa.net.uk/ working now... But your browser has to convert ☺ in to xn--74h"

Actually it doesn't, because you (or Blogger) already did it. The URL in the link is http://xn--74h.aa.net.uk/

I did try putting http://☺.aa.net.uk/ into my address bar. It worked fine.

ReplyDelete
Replies
FawksieMonday, 19 August 2013 at 12:34:00 BST
Hilariously, when I try accessing http://☺.aa.net.uk/ from the BT intranet, it translates it to http://xn--74h.aa.net.uk/ and then says:

Network Error (dns_unresolved_hostname)
Your requested host "☺.aa.net.uk" could not be resolved by DNS.
Your request was categorized by Blue Coat Web Filter as 'Computers/Internet'.

Too much Internet.
ReplyDelete
Replies
John BurtonMonday, 19 August 2013 at 13:01:00 BST
Oh now i really have to set up some smiley face subdomains!
ReplyDelete
Replies

Add comment

Comments are moderated purely to filter out obvious spam, but it means they may not show immediately.

RevK^®'s ramblings

2013-08-17

punycode

15 comments:

Oh, Amazon, you are crazy

Rules

Rules

Report Abuse