2013-08-17

punycode

Proper rant this time.

I have been tinkering with EPP and domain registration this week - Nominet changed their EPP schemas a little while back, breaking our tools, so I had to update things. It is not that bad a system, using XML, but heavily into the use of namespaces (which is always fun with XML). For my next trick I have to try and use EPP for the com/net/org and other TLDs - which is different yet again.

But this did get me, once again, annoyed at punycode.

So, a bit of basic background here first.

For a long time computers used very basic western characters. There were a number of character sets and encoding systems. In early computers there were 5 bit bytes, and 6 bit bytes, and all sorts. It is interesting to see some of the stuff in the National Museum of Computing, where they have 5 hole paper tape running through some of the first computers ever created. The basic trick is to map letters and numbers and symbols on to binary codes (which are usually written as numbers in decimal or octal or hexadecimal).

A standard did emerge: ASCII, the American Standard Code for Information Interchange. This was 7 bits (so 128 combinations) and mixed up the concept of simple coding for characters with controls for things like a teletype - so ASCII includes carriage return and line feed, and even characters like bell, which rings the bell on the teletype. It also included symbols for the start and end of blocks and records, which were used on mag tape and the like. But the main body of ASCII is the coding for letters and numbers and symbols commonly used in America.

This, of course, ran into problems quite quickly as we are not all American. Some characters caused constant confusion, including the UK currency £ symbol. We would call it a pound symbol, but the word pound in America was used for the # symbol, what we call hash. That alone caused confusion when simply discussing characters by name.

We also have the problem of all the European countries which use a mostly American character set but put accents on their characters. And then you get the Greeks thrown into the mix.

The fact that computers were, by then, commonly using 8 bit bytes meant that one could use ASCII for half of the character space, and something else for the rest. This led to a whole range of ISO character sets and many non-standard character sets, which used the top 128 codes for different purposes, including various European accented characters, and, of course, symbols so you can draw boxes and basic graphics.

Thankfully we now have a more universal system for numbering the symbols and letters and characters we use: UNICODE. It does not try and fit characters into 8 bits. In fact it allows a lot more, but most characters fit in under 16 bits. UNICODE does not just do the accented European characters, and Greek, but Chinese, Japanese, lots of graphics and symbols, and even Klingon (unofficially, via the private use area), though these do not all fit in 16 bits. The problem then is how you represent these on a computer. Some went for 16 bit bytes, or wide characters, which was common on Windows. Another approach, which is much more common on unix based systems, and standard for coding things like XML, is UTF-8.

UTF-8 uses a very simple trick which makes it compatible with a lot of systems which are not expecting anything special in terms of characters. The first 128 characters are normal ASCII, but the top 128 codes are used in sequences of bytes to represent UNICODE characters. The lower numbered characters fit in two bytes, higher ones in three bytes, and so on. There are some nice properties of UTF-8. For example, a normal byte by byte comparison will still compare two strings correctly as higher or lower alphabetically based on the UNICODE characters (i.e. the same order as using 16 or 32 bit bytes for the same UNICODE characters). Also, by ignoring a specific block of 64 byte codes (the continuation bytes) one can count how many characters are in a string. There is no use of NULL in the special coding, so strings can still end with a NULL (as used in C and some other languages). Indeed, the special coding never clashes with ASCII, so searching for an ASCII character will always find the character you are looking for and not part of a special character coding. UTF-8 is nice. I like UTF-8. It should be the standard for all character coding. It is the default for many systems now (like XML).
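
As a quick illustration of that counting property, here is a minimal sketch in C (my own example, with a made-up utf8_strlen helper, not from any library) which counts characters simply by skipping that block of 64 continuation byte codes (0x80 to 0xBF):

#include <stdio.h>
#include <string.h>

/* Count UNICODE characters in a UTF-8 string by skipping the 64
 * continuation byte codes (0x80-0xBF). Every character starts with
 * exactly one byte outside that range, so counting those is enough. */
size_t utf8_strlen(const char *s)
{
   size_t count = 0;
   for (; *s; s++)
      if (((unsigned char) *s & 0xC0) != 0x80)
         count++;
   return count;
}

int main(void)
{
   const char *host = "\xE2\x98\xBA.aa.net.uk"; /* ☺.aa.net.uk - the ☺ is three bytes */
   printf("%zu characters in %zu bytes\n", utf8_strlen(host), strlen(host));
   /* prints: 11 characters in 13 bytes */
   return 0;
}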

Then we get to punycode. It just annoys the hell out of me, and I have yet to see a good reason for it.

Basically the idea is to allow international domain names, i.e. using some of these nice UNICODE characters in domains. This is, in principle, a great idea, as domain names are even more restricted than ASCII, only using letters, numbers and selected use of a hyphen.

But think how most systems use a domain name - they may possibly parse out the domain name, e.g. what is between an @ and end of line, or a > character, in an email address, or what is between http:// and / in a URL. Typically the parsing is looking for a standard ASCII delimiter and not really taking much notice of the characters in the domain part. When used as a command line argument the delimiter may be simply a space. The application then passes this to the machine's DNS resolution libraries.
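
To make the point, here is a sketch of the sort of parsing a typical application does (hypothetical code, with a made-up extract_host helper, purely for illustration): find the ASCII delimiters and take whatever bytes sit between them, so UTF-8 in the host part passes through untouched.

#include <stdio.h>
#include <string.h>

/* Pull the host out of a URL the way most applications do: skip any
 * "://", then take everything up to the next '/' or ':' delimiter.
 * Nothing here inspects the bytes of the host itself. */
static void extract_host(const char *url, char *host, size_t max)
{
   const char *p = strstr(url, "://");
   p = p ? p + 3 : url;
   size_t n = strcspn(p, "/:");
   if (n >= max)
      n = max - 1;
   memcpy(host, p, n);
   host[n] = 0;
}

int main(void)
{
   char host[256];
   extract_host("http://\xE2\x98\xBA.aa.net.uk/", host, sizeof(host));
   printf("host: %s\n", host); /* host: ☺.aa.net.uk */
   return 0;
}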

It is really useful that the applications are not looking for anything within the domain/host name as application writers are notoriously bad at making their own syntax checks match the RFCs or keeping them up to date. The number of times my valid email addresses are rejected by some system is just crazy.

So the DNS library may do some checks on the domain passed - for a start, telling whether it is an IP address literal or a domain name. But it will then, typically, just look for the dot delimiters and encode a request using the DNS protocol. The protocol has no problem with any characters at all within the parts of a domain name, and even allows a NULL, or even a dot, within the parts of a domain! It certainly has no issue whatsoever with UTF-8 coding.
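
You can see why from the wire format: each part (label) of the name is sent as a length byte followed by its raw bytes, so the bytes themselves are never interpreted. A quick sketch of that encoding (again my own illustrative code, with a made-up encode_qname function, not taken from any particular resolver):

#include <stdio.h>
#include <string.h>

/* Encode a domain name into DNS wire format: each label is a length
 * byte (1-63) followed by its raw bytes, terminated by a zero byte.
 * Because of the length prefix, the label bytes are opaque - UTF-8
 * (or even a dot or NULL inside a label, if built by other means)
 * is no problem for the protocol itself. Returns the encoded length,
 * or -1 if a label is empty or too long. */
static int encode_qname(const char *name, unsigned char *out)
{
   unsigned char *p = out;
   while (*name)
   {
      size_t n = strcspn(name, ".");
      if (n == 0 || n > 63)
         return -1;
      *p++ = (unsigned char) n;
      memcpy(p, name, n);
      p += n;
      name += n;
      if (*name == '.')
         name++;
   }
   *p++ = 0;
   return (int) (p - out);
}

int main(void)
{
   unsigned char wire[256];
   int len = encode_qname("\xE2\x98\xBA.aa.net.uk", wire);
   for (int i = 0; i < len; i++)
      printf("%02X ", wire[i]);
   printf("\n"); /* 03 E2 98 BA 02 61 61 03 6E 65 74 02 75 6B 00 */
   return 0;
}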

So, most applications would parse out a domain name, pass to a library, which sends to a name server. There are very few name servers in use - bind is perhaps one of the most common - and these could easily be made to handle UTF-8 coding if necessary (by simplifying or removing sanity checks they have in place now). In practice, old versions of these resolvers were quite happy with unexpected characters, and already have to cope with characters outside the normal domain set such as underscore used for SRV records.

So with very little tweaking, and in fact no tweaking at all in many cases, most applications, libraries and resolvers could handle UTF-8.

I actually tested some older browsers and applications and they did just this - parsing out the domain with UNICODE in it, passing to the library which passed to the caching resolver which passed to the authoritative resolver. It just worked.

But no, this was not to be. Instead, someone decided this was a bad idea. They decided that we should force UNICODE into the letters/numbers/hyphen format for domains. Resolvers actually got updated to add extra checks. And we have this crazy system where special characters in domains are coded as a string starting with xn-- and using only letters, numbers and hyphens as per normal domain names.

This means every application has to be updated to handle punycode. It is not just done at the resolver library, it is done at the application layer, both parsing and coding of strings in protocol messages, and displaying these domains to the end user. It is horrid and messy and there really is no good reason for it.
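
Just to show the extra work involved, this is roughly what every application now has to do before a name ever reaches DNS. A sketch using GNU libidn (one library of several; note it implements the older IDNA2003 rules, which still allow symbols like ☺ - the newer IDNA2008 rules disallow most symbols, so other libraries may reject this example):

#include <stdio.h>
#include <stdlib.h>
#include <idna.h> /* GNU libidn - link with -lidn */

int main(void)
{
   char *ascii = NULL;
   /* Convert a UTF-8 hostname to the xn-- (punycode) form that has
    * to be used in DNS lookups and in protocols like email. */
   if (idna_to_ascii_8z("\xE2\x98\xBA.aa.net.uk", &ascii, 0) == IDNA_SUCCESS)
   {
      printf("%s\n", ascii); /* xn--74h.aa.net.uk */
      free(ascii);
   }
   return 0;
}

And the reverse mapping (idna_to_unicode_8z8z in the same library) is needed anywhere the name is displayed back to a user - which is exactly the duplication of effort across every application that I am complaining about.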

At the end of the day the decision was (a) minor change to a very limited number of libraries and resolvers, or (b) change every single application that uses domains, as well as the libraries and resolvers. It seems we went for the latter. Arrrg!

Just to tackle the obvious comment I will get - there is good reason for registries to limit the combinations of characters they allow to be registered for UNICODE based domain names. There are lots of symbols which look identical to normal western characters and so can create domains that look identical to trusted companies' domains and can be used for fraud and scams. But that makes sense whether the system uses UTF-8 or punycode to do it.

Anyway, I should have http://☺.aa.net.uk/ working now... But your browser has to convert ☺ into xn--74h.

15 comments:

  1. That's just silly, expecting it to be done by the applications... So I can't just do a gethostbyname on ☺.aa.net.uk and expect it to work?

    /headdesk

    1. Correct. Though, I suppose, just maybe, some DNS libraries may punycode it for you. But then usage in protocols like email has to use the xn-- punycode format, so the lookup is not the whole story, sadly.

  2. There are plenty of good reasons for Punycode. Whilst DNS is technically capable of using 8-bit "names", DNS names are case-insensitive, so special handling would have been required at the DNS layer to prevent confusion with any of the bytes representing ASCII letters if they appeared as subsequent bytes in a UTF-8 code point. Also, some scripts support "right to left" instead of "left to right" display, but this is a presentation issue best resolved at the application layer.

    1. That does not justify punycode, and in fact supports my argument against it. If special handling is needed, it is far better for that to be done in one place - in the very few different codebases of DNS authoritative servers that exist - than in every application ever written that works with domain names. If I do a dig on AA.net.uk, then the DNS request sent is AA.net.uk, not aa.net.uk, and the DNS resolver does the case insensitive handling now. Whatever rules are needed for IDN, which can be a restricted set of characters specifically supported in IDN, can be done at the DNS server just as they are for normal western characters being case insensitive.

    2. Amongst the many rules of thumb in Internet protocol design:
      1) There's lots of broken stuff out there - the protocol spec might say something should work, but if there are enough implementations that don't do it right you'll have a real problem in deploying a new protocol or protocol extension that relies on them doing the right thing.
      2) Upgrading things is hard - it might be a "simple change", but if that's in enough deployed kit, it won't happen (or, at least, only very slowly and you'll need to remain backward compatible).

      In this case, the things I would tend to worry about would be broken resolver implementations in devices such as DSL routers. There are sure to be a good number of those that mangle non-7-bit queries or responses (look at the studies on trying to get DNSSEC or EDNS0 working for similar examples where things should work, but in reality implementations are broken). There are also lots of places that DNS names are used in protocols other than DNS itself. There are sure to be some of those that don't handle 8-bit names properly.

      In an ideal world maybe punycode wouldn't have been needed. But on the real Internet, unfortunately, providing backward compatibility with slightly broken implementations that can't easily be upgraded is a necessary engineering compromise.

      (Interestingly, one way of trying to deal with the broken DSL router issue is the "Designed for Windows 7" sticker. To be allowed to use the sticker you have to handle things like TCP window scaling properly, so forcing manufacturers to fix such bugs.)

    3. That is a tricky argument - it suggests that everything we do cannot rely on what was done before, and that every RFC should re-invent IP, for example. Yes, dodgy implementations on DSL routers are a possible issue, as they are one of the few places that DNS resolvers happen and are not easily updated. However, they are also (a) replaced every few years anyway, and IDN has taken long enough to come about, (b) not, IMHO, that likely to be an issue, as simpler and older implementations are the ones that appear to "just work" and would have had to code unnecessary extra checks in order to not work, and (c) typically made in countries that actually have an interest in IDN working. Punycode is not just expecting that people will replace their DSL routers within a few years, it is expecting them to replace everything else too: PCs, laptops, tablets, phones, and so on.

    4. It's certainly not a definitive argument, but is one of those things that have to be considered in the mix when doing protocol design for use on the Internet.

      I suppose the main thing I was trying to get across was that saying "The specification for Protocol X supports Feature Y, therefore we can rely on Feature Y when doing new stuff." (in this case, DNS with 8-bit values) isn't always sufficient in the real world, particularly where Feature Y is some rarely used aspect (so may often be unimplemented or broken). (If it was the case that you could make this assumption, I could have some of the weeks of my life back spent thinking about protocol traversal across NATs...)

      However, returning more specifically to punycode (as bits of history are dragged up from the depths of my brain): I believe the other argument I touched on may be more important for the "why punycode?" discussion. Not all protocols that contain "host names" are 8 bit clean. Mail (even in RFC5322) is a good example; hostnames in SIP URIs (RFC3261) are another. So we need a representation of DNS names in a restricted (ASCII-subset) character set. (Assuming that forcing all mail receiving systems to be upgraded to a new version which supports Unicode hostnames isn't an option.)

      An RFC5322 mail server is going to get grumpy on receiving an email from revk@☺.aa.net.uk. We could try making it a mail-specific thing, where your internationalization (i18n) aware mail system converts the domain of your email address to, say, zz--smiley-face.aa.net.uk before sending it to me, but how do I, on my i18n-unaware mail system, send mail back to you? To do that my mail server needs to be able to look up the MX record for zz--smiley-face.aa.net.uk - so we need something punycode-like that is carried by the DNS protocol. (We could then look at making an argument for supporting both the 'encoded' and 8-bit forms, but it looks like we need the 'encoded' one as a minimum.)

  3. http://www.w3.org/International/articles/idn-and-iri/#why

    1. That says why IDN but not why punycode. Even the issue of "how would you type" a Japanese URL does not help, as knowing what xn-- name to type is not obvious. That text even says: in practice, it makes sense to register two names for your domain - one in your native script, and one using just regular ASCII characters.

  4. As far as I remember, the excuse for Punycode was "UTF-8 is haaaaaaard". This was after all 2003, when some people were still trying to pretend that UTF-8 wasn't the winner of the Unicode encoding fights, and still using UTF-16 and other horrors.

  5. "Anyway, I should have http://☺.aa.net.uk/ working now... But your browser has to convert ☺ in to xn--74h"

    Actually it doesn't, because you (or Blogger) already did it. The URL in the link is http://xn--74h.aa.net.uk/

    I did try putting http://☺.aa.net.uk/ into my address bar. It worked fine.

  6. Hilariously, when I try accessing http://☺.aa.net.uk/ from the BT intranet, it translates it to http://xn--74h.aa.net.uk/ and then says:

    Network Error (dns_unresolved_hostname)
    Your requested host "☺.aa.net.uk" could not be resolved by DNS.
    Your request was categorized by Blue Coat Web Filter as 'Computers/Internet'.

    Too much Internet.

    1. Works fine through the Barracuda filter at one of my sites ;)

    2. Sounds like we (actually not we, more anybody but us) will have fun with smiley and other Unicode URLs once the powers that be impose their great internet "modesty cover"/filter...
      How do you write pr0n in Japanese?

  7. Oh, now I really have to set up some smiley-face subdomains!


