So, just started looking at the HMRC RTI stuff. I mean, how hard can it be...
The first step is that, to send the data to HMRC, you have to go via the Government Gateway. The whole lot is XML, which is not, in itself, an issue. XML is fine, it is the way people use XML that drives one mad.
So, the gov gateway uses a SOAP style envelope posted over HTTPS - simple stuff. The envelope includes login credentials, which can be plain text, MD5, or properly signed with a certificate. MD5 is not much better than plain text, as the hash can be replayed just as easily, but at least it does not expose the original password. In any case it is all over HTTPS, so not too bad either way.
The MD5 is interesting as the steps are (a) convert to lower case and UTF-8, (b) make the MD5, (c) encode as base64. So I wonder, why make the password case insensitive? Who does that? Then I pondered more - UTF-8 means they understand it could be non-ASCII. Do I have to convert an upper case Omega to lower case? What of accented Latin characters, or Latin characters followed by combining accent characters? Doing all that would be a lot of work. I decided not to bother asking and just use tolower() and hope for the best :-)
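For what it is worth, the three steps can be sketched in a few lines of Python. This is a minimal sketch, not the gateway's reference implementation: the function name is made up, and it uses Python's str.lower(), which does lowercase non-ASCII letters such as Omega, so it may well disagree with a plain C tolower() (and with whatever the gateway actually does) for anything outside ASCII.

```python
import base64
import hashlib


def gateway_md5_password(password: str) -> str:
    """Hypothetical helper: lower case, UTF-8, MD5, then base64."""
    # (a) Lower-case the password and encode it as UTF-8.
    # Assumption: str.lower() is close enough; it is Unicode-aware,
    # unlike C's tolower(), so results can differ for non-ASCII input.
    lowered = password.lower().encode("utf-8")
    # (b) Take the raw 16-byte MD5 digest (not the hex string).
    digest = hashlib.md5(lowered).digest()
    # (c) Base64-encode the raw digest.
    return base64.b64encode(digest).decode("ascii")


if __name__ == "__main__":
    # Case differences in the input make no difference to the result.
    print(gateway_md5_password("PassWord") == gateway_md5_password("password"))
```

Note that the base64 is taken over the raw digest bytes, which gives a 24 character string; encoding the hex text instead is an easy mistake to make.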
Then we get to the actual payload, a single XML object within the envelope. Again, standard stuff, but for HMRC this needs a hash. This time it is a SHA1, base64 coded. This would not be too bad if it was a SHA1 of the data as sent, but there are some gotchas.
For a start, the hash is within the message, so you have to make the message, then make a hash, then remake the message with the hash within it. You have to be sure of making the object the same both times (which is fine unless your code uses counts of namespace usage to control local tagging, etc., in which case adding one more item could change the output).
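The make, hash, remake dance looks roughly like this in Python. It is only a sketch of the two-pass idea: build_message() and the <Data> element are hypothetical, the payload is assumed to be plain text needing no escaping, and it hashes the raw text, ignoring the canonicalisation and envelope wrinkles that make the real thing harder.

```python
import base64
import hashlib
from typing import Optional


def build_message(payload: str, irmark: Optional[str] = None) -> str:
    """Hypothetical builder: identical output both times, bar the IRmark."""
    mark = f"<IRmark>{irmark}</IRmark>" if irmark is not None else ""
    return f"<IRenvelope>{mark}<Data>{payload}</Data></IRenvelope>"


def sha1_base64(text: str) -> str:
    """SHA1 of the UTF-8 bytes, base64 coded (28 characters)."""
    digest = hashlib.sha1(text.encode("utf-8")).digest()
    return base64.b64encode(digest).decode("ascii")


def build_with_mark(payload: str) -> str:
    # Pass 1: build the message without the hash and hash that.
    unmarked = build_message(payload)
    mark = sha1_base64(unmarked)
    # Pass 2: rebuild the same message, this time with the hash inside.
    return build_message(payload, mark)


if __name__ == "__main__":
    print(build_with_mark("hello"))
```

The important property is that stripping the IRmark element from the final message gives back exactly the bytes that were hashed, which is why the builder must be deterministic.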
What is worse though is that the hash has to be over canonical XML. This would be fine if it was just textual normalisation, but it is not - you have to order the attributes in each element alphabetically. Well, again, not too bad until you realise the ordering uses the namespace URI that each prefix maps to, not the prefix itself. This means full XML parsing in order to normalise the output. That would be fine, except the spacing between elements is included in the hash, so you have to retain all the extraneous spacing, which my XML parser does not do. Anyway, as we are only generating this, we can generate normalised canonical XML in the first place and so make a hash.
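To see the attribute ordering rule in action, here is a small Python experiment using the standard library's canonicalize(). Big caveat: that function implements C14N 2.0, and I am assuming nothing about which canonicalisation variant HMRC requires, so treat this purely as an illustration of ordering by namespace URI and not as a way to produce the real hash input. The sample document and URIs are made up.

```python
from xml.etree.ElementTree import canonicalize

# Made-up sample: prefix "a" maps to a URI that sorts AFTER the URI
# that prefix "b" maps to, so prefix order and URI order disagree.
SAMPLE = (
    '<doc xmlns:b="http://a.example/" xmlns:a="http://b.example/" '
    'a:x="1" b:x="2" z="3" c="4"/>'
)

# canonicalize() returns the canonical form as a string.
canonical = canonicalize(SAMPLE)

if __name__ == "__main__":
    print(canonical)
```

In the result the attributes with no namespace come first (c, then z), followed by the prefixed ones ordered by URI, so b:x lands before a:x even though "a" sorts before "b" as a prefix.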
So far so good. You make a hash of the HMRC message, embed it in the message, and then pass it to the gov gateway handling code to include in an envelope and send. Well, no! The hash is not over the HMRC message (i.e. from the start of <IRenvelope ...> to the end of </IRenvelope>), it is over the encompassing element in the sending envelope. This means the embedded IRmark hash has to be worked out in light of the sending protocol. In theory it is just adding <Body> to the start and </Body> to the end, except the code has to know the spacing that the wrapper will use between <Body> and <IRenvelope>. Also, just for fun, the <Body> is not hashed as sent, but with the xmlns it inherits from its parent envelope added. In short, you can only add the IRmark in the envelope handling code, which mixes the logic between layers - messy.
Anyway, once you get past all that, there is another hash. This time for the BACS payment, but not MD5, or SHA1, no, this time SHA256. And not base64 coded, but lower case hex coded. Just to add more to the mix.
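At least the encoding part of this one is trivial. A minimal Python sketch, where the helper name is hypothetical and the input bytes are a placeholder, since exactly which fields go into the BACS hash is not covered here:

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Hypothetical helper: SHA256 as 64 lower case hex characters."""
    # hexdigest() already returns lower case hex, so no extra step needed.
    return hashlib.sha256(data).hexdigest()


if __name__ == "__main__":
    # Placeholder input only; the real input is whatever the spec says.
    print(sha256_hex(b"placeholder"))
```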
Anyway, now to integrate with payroll...
Update: This is all simple stuff, and HMRC were keen for us to be ready by 6th April, but they are closed! Their systems are down for maintenance! Arrrg.