Subject: The V3 to V2 key conversion algorithm documented in 10.1.1 of the 3.0.1 standard

As an exercise in testing clarity I thought I'd try implementing the algorithm outlined in 10.1.1. My findings are that it's incompletely specified, and could do with more detail.

The first step specifies the use of "the bytes of the normalized form" of the key. Sounds simple enough...

Comment: The "normalized form" of the key is documented in section 4.4 - it might be worth pointing at that section from this algorithm. It might also be a good thing to include in the glossary (it isn't there). BTW: section 4.4 refers to a tech report (http://www.unicode.org/unicode/reports/tr21/) which has been superseded by Unicode 4 - should we update our reference?

Problem: what is meant by "the bytes"? I assume this means the bytes of the Unicode representation (given that we are using Unicode), which means that we must worry about endian issues: MD5 operates on bytes rather than characters, and in a 16-bit encoding each character occupies two bytes (or more). Are we requiring a big-endian or little-endian representation? Or are we feeding UTF-8 into the hash? Does this mean we will have issues with UTF-16? We really should specify what is meant by "the bytes".
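To show why this matters, here is a minimal sketch in Python. The key is one I made up for illustration, and the three encodings are just the candidates I can think of; nothing here is taken from the spec. The same key produces a different MD5 digest under each reading of "the bytes":

```python
import hashlib

# Hypothetical key, used only for illustration; not taken from the spec.
KEY = "uddi:example.com:demo"

def md5_of(key: str, encoding: str) -> str:
    """Return the MD5 hex digest of the key encoded with the given codec."""
    return hashlib.md5(key.encode(encoding)).hexdigest()

# Three plausible readings of "the bytes"; the spec text does not say which applies.
digests = {enc: md5_of(KEY, enc) for enc in ("utf-8", "utf-16-be", "utf-16-le")}

for enc, digest in digests.items():
    print(f"{enc:10s} {digest}")
```

Two implementations that pick different encodings will therefore never agree on the converted key, even for plain ASCII input.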

Problem: is the "uddi:" prefix on the key included in the bytes to be hashed? Or do we hash just the portion after that prefix? There's no statement either way, and the fact that the "uuid:" prefix must be added afterwards for tModel keys adds to the confusion.
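The same kind of sketch shows the effect of the prefix question. This assumes UTF-8 input purely so there is something to hash (that choice is my assumption, not the spec's), and the key and the `strip_prefix` helper are hypothetical:

```python
import hashlib

# Hypothetical key, used only for illustration; not taken from the spec.
KEY = "uddi:example.com:demo"
PREFIX = "uddi:"

def strip_prefix(key: str) -> str:
    """Drop a leading 'uddi:' if present; otherwise return the key unchanged."""
    return key[len(PREFIX):] if key.startswith(PREFIX) else key

def md5_hex(text: str) -> str:
    """MD5 hex digest of the text, assuming UTF-8 (an assumption, see above)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()

with_prefix = md5_hex(KEY)                   # hash the whole key
without_prefix = md5_hex(strip_prefix(KEY))  # hash only what follows "uddi:"

print("with prefix:   ", with_prefix)
print("without prefix:", without_prefix)
```

The two digests differ, so the text needs to say which one is intended.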

Problem: I was very confused by the discussion of endian forms in the second step. I assumed that they were only relevant to the data being converted in the third step, because the MD5 hash outputs bytes (by my reading of it, anyway). If that's the case, then we might be well advised to drop the reference to the document and state explicitly which byte goes where in the final result; it would make things simpler for implementors of this algorithm. If we were to say, for example, that the first two characters of the output are the hex representation of byte[3] of the MD5 hash (that's my reading of it), then there's no confusion. Going from the bytes of the MD5 hash across to the pseudo-words of the UUID format, then back to the bytes that correspond to the hex string, seems unnecessary. (Should a coder care to implement it using words, they can look out for endian issues themselves.)
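As a sketch of what "state explicitly which byte goes where" could look like, the following Python compares two candidate mappings from 16 digest bytes to the hex string. The byte order in `hex_field_swapped` encodes my own reading (byte[3] first, with the first three UUID fields treated as little-endian words); it is an assumption, not something the spec confirms. A dummy digest of bytes 0x00 to 0x0f is used so the ordering is visible in the output:

```python
# Dummy 16-byte "digest" (0x00..0x0f) so each byte's position is easy to spot.
DIGEST = bytes(range(16))

# Assumed ordering for my reading: the 4-byte, 2-byte and 2-byte fields are
# byte-swapped, and the remaining 8 bytes are left in place.
SWAPPED_ORDER = [3, 2, 1, 0, 5, 4, 7, 6, 8, 9, 10, 11, 12, 13, 14, 15]

def dashed(hex32: str) -> str:
    """Insert dashes in the usual 8-4-4-4-12 UUID layout."""
    return "-".join([hex32[0:8], hex32[8:12], hex32[12:16], hex32[16:20], hex32[20:32]])

def hex_straight(digest: bytes) -> str:
    """Candidate A: digest bytes written out in their original order."""
    return dashed(digest.hex())

def hex_field_swapped(digest: bytes) -> str:
    """Candidate B: bytes reordered per SWAPPED_ORDER before hex conversion."""
    return dashed(bytes(digest[i] for i in SWAPPED_ORDER).hex())

print(hex_straight(DIGEST))       # 00010203-0405-0607-0809-0a0b0c0d0e0f
print(hex_field_swapped(DIGEST))  # 03020100-0504-0706-0809-0a0b0c0d0e0f
```

A table like `SWAPPED_ORDER` in the text of the algorithm would remove any need for the implementor to think about words or endianness at all.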

I am not getting the right values out of my implementation yet, so I'm certain I haven't all the answers - given that I'm far from a novice coder, I think this clearly indicates that we have work to do on this section.

Tony (Troublemaker) Rogers

tony.rogers@ca.com

uddi-spec message