

Subject: Unicode Normalization in Pattern Expressions


Hi folks,

I was about to write an extensive essay in a Google Doc comment, then thought that might not be such a great medium...

TLDR: We should do normalization throughout (high F-number), and probably NFD (lower F-number).

So the current position of the Pattern Expressions language is that MATCHES performs NFC normalization, whereas other operators ("=", "!=", etc.) do not.
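
To make the distinction concrete, here's a minimal Python sketch of the two comparison behaviours (unicodedata is the standard library module):

    import unicodedata

    nfc = "caf\u00e9"    # precomposed: U+00E9 LATIN SMALL LETTER E WITH ACUTE
    nfd = "cafe\u0301"   # decomposed: 'e' + U+0301 COMBINING ACUTE ACCENT

    # Raw codepoint comparison, as "=" would do it:
    print(nfc == nfd)    # False

    # NFC-normalized comparison, as MATCHES would do it:
    print(unicodedata.normalize("NFC", nfc) ==
          unicodedata.normalize("NFC", nfd))    # True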

The stated reason is that it may be useful to match the original codepoints used.

However, this exact same reason was demolished when I suggested we might want to maintain the original form in CyBOX^WSTIX Cyber Observations, since email is the last bastion of codepages. Email mostly handles text strings as a 3-tuple of (Language, Encoding, Octets), sometimes as sequences of those tuples, and while you can [nearly] always transform that to Unicode codepoint sequences, you do lose two key items of information (see the sketch after the list):

* Language, so one can no longer render Japanese and Chinese correctly (the "CJK problem").
* Original encoding.
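
To illustrate the email case with a short Python sketch (standard library only): an RFC 2047 encoded-word carries the encoding and octets explicitly, and decoding it to a Unicode string keeps the codepoints but discards the original charset (and any RFC 2231 language tag):

    from email.header import decode_header

    # (octets, charset) pairs - the Encoding and Octets of the tuple:
    pairs = decode_header("=?iso-8859-1?q?caf=E9?=")
    print(pairs)    # [(b'caf\xe9', 'iso-8859-1')]

    octets, charset = pairs[0]
    print(octets.decode(charset))    # café - codepoints survive, charset doesn't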

The problem also exists with almost any "native" Unicode data such as UTF-8, but to a lesser degree. Unicode-compliant codepoint sequence handling allows applications to freely translate between equivalent forms.

Consider filenames. On Linux, filenames are not normalized, although Linux tends to use composed characters throughout (so, probably, NFC), which makes this problem less significant. On the other hand, filenames can be ISO-Latin-15 if the whim takes you, and CyBOX^WCyber Observations doesn't support that, so we cannot represent that data at all. Supporting non-normalized data doesn't help here - we don't know which normalization form the ISO-Latin-15 octets would translate to.

On Windows, filenames are always unpacked to UCS-2, but the OS leaves normalization alone. Applications appear to routinely perform either NFC or NFKC - I'm not sure which - but they needn't.

On a Mac, however, it's always NFD - all filename accesses are normalized at the OS level.

If that's not all bad enough for you, the Pattern Expressions and the STIX objects are all UTF-8, and either might be stored in a database somewhere - which will almost certainly perform normalization. It's hard to tell which form, but it's likely to be NFC or NFD, since those are fully equivalent (whereas the K forms are not).

So accessing data as specifically un-normalized Unicode is unlikely to work in many cases.

If you want to access the original form used by an attacker - and I agree that this seems useful (I even commented to that effect against the CyBOX^WWhatever email objects) - then you'll need to preserve the original octets as a binary object, not a Unicode string, and match them encoding and all. This would let you match, for example, the difference between café and café, which will likely have been normalized despite my best efforts: 'caf\xc3\xa9' and 'cafe\xcc\x81' are the UTF-8 encodings of the NFC and NFD forms respectively. Note that there exist many non-canonical forms of some strings; Japanese in particular decomposes most ideograms into a number of constituent codepoints, and using these in a non-canonical ordering might also be significant.
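
A sketch of what that looks like in Python - matching on the preserved octets keeps the distinction, whereas decoding and normalizing collapses it:

    import unicodedata

    nfc_octets = b"caf\xc3\xa9"     # UTF-8 of the NFC form
    nfd_octets = b"cafe\xcc\x81"    # UTF-8 of the NFD form

    # As preserved binary, the attacker's original form is distinguishable:
    print(nfc_octets == nfd_octets)    # False

    # Decoded to Unicode and normalized, the distinction disappears:
    a = unicodedata.normalize("NFC", nfc_octets.decode("utf-8"))
    b = unicodedata.normalize("NFC", nfd_octets.decode("utf-8"))
    print(a == b)    # True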

If we do want to do this, however, we also need to decide how to handle:
* Email, where we've already thrown away that information.
* Filesystems, where in some cases the applications will normalize, and in others the filesystem will normalize, and in some cases nobody normalizes.

On the other hand, if we decide that normalizing Unicode is much better for our sanity, then we need to pick a normalization form.

NFC has been the general favourite for years, because it matches how ISO-Latin-1 handles accented characters; in particular the é of "café" is a single codepoint ("LATIN SMALL LETTER E WITH ACUTE", U+00E9), like the ISO charsets of yore. But to normalize to NFC, you need to first decompose everything, then canonically compose.
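
In Python, for instance, you can watch that decompose-then-compose step collapse the sequence back to a single codepoint:

    import unicodedata

    s = "cafe\u0301"    # 5 codepoints, decomposed
    nfc = unicodedata.normalize("NFC", s)
    print(len(s), len(nfc))          # 5 4
    print(unicodedata.name(nfc[3]))  # LATIN SMALL LETTER E WITH ACUTE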

NFD, on the other hand, is closer to the "Unicode way", and decomposes all characters to their parts. This means you end up with "cafe" followed by a combining acute accent - neat, because if you search for "cafe" you'll find it, though this is something of an edge case. It is, however, much faster to implement (you can - by which I mean even I can - implement it at the UTF-8 level without decoding to UCS-4, which makes it even faster).
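
The "cafe" search falls out naturally; a quick Python check:

    import unicodedata

    s = "caf\u00e9"    # NFC form: "cafe" is not a substring
    print("cafe" in s)                                  # False
    print("cafe" in unicodedata.normalize("NFD", s))    # True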

Which you want to pick is largely up to you... However, in wildcard strings it might be safer to use NFD. I suspect that "_" and/or "%" may compose under some circumstances. Certainly 'caf_\xcc\x81' gets represented as caf_́ quite happily, and wouldn't match an NFC string. I can't find any sequences that canonically compose, mind, so this might not be an issue.
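
A quick check (Python again) supports that: underscore plus a combining acute has no precomposed form, so NFC leaves it alone:

    import unicodedata

    s = "caf_\u0301"    # '_' followed by U+0301 COMBINING ACUTE ACCENT
    print(unicodedata.normalize("NFC", s) == s)    # True: nothing composes
    print(len(s))                                  # 5: the accent stays a separate codepoint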

Dave.
--

Dave Cridland

+448454681066

Surevine

Participate | Collaborate | Innovate

Surevine Limited, registered in England and Wales with number 06726289. Mailing Address : PO Box 1136, Guildford GU1 9ND
If you think you have received this message in error, please notify us.

