Subject: Re: [cti] Unicode Normalization in Pattern Expressions


Interesting - it's saying to perform that normalization on the regular expression, mind, but it's unclear why (probably in order to avoid [é] matching two characters instead of one, but it'd be nice to know).

As I say, I'm much less concerned about which normalization form, NFC or NFD, is used than I am that *some* normalization form is always used.

I suspect that NFC would have advantages with Japanese composable characters - at least one Japanese script uses very heavy composition, which might lead to confusion with regular expression and substring searches. Let's call me convinced that NFC is the right solution.
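
A quick Python sketch of the hazard (my illustration, using the stdlib unicodedata and re modules, nothing from the spec):

    import re
    import unicodedata

    nfd = unicodedata.normalize("NFD", "\u00e9")  # 'e' + U+0301: two codepoints
    nfc = unicodedata.normalize("NFC", "\u00e9")  # the single codepoint U+00E9

    # Under NFD the class [é] is really [e\u0301], so a bare 'e' matches it:
    print(re.match("[" + nfd + "]", "e"))  # <re.Match ...> -- surprising
    print(re.match("[" + nfc + "]", "e"))  # None: only the composed é matches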


On 27 Oct 2016 18:54, "Jason Keirstead" <Jason.Keirstead@ca.ibm.com> wrote:

I am just getting up to speed on all of this - but I would recommend we do NFC normalization. This is because it is how Unicode says to do regular expressions. Link:

http://www.unicode.org/reports/tr18/#Hex_Notation_and_Normalization

      1.1.1 Hex Notation and Normalization

      The Unicode Standard treats certain sequences of characters as equivalent, such as the following:

      Row 1 (u + grave):  U+0075 ( u ) LATIN SMALL LETTER U + U+0300 ( ◌̀ ) COMBINING GRAVE ACCENT
      Row 2 (u_grave):    U+00F9 ( ù ) LATIN SMALL LETTER U WITH GRAVE

      Literal text in regular expressions may be normalized (converted to equivalent characters) in transmission, out of the control of the authors of that text. For example, a regular expression may contain a sequence of literal characters 'u' and grave, such as the expression [aeiou◌̀◌́◌̈] (the last three characters being U+0300 ( ◌̀ ) COMBINING GRAVE ACCENT, U+0301 ( ◌́ ) COMBINING ACUTE ACCENT, and U+0308 ( ◌̈ ) COMBINING DIAERESIS). In transmission, the two adjacent characters in Row 1 might be changed to the different expression containing just one character in Row 2, thus changing the meaning of the regular expression. Hex notation can be used to avoid this problem. In the above example, the regular expression should be written as [aeiou\u{300 301 308}] for safety.

      A regular expression engine may also enforce a single, uniform interpretation of regular expressions by always normalizing input text to Normalization Form NFC before interpreting that text. For more information, see UAX #15, Unicode Normalization Forms [UAX15].

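To make the hex-notation advice concrete, a minimal Python sketch (mine, not from the report; Python's re takes \uXXXX escapes rather than the \u{...} syntax UTS #18 shows):

    import re

    # Escapes survive any normalization of the pattern text in transit;
    # the combining marks remain separate entries in the class.
    pattern = re.compile(r"[aeiou\u0300\u0301\u0308]")
    print(pattern.findall("cafe\u0301"))  # ['a', 'e', '\u0301']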


TL;DR - Either leave it open to interpretation and require regex authors to be explicit, or use NFC. The text of the TS does not actually allow any other form.

It's also therefore how the Unicode reference implementations ICU and ICU4J do it, so I doubt we want to deviate from that, lest people have to go off and write their own Unicode-compliant regex engines.


-
Jason Keirstead
STSM, Product Architect, Security Intelligence, IBM Security Systems
www.ibm.com/security | www.securityintelligence.com

Without data, all you are is just another person with an opinion - Unknown



From: Dave Cridland <dave.cridland@surevine.com>
To: cti@lists.oasis-open.org
Date: 10/27/2016 05:12 AM
Subject: [cti] Unicode Normalization in Pattern Expressions
Sent by: <cti@lists.oasis-open.org>




Hi folks,

I was about to write an extensive essay in a Google Doc comment, then thought that might not be such a great medium...

TLDR: We should do normalization throughout (high F-number), and probably NFD (lower F-number).

So the current position of the Pattern Expressions language is that MATCHES performs NFC normalization, whereas other operators ("=", "!=", etc.) do not.

The stated reason is that it may be useful to match the original codepoints used.

However, this exact same reason was demolished when I suggested we might want to maintain the original form in CyBOX^WSTIX Cyber Observations, since email is the last bastion of codepages. Email, for example, mostly handles text strings as a 3-tuple of (Language, Encoding, Octets), sometimes as sequences of those tuples, and while you can [nearly] always transform that to Unicode codepoint sequences, you do lose two key items of information:

* Language, so one cannot now render Japanese and Chinese correctly (the "CJK problem").
* Original encoding.

The problem also exists with almost any "native" Unicode data such as UTF-8, but to a lesser degree. Unicode-compliant codepoint sequence handling allows applications to freely translate between equivalent forms.

Consider filenames. On Linux, filenames are not normalized, although Linux tends to use composed characters throughout (so, probably, NFC), so this problem isn't very significant. On the other hand, filenames can be ISO-Latin-15 if the whim takes you, and CyBOX^WCyber Observations doesn't support that, so we cannot represent that data. Supporting non-normalized data doesn't help here: we don't know which normalization form the ISO-Latin-15 data would translate to.

On Windows, filenames are always unpacked to UCS-2, but the normalization is left alone. Applications appear to routinely perform either NFC or NFKC, I'm not sure which, but they needn't.

On a Mac, however, it's always NFD - all filename accesses are normalized at the OS level.

If that's not all bad enough for you, the Pattern Expressions and the STIX objects are all UTF-8, and either might be stored in a database somewhere - which will almost certainly perform normalization. It's hard to tell which form, but it's likely to be NFC or NFD, since those preserve canonical equivalence (whereas the compatibility K forms are lossy).

So accessing data as specifically un-normalized unicode is unlikely to work in many cases.

If you want to access the original form used by an attacker - and I agree (and even commented against the CyBOX^WWhatever email objects) that this seems useful - then you'll need to preserve the original octets used as a binary object, not a Unicode string, and match them encoding and all. This would let you match, for example, the difference between café and café, which will likely have been normalized despite my best efforts: 'caf\xc3\xa9' and 'cafe\xcc\x81' are the UTF-8 encodings of the NFC and NFD forms respectively. Note that there exist many non-canonical forms of some strings; Japanese in particular decomposes most ideograms into a number of constituent codepoints, and using these in a non-canonical ordering might also be significant.
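
For the record, those two byte sequences check out; a short Python verification (my own illustration):

    nfc = "caf\u00e9"    # composed: é is the single codepoint U+00E9
    nfd = "cafe\u0301"   # decomposed: 'e' then U+0301 COMBINING ACUTE ACCENT

    print(nfc.encode("utf-8"))  # b'caf\xc3\xa9'
    print(nfd.encode("utf-8"))  # b'cafe\xcc\x81'
    print(nfc == nfd)           # False: as codepoint sequences they differ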

If we do want to do this, however, we also need to decide how to handle:
* Email, where we've already thrown away that information.
* Filesystems, where in some cases the applications will normalize, and in others the filesystem will normalize, and in some cases nobody normalizes.

On the other hand, if we decide that normalizing Unicode is much better for our sanity, then we need to pick a normalization form.
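
Whichever form we pick, equality agrees once both sides are normalized; a small Python sketch (normalized_equals is my hypothetical helper, not spec text):

    import unicodedata

    def normalized_equals(a: str, b: str, form: str = "NFC") -> bool:
        """Compare two strings after applying the same normalization form."""
        return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

    # True either way; the choice of form is about ergonomics and speed:
    print(normalized_equals("cafe\u0301", "caf\u00e9", "NFC"))  # True
    print(normalized_equals("cafe\u0301", "caf\u00e9", "NFD"))  # True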

NFC has been the general favourite for years, because it matches how ISO-Latin-1 handles accented characters; in particular the "LATIN SMALL LETTER E WITH ACUTE" (U+00E9) of "café" is a single codepoint, like the ISO charsets of yore. But to normalize to NFC, you need to first decompose everything, then canonically compose.

NFD, on the other hand, is closer to the "unicode way", and decomposes all characters to their parts. This means that you end up with "cafe", followed by a combining acute accent - neat, because if you search for "cafe" you'll find it, though this is something of an edge case. It is, however, much faster to implement (you can, by which I mean even I can, implement it at the UTF-8 level without decoding to UCS-4, which makes it even faster).
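
The "cafe" substring point is easy to demonstrate (again a Python sketch of my own):

    import unicodedata

    s = "caf\u00e9"                        # NFC: single composed é
    nfd = unicodedata.normalize("NFD", s)  # 'cafe' plus a combining acute

    print("cafe" in s)    # False: the composed é hides the plain 'e'
    print("cafe" in nfd)  # True: NFD exposes the base letter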

Which you want to pick is largely up to you... However, in wildcard strings it might be safer to use NFD. I suspect that "_" and/or "%" may compose under some circumstances. Certainly 'caf_\xcc\x81' gets represented as caf_́ quite happily, and wouldn't match an NFC string. I can't find any sequences that canonically compose, mind, so this might not be an issue.
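
One way to check that worry (my sketch; it only shows that '_' plus a combining acute has no precomposed form to collapse into):

    import unicodedata

    candidate = "caf_\u0301"  # '_' followed by U+0301 COMBINING ACUTE ACCENT
    # NFC leaves it untouched: there is no precomposed '_-with-acute'.
    print(unicodedata.normalize("NFC", candidate) == candidate)  # True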

Dave.
--

Dave Cridland

phone  +448454681066
email  dave.cridland@surevine.com
skype  dave.cridland.surevine

Participate | Collaborate | Innovate

Surevine Limited, registered in England and Wales with number 06726289. Mailing Address : PO Box 1136, Guildford GU1 9ND
If you think you have received this message in error, please notify us.



