Subject: Unicode issues
All,

We have previously discussed Unicode issues for our string functions and the W3C working draft here:

http://www.w3.org/TR/2005/WD-charmod-norm-20051027/

I posted some questions for clarification about this to their mailing list:

http://lists.w3.org/Archives/Public/www-international/2008OctDec/0004.html

It turns out that the specification does not meet our needs. After some thinking on the issues, I have written up the following for the next working draft.

A new section:

--8<--
7.1 Unicode issues

In Unicode it is possible to represent some letters by different character sequences. The process of converting Unicode strings into canonical character sequences is called normalization.

An operation is normalization-sensitive if its output(s) are different depending on the state of normalization of the input(s); if the output(s) are textual, they are deemed different only if they would remain different were they to be normalized. (Quoted from [CM]).

An XACML implementation MUST NOT perform any normalization-sensitive operations unless it has ensured that the inputs are normalized. An XACML implementation MUST behave as if each normalization-sensitive operation normalizes the string into Unicode normalization form C. An implementation MAY use some other form of internal processing as long as the externally visible results are identical to this specification.

For more information and specification of normalization forms see [UAX15].
--8<--

The references are:

[CM] Character Model for the World Wide Web 1.0: Normalization, W3C Working Draft, 27 October 2005, http://www.w3.org/TR/2005/WD-charmod-norm-20051027/, World Wide Web Consortium.
[UAX15] Davis, Mark, Unicode Standard Annex #15: Unicode Normalization Forms, Unicode 5.1, available from http://unicode.org/reports/tr15/

In the above-mentioned thread on the www-international mailing list I wrote that string equal would be defined by binary equality of the strings if encoded in a common Unicode encoding form, but I think I will stick with what we decided before, that is, "code-point collation" as defined in XQuery.

Regarding case mapping, I have added the following formulation to the existing string-normalize-to-lower-case XACML function:

"Case mapping shall be done as specified for the fn:lower-case function in [XF] with no tailoring for particular languages or environments."

[XF] is http://www.w3.org/TR/2007/REC-xpath-functions-20070123/

I also noted that the existing normalize-space XACML function had no definition of whitespace. I added (like in XQuery):

"The whitespace characters are defined in the metasymbol S (Production 3) of [XML]."

[XML] refers to http://www.w3.org/TR/2006/REC-xml-20060816/

I have added a section for Unicode security issues.

--8<--
9.3 Unicode security issues

There are many security considerations related to use of Unicode. An XACML implementation SHOULD follow the advice given in the relevant version of [UTR36].
--8<--

[UTR36] refers to http://unicode.org/reports/tr36/

Best regards,
Erik
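
To make the proposed 7.1 wording concrete, here is a minimal sketch in Python of a normalization-sensitive comparison and of whitespace stripping using the XML metasymbol S. It is only an illustration under assumptions, not part of the draft text: the helper names nfc, string_equal and normalize_space are hypothetical, and the sketch assumes that normalize-space strips leading and trailing whitespace only.

```python
import unicodedata

# Whitespace per metasymbol S (Production 3) of XML 1.0:
# space, tab, carriage return, line feed.
XML_WHITESPACE = "\u0020\u0009\u000D\u000A"


def nfc(s: str) -> str:
    """Return s in Unicode normalization form C (hypothetical helper)."""
    return unicodedata.normalize("NFC", s)


def string_equal(a: str, b: str) -> bool:
    """Compare code point by code point after normalizing both inputs.

    Hypothetical stand-in for a normalization-sensitive string
    comparison; Python's == on str already compares code points.
    """
    return nfc(a) == nfc(b)


def normalize_space(s: str) -> str:
    """Strip leading and trailing XML whitespace (assumed semantics)."""
    return s.strip(XML_WHITESPACE)


composed = "\u00e9"      # e-acute as a single code point
decomposed = "e\u0301"   # 'e' followed by a combining acute accent

# The raw strings differ, so a plain comparison is normalization-sensitive.
raw_equal = composed == decomposed
# After NFC both inputs are the same code point sequence.
normalized_equal = string_equal(composed, decomposed)
```

With these inputs raw_equal is False while normalized_equal is True, which is the kind of difference the MUST NOT clause is meant to rule out. Note that normalize_space leaves a no-break space (U+00A0) in place, since it is not part of production S.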