...
I hope that in the actual specification text we can be precise about
character counting. As we all know, with XML we're dealing with lexical
strings, which might include character entities, as well as parsed XML,
where there are Unicode characters. But even then there are different
conventions for dealing with composition sequences, etc. We probably
want to cite a specific Unicode normalization form to do the counting
on:
http://www.w3.org/TR/2005/WD-charmod-norm-20051027/
It looks like "Form C" is what the W3C is
recommending for processing, but I am not certain:
http://www.unicode.org/reports/tr15/tr15-25.html#Specification
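To make the ambiguity concrete, here is a minimal sketch in Python. It assumes that "counting characters" means counting Unicode code points, which is itself a choice the specification would have to make. The helper name count_chars is hypothetical, and html.unescape is only a stand-in for an XML parser resolving character references.

```python
import html
import unicodedata


def count_chars(lexical: str, form: str = "NFC") -> int:
    """Count code points after resolving character references and
    applying the given Unicode normalization form (assumed convention)."""
    # Stand-in for what an XML parser does with character references.
    parsed = html.unescape(lexical)
    return len(unicodedata.normalize(form, parsed))


# "cafe" followed by U+0301 COMBINING ACUTE ACCENT, written as a reference.
lexical = "cafe&#x301;"

print(len(lexical))                 # 11: length of the lexical string
print(count_chars(lexical, "NFD"))  # 5: "e" and the accent stay separate
print(count_chars(lexical, "NFC"))  # 4: they compose into U+00E9

# A substring at a fixed offset differs by form as well.
parsed = html.unescape(lexical)
print(unicodedata.normalize("NFD", parsed)[:4])  # "cafe", accent cut off
print(unicodedata.normalize("NFC", parsed)[:4])  # "café"
```

The same input yields three different counts depending on whether the lexical string, the decomposed form, or the composed form is counted, which is why naming one form in the specification matters.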
Note: This came up in the OpenFormula discussions, since we have
spreadsheet functions that deal with extracting substrings at given
offsets. In that case, implementations diverged enough that we were
only able to mark some functions as "normalization-sensitive", a form
of implementation-dependent behavior. I really hope that with CT, since
we're starting fresh, we can specify exactly what normalization form to
use.
Good point, and thanks for the references. I will look into them.
Best regards,
Svante