
Subject: Re: [office-formula] Question on character classes

From: Eike Rathke <erack@sun.com>
To: office-formula@lists.oasis-open.org
Date: Tue, 22 Dec 2009 18:25:11 +0100

Hi Patrick,

On Monday, 2009-12-21 08:42:19 -0500, Patrick Durusau wrote:

> Eike Rathke wrote:
>> On Saturday, 2009-12-19 11:34:34 -0500, Patrick Durusau wrote:
>>   
>>> Whirring away on untangling the portability portions but curious
>>> about the use of XML character classes.
>>>
>>> Makes sense if we are talking about XML text, makes less sense if we
>>> are talking about strings in general. Just use Unicode character
>>> classes with the correct references.
>>>
>>> Was there some overt reason for choosing XML character classes?

The reason was that they were already defined in a standard.

>> What exactly are you talking about? Is this about LetterXML, DigitXML,
>> ... in the syntax definitions, referring to XML10
>> http://www.w3.org/TR/REC-xml/ appendix B Character Classes?
>>
>>   
> Yes.
>> What would be the equivalent Unicode classes?
>>
>>   
> Well, for DigitXML that would be Decimal digits in the Unicode Character
> Database.

Unicode category 'Nd', yes.
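
As a rough illustration only (this uses Python's standard unicodedata
module, and the function name is made up for the example, nothing from
the spec), membership in 'Nd' can be tested per character:

```python
import unicodedata

def is_decimal_digit(ch: str) -> bool:
    """Return True if the single character ch has general category Nd."""
    return unicodedata.category(ch) == "Nd"

# ASCII digit, Arabic-Indic digit three, fullwidth digit five,
# superscript two (category No, so not a decimal digit), and a letter.
samples = ["7", "\u0663", "\uFF15", "\u00B2", "x"]
results = {s: is_decimal_digit(s) for s in samples}
```

The result depends on the Unicode Character Database version shipped
with the Python build, so it need not match the DigitXML ranges exactly.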

> I haven't compared the latest DigitXML to Decimal digits in the most
> recent UCD so there may not be a significant difference.
>
> My main concern was if we are talking about supporting the full range of
> characters in Unicode (or at least providing implementations with that
> option) then rather than citing the XML standard and using Unicode by
> indirection, we could also simply cite the Unicode standard.

Fine with me in general. Thinking about it, that may even be more accurate.
I'm not sure if, for example, LetterXML encompasses all letter
characters that may be allowed in an identifier. We'd have to define the
classes we use. It should be sufficient to do this in Unicode categories
(Nd, Ll, Lu, ...) and not list the entire sets of ranges as it is done
in XML10. If, for example, we could say

Identifier ::= NameStartCharacter NameCharacter*
NameStartCharacter ::= (Unicode characters of categories Ll, Lu, Lo, Lt, Nl) | '_'
NameCharacter ::= NameStartCharacter |
    (Unicode characters of categories Mc, Me, Mn, Lm, or Nd) | '.'

that would make things a lot easier. Is this possible?
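
To make the proposed productions concrete, here is a minimal sketch of a
checker in Python. It assumes the standard unicodedata module as the
source of general categories; the function and constant names are
invented for this example and are not part of any specification.

```python
import unicodedata

# Categories taken from the proposed grammar above.
START_CATEGORIES = {"Ll", "Lu", "Lo", "Lt", "Nl"}
CONTINUE_CATEGORIES = {"Mc", "Me", "Mn", "Lm", "Nd"}

def is_name_start(ch: str) -> bool:
    """NameStartCharacter: a letter-like category or '_'."""
    return ch == "_" or unicodedata.category(ch) in START_CATEGORIES

def is_name_char(ch: str) -> bool:
    """NameCharacter: NameStartCharacter, a mark/modifier/digit, or '.'."""
    return (
        is_name_start(ch)
        or ch == "."
        or unicodedata.category(ch) in CONTINUE_CATEGORIES
    )

def is_identifier(s: str) -> bool:
    """Identifier ::= NameStartCharacter NameCharacter*"""
    return bool(s) and is_name_start(s[0]) and all(is_name_char(c) for c in s[1:])
```

With this sketch, "_a.b1" and "Größe" are accepted while "1abc", ".x"
and "a b" are rejected. A compatibility character such as U+FB01 (LATIN
SMALL LIGATURE FI, category Ll) is accepted as a start character, since
the check goes by category alone.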

Note that the example does include compatibility characters, which are
not part of LetterXML, but have to be added to OpenFormula. It seems I
need to create yet another issue.

  Eike

-- 
Automatic string conversions considered dangerous. They are the GOTO statements
of spreadsheets.  --Robert Weir on the OpenDocument formula subcommittee's list.

