office-formula message

Subject: Re: [office-formula] Re: (fwd) Should OpenFormula BASE() and DECIMAL() definitions list character set?

From: "David A. Wheeler" <dwheeler@dwheeler.com>
To: "office-formula@lists.oasis-open.org" <office-formula@lists.oasis-open.org>
Date: Fri, 08 May 2009 16:15:31 -0400

> "Dennis E. Hamilton":
>> RE: [office-formula] Re: (fwd) Should OpenFormula BASE() and
>> DECIMAL() definitions list character set?
>>
>> Yes, I meant that encoding parameter.
>>
>>  - Dennis
>>
>> Here's more thinking about that.  I am doing this off the top of my head
>> without digging out a recent draft of OpenFormula, so I apologize if I am
>> covering well-worn ground:
>>
>> 1. It seems reasonable to me that OpenFormula be expressed in terms of
>> Unicode (not an encoding), which means it also doesn't have anything to
>> say about XML character entities or anything like that.

Rob Weir said:
> Yes.

>> I don't know enough about the OpenFormula specification to know how any
>> character-escaping (if any) is handled.  But that could be kept at the
>> Unicode level, without reference to an encoding.
>>
> 
> There are several levels of "encoding":
> 
> 1) Unicode level mapping of Unicode strings into bytes.
> 
> 2) XML level format, including the use of numerical character entities, 
> etc.
> 
> 3) Any OpenFormula-level escaping.  For example, I think we have an 
> UPPER() function to turn each character in a string to capital case. Well, 
> if string parameters are delimited by quotation marks, how do we escape a 
> quote literal in the string?
> 
> I'm assuming we define OpenFormula at that 3rd level only, at least within 
> the OpenFormula part.  But in the main part we can say table:formula is an 
> xsd:string, with a value that is a conforming OpenFormula expression. 
> Calling it xsd:string in an XML file triggers the other constraints at 
> levels 1 and 2 in the above model.

Yes, that's the intent. Actually there's a "level 0", too; most ODF 
documents are compressed into a .zip archive, which is ALSO out-of-scope 
for the formula specification.
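
A rough sketch of how these levels stack, in Python purely for illustration. The element name "cell", the attribute name "formula", the member name "content.xml" and the sample formula are simplified placeholders, not real ODF markup or packaging. The formula text (level 3) goes in untouched, the XML library takes care of level 2, the byte encoding is level 1, and the zip container is level 0:

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Level 3: the formula text itself (illustrative sample, assumed syntax).
formula = '=IF([.A1]<[.B1];"x & y";"z")'

# Level 2: store it as an XML attribute; the library escapes <, & and ".
cell = ET.Element("cell", {"formula": formula})

# Level 1: serialize the Unicode text to UTF-8 bytes.
xml_bytes = ET.tostring(cell, encoding="utf-8")
print(xml_bytes.decode("utf-8"))

# Level 0: compress the bytes into an in-memory zip archive.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("content.xml", xml_bytes)

# Reading undoes each level in reverse order.
with zipfile.ZipFile(io.BytesIO(buf.getvalue())) as zf:
    recovered = ET.fromstring(zf.read("content.xml")).get("formula")

print(recovered == formula)  # True
```

None of the code that handles levels 0 to 2 needs to know anything about formula syntax, which is the reason those levels can stay out of scope.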

The 2009-05-01 draft of OpenFormula discusses this, especially in section 2:
"This specification defines OpenFormula formulas in terms of a canonical 
text representation used for exchange. OpenFormula formulas are normally 
defined as attributes in XML documents. When OpenFormula formulas are 
attributes in XML documents, characters shall be escaped as required by 
the XML specification (e.g., the character & shall be escaped in XML 
attributes using notations such as &amp;). In OpenDocument, OpenFormula 
formulas are stored in XML attributes, particularly table:formula and 
text:formula; these XML documents are typically compressed using the zip 
format, as further described in the rest of the OpenDocument 
specification. These escape and compression mechanisms are outside the 
scope of OpenFormula."

Because this is such a common misunderstanding, a more detailed note 
follows again in Chapter 5:
"Note that formulas are typically embedded inside an XML document. When 
this occurs, various characters (such as "<", ">", '"', and "&") shall 
be escaped, as described in section 2.4 of the XML specification 
[XML10]. In particular, the less-than symbol "<" is typically 
represented as “&lt;”, the double-quote symbol as “&quot;”, and the 
ampersand symbol as “&amp;” (alternatively, a numeric character 
reference can be used)."


>> 2. One consequence of taking this approach is that one has to be cognizant
>> of the fact that the formulas may be carried as attribute values in
>> XML-based implementations and they will (1) have to be represented in the
>> character-set encoding of that representation and (2) appropriately
>> contained in quotations where the extent of the formula is unambiguously
>> determined.  How that's done would seem to fall entirely on the
>> implementation that carries the formulas, not OpenFormula itself.  ODF 1.2
>> might need to say something about it, in its definition for table-cell
>> formula, but probably not if XML attribute-value representation rules and
>> the use of attribute value-type string are sufficient.
>>
> 
> As above.  I think saying it is an attribute value of type xsd:string 
> should be sufficient.  An informative example might be useful as an 
> illustration.
> 
>> 3. I think it would be useful to know, in a pure-Unicode approach, what
>> minimum set of Unicode characters are required to be usable to express
>> OpenFormula and what others might be allowed (e.g., in the rules for names
>> and the expression of string-valued literals used within a formula).  A BNF
>> grammar that appealed to Unicode character categories could set the
>> ceilings, but it is useful to know if there is a different floor.

Defining a "minimal" set required for representation in XML, if I 
understand you correctly, is less useful than you might think.  Strictly 
speaking, all you need is "&", "#", ";", and the digits 0-9, and you can 
represent any Unicode character in an XML document.  For example, "=3+2" 
can be represented in an XML document as:
&#61;&#51;&#43;&#50;

Implementors who generate OpenFormula formulas with this XML 
representation, when it's CLEARLY unnecessary, should be shot on sight 
:-).  But I can easily imagine that an implementor might choose to 
represent all non-ASCII characters using XML escapes, and that would 
work just fine.
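
To make that concrete, here is a small Python sketch. The helper name to_numeric_refs and the element and attribute names are made up for this example, and nothing here comes from the OpenFormula draft. It writes characters as decimal numeric character references of the form &#N;, either for every character or only for the ones that are not plain printable ASCII, and then lets an XML parser turn them back into the original text:

```python
import xml.etree.ElementTree as ET

# ASCII characters that need escaping inside an XML attribute anyway.
XML_SPECIAL = set('&<>"\'')

def to_numeric_refs(text, keep_plain_ascii=False):
    """Write characters as decimal numeric character references (&#N;).

    With keep_plain_ascii=True, printable ASCII other than XML-special
    characters is left as-is and everything else is escaped.
    """
    out = []
    for ch in text:
        plain = (keep_plain_ascii and 32 <= ord(ch) < 127
                 and ch not in XML_SPECIAL)
        out.append(ch if plain else "&#%d;" % ord(ch))
    return "".join(out)

formula = "=3+2"
escaped = to_numeric_refs(formula)
print(escaped)  # &#61;&#51;&#43;&#50;

# The parser resolves the references when it reads the attribute value.
cell = ET.fromstring('<cell formula="%s"/>' % escaped)
print(cell.get("formula"))  # =3+2

# The more reasonable choice: escape only what is not plain ASCII.
print(to_numeric_refs('="é"', keep_plain_ascii=True))  # =&#34;&#233;&#34;
```

Either way the consumer sees the same formula text, so the choice of escaping style is invisible at the OpenFormula level.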

OpenFormula DOES require very broad Unicode "support", at least in the 
sense of exchanging values. Syntactically, constant strings can contain 
_any_ Unicode characters other than character 0 (see section 5.4).  (The 
exception for 0 is because many implementations fail on this.  In 
practice, many implementations are written in C or call C libraries, 
which use character 0 as the string terminator, so character 0 isn't 
portable.)  Now to be fair, section 4.1 makes supporting arbitrary 
Unicode strings merely a "should", not a "shall", but an implementation 
has to at least read them in successfully.  Frankly, I can easily see 
that changing to a "shall" in the future.
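
The character 0 problem is easy to reproduce. The following Python sketch uses ctypes only as a stand-in for "an implementation that hands strings to C code"; it is an illustration of the general issue, not a description of any particular spreadsheet implementation:

```python
import ctypes

text = "ab\x00cd"            # a string with character 0 in the middle
data = text.encode("utf-8")
print(len(data))             # 5

# A C-style character buffer: the bytes are all stored, but anything
# that treats it as a NUL-terminated string stops at the first zero byte.
buf = ctypes.create_string_buffer(data)
print(buf.raw)               # b'ab\x00cd\x00'
print(buf.value)             # b'ab'
```

Everything after the zero byte is silently lost as soon as the value is treated as a C string.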

The names of named expressions (section 5.11) accept almost as broad a 
list of characters (that way, people can use their native languages for 
"variable names").  So you _do_ need to support a very wide range of 
Unicode values for reading/writing.

But note that this "support" doesn't mean that you have to have all the 
fonts, as display notation is out-of-scope for OpenFormula.

--- David A. Wheeler


P.S.: To represent " inside a constant string, you double it.  If this 
is inside an XML attribute (the usual case), then ALL of these 
double-quotes must be represented per XML requirements - usually by 
&quot;.  Section 5.4 (Constant Strings) says this:

Constant strings are surrounded by double-quote characters; a literal 
double-quote character (") as string content is escaped by duplicating 
it. Note that when a formula is stored in an XML attribute, XML escaping 
rules apply: thus inside an XML attribute double-quote characters shall 
be escaped (e.g., as &quot;) and carriage return characters in a String 
(e.g., as &#x0D;). A constant string, as defined by this syntax, shall 
be considered to be type Text.


{The syntax doesn't permit two constant strings to be adjacent - they 
have to be separated by an operator, a function parameter separator, or 
the like.  So this is not ambiguous.}
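
A minimal sketch of the two escaping steps, in Python. The function names formula_string and parse_formula_string are invented for this example, and the code covers only a single constant string, not the full formula grammar. The first step doubles any double-quote at the OpenFormula level; the second applies XML attribute escaping on top of that:

```python
from xml.sax.saxutils import escape

def formula_string(text):
    """Format text as a constant string: wrap in quotes, double any '"'."""
    if "\x00" in text:
        raise ValueError("character 0 is not allowed in a constant string")
    return '"' + text.replace('"', '""') + '"'

def parse_formula_string(literal):
    """Inverse of formula_string for one complete constant string."""
    if len(literal) < 2 or literal[0] != '"' or literal[-1] != '"':
        raise ValueError("not a constant string")
    body = literal[1:-1]
    # Once doubled quotes are removed, no lone quote may remain.
    if '"' in body.replace('""', ''):
        raise ValueError("unescaped double-quote inside constant string")
    return body.replace('""', '"')

# Step 1: OpenFormula-level escaping.
formula = "=UPPER(" + formula_string('say "hi"') + ")"
print(formula)   # =UPPER("say ""hi""")

# Step 2: XML-level escaping for use inside an attribute value.
escaped = escape(formula, {'"': "&quot;", "\r": "&#x0D;"})
print(escaped)   # =UPPER(&quot;say &quot;&quot;hi&quot;&quot;&quot;)

print(parse_formula_string('"say ""hi"""'))  # say "hi"
```

The two steps are independent: the doubling belongs to OpenFormula, while the &quot; replacement belongs to XML and is undone by the XML parser before any formula parser sees the text.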
