Subject: Re: [office-formula] Re: (fwd) Should OpenFormula BASE() and DECIMAL() definitions list character set?
> "Dennis E. Hamilton":
>> RE: [office-formula] Re: (fwd) Should OpenFormula BASE() and DECIMAL()
>> definitions list character set?
>>
>> Yes, I meant that encoding parameter.
>>
>> - Dennis
>>
>> Here's more thinking about that. I am doing this off the top of my head
>> without digging out a recent draft of OpenFormula, so I apologize if I am
>> covering well-worn ground:
>>
>> 1. It seems reasonable to me that OpenFormula be expressed in terms of
>> Unicode (not an encoding), which means it also doesn't have anything to
>> say about XML character entities or anything like that.

Rob Weir said:
> Yes.
>
>> I don't know enough about the OpenFormula specification to know how any
>> character-escaping (if any) is handled. But that could be kept at the
>> Unicode level, without reference to an encoding.
>
> There are several levels of "encoding":
>
> 1) Unicode level mapping of Unicode strings into bytes.
>
> 2) XML level format, including the use of numerical character entities,
> etc.
>
> 3) Any OpenFormula level escaping. For example, I think we have an
> UPPER() function to turn each character in a string to capital case. Well,
> if string parameters are delimited by quotation marks, how do we escape a
> quote literal in the string?
>
> I'm assuming we define OpenFormula at that 3rd level only, at least within
> the OpenFormula part. But in the main part we can say table:formula is an
> xsd:string, with a value that is a conforming OpenFormula expression.
> Calling it xsd:string in an XML file triggers the other constraints at
> levels 1 and 2 in the above model.

Yes, that's the intent. Actually there's a "level 0", too; most ODF
documents are compressed into a .zip archive, which is ALSO out of scope
for the formula specification.

The 2009-05-01 draft of OpenFormula discusses this, especially in
section 2:

"This specification defines OpenFormula formulas in terms of a canonical
text representation used for exchange.
OpenFormula formulas are normally defined as attributes in XML documents.
When OpenFormula formulas are attributes in XML documents, characters
shall be escaped as required by the XML specification (e.g., the
character & shall be escaped in XML attributes using notations such as
&amp;). In OpenDocument, OpenFormula formulas are stored in XML
attributes, particularly table:formula and text:formula; these XML
documents are typically compressed using the zip format, as further
described in the rest of the OpenDocument specification. These escape and
compression mechanisms are outside the scope of OpenFormula."

Because this is such a common misunderstanding, a more detailed note
follows again in Chapter 5:

"Note that formulas are typically embedded inside an XML document. When
this occurs, various characters (such as "<", ">", '"', and "&") shall be
escaped, as described in section 2.4 of the XML specification [XML10]. In
particular, the less-than symbol "<" is typically represented as "&lt;",
the double-quote symbol as "&quot;", and the ampersand symbol as "&amp;"
(alternatively, a numeric character reference can be used)."

>> 2. One consequence of taking this approach is that one has to be
>> cognizant of the fact that the formulas may be carried as attribute
>> values in XML-based implementations and they will (1) have to be
>> represented in the character-set encoding of that representation and
>> (2) appropriately contained in quotations where the extent of the
>> formula is unambiguously determined. How that's done would seem to fall
>> entirely on the implementation that carries the formulas, not
>> OpenFormula itself. ODF 1.2 might need to say something about it, in
>> its definition for table-cell formula, but probably not if XML
>> attribute-value representation rules and the use of attribute
>> value-type string are sufficient.
>
> As above. I think saying it is an attribute value of type xsd:string
> should be sufficient.
> An informative example might be useful as an
> illustration.
>
>> 3. I think it would be useful to know, in a pure-Unicode approach, what
>> minimum set of Unicode characters are required to be usable to express
>> OpenFormula and what others might be allowed (e.g., in the rules for
>> names and the expression of string-valued literals used within a
>> formula). A BNF grammar that appealed to Unicode character categories
>> could set the ceilings, but it is useful to know if there is a
>> different floor.

Defining a "minimal" set required for representation in XML, if I
understand you correctly, is less useful than you might think. Strictly
speaking, all you need is "&", "#", ";", and the digits 0-9, and you can
represent any Unicode character in an XML document. For example, "=3+2"
can be represented in an XML document as:

&#61;&#51;&#43;&#50;

Implementors who generate OpenFormula formulas with this XML
representation, when it's CLEARLY unnecessary, should be shot on sight
:-). But I can easily imagine that an implementor might choose to
represent all non-ASCII characters using XML escapes, and that would work
just fine.

OpenFormula DOES require very broad Unicode "support", at least in the
sense of exchanging values. Syntactically, constant strings can contain
_any_ Unicode characters other than character 0 (see section 5.4). (The
exception for 0 is because many implementations fail on this. In
practice, many implementations are written in C or call C libraries,
which use character 0 as the string terminator, so character 0 isn't
portable.) Now to be fair, section 4.1 makes supporting arbitrary Unicode
strings merely a "should" not a "shall", but an implementation has to at
least read them in successfully. Frankly, I can easily see that changing
to a "shall" in the future. The names of named expressions (section 5.11)
accept almost as broad a list of characters (that way, people can use
their native languages for "variable names").
So you _do_ need to support a very wide range of Unicode values for
reading/writing. But note that this "support" doesn't mean that you have
to have all the fonts, as display notation is out of scope for
OpenFormula.

--- David A. Wheeler

P.S.: To represent " inside a constant string, you double it. If this is
inside an XML attribute (the usual case), then ALL of these double-quotes
must be represented per XML requirements - usually by &quot;. Section 5.4
(Constant Strings) says this:

Constant strings are surrounded by double-quote characters; a literal
double-quote character (") as string content is escaped by duplicating
it. Note that when a formula is stored in an XML attribute, XML escaping
rules apply: thus inside an XML attribute double-quote characters shall
be escaped (e.g., as &quot;) and carriage return characters in a String
(e.g., as &#13;). A constant string, as defined by this syntax, shall be
considered to be type Text. {The syntax doesn't permit two constant
strings to be adjacent - they have to be separated by an operator, a
function parameter separator, or the like. So this is not ambiguous.}
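
As an informal illustration of the layering discussed above, here is a
small Python sketch. It is not taken from the OpenFormula draft: the
helper names (formula_string_literal, xml_attribute_value,
numeric_references, read_back) and the element and attribute names
"cell" and "formula" are made up for the example, and it only assumes
the rules quoted in this message (double the quote at the formula level,
then apply ordinary XML attribute escaping).

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def formula_string_literal(text):
    """Formula level: build a constant string, doubling each '"'."""
    if "\x00" in text:
        # Character 0 is the one character excluded from constant strings.
        raise ValueError("character 0 is not allowed in a constant string")
    return '"' + text.replace('"', '""') + '"'


def xml_attribute_value(formula):
    """XML level: escape a formula for use inside a double-quoted attribute."""
    # escape() handles &, < and >; the extra map covers the quote and the
    # whitespace characters that attribute-value normalization would alter.
    extra = {'"': "&quot;", "\r": "&#13;", "\n": "&#10;", "\t": "&#9;"}
    return escape(formula, extra)


def numeric_references(formula):
    """XML level, extreme form: every character as a decimal reference."""
    return "".join("&#%d;" % ord(ch) for ch in formula)


def read_back(attribute_text):
    """Parse a hypothetical <cell formula="..."/> element and return the formula."""
    element = ET.fromstring('<cell formula="' + attribute_text + '"/>')
    return element.get("formula")


if __name__ == "__main__":
    formula = "=UPPER(" + formula_string_literal('say "hi" & <go>') + ")"
    print(formula)
    print(xml_attribute_value(formula))
    print(numeric_references("=3+2"))
    print(read_back(numeric_references("=3+2")))
```

Both XML forms parse back to the same formula text, which is why the
choice between them can be left to the implementation that carries the
formula rather than to OpenFormula itself.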