[Date Prev] | [Thread Prev] | [Thread Next] | [Date Next] -- [Date Index] | [Thread Index] | [List Home]
Subject: Re: [office-formula] Grammar
Below is a new draft for formula syntax. What do you think? I created what's below by combining the OpenFormula syntax specification and the grammar posted by Eike Rathke. Technically, the two syntaxes are very close (expressions meeting one would typically meet the other), but by combinining the features of each presentation, we'll hopefully end up with a combination that has the advantages of both. It also needs to be a really clean and clear presentation, since if it's clear, it's less likely to have errors in it. But hopefully, this will help us get there. This is a little awkward without a Wiki, but we'll do our best. Stuff stolen from OpenFormula: * Depend on data typing, not syntax, to determine if certain operations illegal. This makes the syntax much more flexible in terms of handling new types (it allows existing operations to handle new types). In particular, see "ReferenceOp" - by doing things this way, it's VERY easy to have functions return references and have them interact with reference constants, with extremely regular rules. * Intersection is ALWAYS '!', cell concatenation is always '~'. NLF is trivially handled by having NLF labels be NamedExpressions that happen to return references. * Define whitespace handling separately; this greatly simplifies the rules, and it's how an implementor is likely to do lexical analysis anyway. * Specify operator precedence * Don't try to use "(Expression - String)" to limit applications; it doesn't really work :-). * Empty parameters (to support Gnumeric, Excel, SheetToGo) * Include a syntax for whole-row, whole-column. * NamedExpression may have an external Source. Stuff stolen from Eike Rathke's grammar: * Use these naming conventions and organization * Include namespace identification in grammar * Initial "=" for forced recalculation * Just merge into a single syntax (instead of separating by level) * Use CamelCase * Use spaces in the EBNF for readability. * Interleave syntax and discussion more freely. Other stuff: * String definition changed - now allows all control chars except ASCII 0. * Functions _always_ take a ParameterList, which has 0 or more parameters. In practice, functions don't ask "did I get a parameterlist"... they ask "how many parameters did I get" (with 0 being a possible answer for some functions). This makes the syntax match actual practice. The text below looks a lot like Eike's proposal, except in the section on (inline) cell references... where it doesn't :-). ========================================================== Formula Syntax Any implementation MUST support the syntax as described below (except for the noted exceptions). Any reading implementation must be able to read /at least/ the syntax defined in this section (it may accept various extensions), and any writing implementation must generate this format when a user creates data that can comply with it. This syntax is defined using the BNF notation of XML version 1.1, as described at http://www.w3.org/TR/2004/REC-xml11-20040204/#sec-notation. Note that each syntax rule is defined using "::=". Text is interspersed to explain these syntax rules and their implications. Note that formulas are typically embedded inside an XML document. When this occurs, various characters (such as "<", ">", '"', and "&") must be escaped, as described in the XML specification. In particular, "<" becomes <, '"' becomes ", and "&" becomes &. FormulaContent ::= Namespace Formula Namespace ::= Namespace_in_XML ':' Namespace_in_XML ::= http://www.w3.org/TR/REC-xml-names The namespace tells the reading application how to treat formula content written by a specific application or an application conforming to a certain dialect of this formula language. For OpenDocument files written by OpenOffice.org versions prior to this specification (including versions 2.0, 2.0.1, 2.0.2, and 2.0.3), it is xmlns:oooc="http://openoffice.org/2004/calc". TODO: The namespace for this specification is ???. Formula ::= '=' ForceRecalc? Expression ForceRecalc ::= '=' NOTE: I changed "Expression Expression*" to just "Expression" here. If you want a list of expressions, that should turn into a single expression anyway. If the second '=' is present, this is a "forced recalculate" hint. Typical implementations optimize what is recalculated; they typically only recalculate what is displayed and what those values depend on (transitively). A forced recalculate hints that this formula should be recalculated whenever recalculation takes place, even if it isnt' apparantly needed (e.g., due to side-effects of macros). Expression ::= Number | String | Array | PrefixOp Expression | Expression PostfixOp | Expression InfixOp Expression | '(' Expression ')' | FunctionName '(' ParameterList ')' | NamedExpression | Reference Number ::= WrittenNumber | '.' [0-9]+ ([eE] [-+]? [0-9]+)? WrittenNumber ::= [0-9]+ ('.' [0-9]+)? ([eE] [-+]? [0-9]+)? Numbers are written according to the "C" or en-US locale; the '.' dot is used as the decimal separator and the group (AKA thousand) separators are not written. Optional "E" or "e" denote scientific notation. Syntactically, negative numbers are simply numbers with a prefix "-" operator; implementations MAY and typically do optimize this operator by performing the negation and storing the negative number directly. Readers SHOULD be able to read a fraction that starts with '.' (without a leading zero). Writers MUST write numbers with a leading digit (the WrittenNumber format), and MAY NOT write numbers with a leading '.'. String ::= '"' ([^"#x00] | '""')* '"' A literal double-quote character (") as string content is escaped by duplicating it. Note that since a formula is normally stored as an XML attribute, all double-quotes are written as their entity ". NOTE: Changed to allow all control characters except ASCII 0. Array ::= TODO, which separators? PrefixOp ::= '+' | '-' These are the unary prefix operators, prefix plus (a no-op) and prefix minus (negation). PostfixOp ::= '%' This is the unary percentage operator, dividing the preceding expression by 100. InfixOp ::= ArithmeticOp | ComparisonOp | '&' | ReferenceOp The '&' ampersand is the string concatenation operator. Note that since a formula is typically stored as an XML attribute, an '&' ampersand is written as the entity & ArithmeticOp ::= '+' | '-' | '*' | '/' | '^' These are Addition, Subtraction, Multiplication, Division, and Exponentiation (power). ComparisonOp ::= '=' | '<>' | '<' | '>' | '<=' | '>=' These are EqualTo, UnequalTo, LessThan, GreaterThan, LessThanOrEqualTo, GreaterThanOrEqualTo. ReferenceOp ::= IntersectionOp | CellConcatenationOp | RangeOp IntersectionOp ::= '!' CellConcatentationOp ::= '~' RangeOp ::= ':' These are operations on references: Intersection (which some user interfaces display as a space), cell concatenation (which some user interfaces display as the function parameter seperator), and range extension (given two ends, creates a range that covers both ends). Note that since they are defined as general operators, either or both sides may be the output of a function. The operations are defined on their own lines, in case this is helpful in describing "alternative" syntaxes. FunctionName ::= Identifier Identifier ::= LetterXML (LetterXML | DigitXML | '_' | '.' | CombiningCharXML)* LetterXML ::= http://www.w3.org/TR/REC-xml/#NT-Letter DigitXML ::= http://www.w3.org/TR/REC-xml/#NT-Digit CombiningCharXML ::= http://www.w3.org/TR/REC-xml/#NT-CombiningChar Note that the function names defined in this specification use English text, not a translated UI representation. Thus predefined function names actually meet the stricter definition [A-Za-z] [A-Za-z0-9_.]* However, letter characters as defined for an Identifier are allowed, since user-defined functions may use such characters. Function names are case-insensitive, so SUM and Sum are the same function. However, traditionally spreadsheet function names are written in all upper case, so this tradition is followed in this specification. Implementations SHOULD write function names in all upper case. *Rationale:* Function names are often displayed in all upper case by both implementations, showed this way in documentation, and saved in uppercase, owing to the influence of the first spreadsheet program (VisiCalc). By saying implementations SHOULD save in all uppercase, this increases the likelihood that implementations can trivially write back formulas and, if the formulas are unchanged, produce an identical result. This can be helpful to systems that depend on detecting differences, by minimizing the number of unnecessary differences. Some functions always produce the same value, and are thus constants; PI() and TRUE() are examples. *Note:* Microsoft Excel accepts empty parameters in any position; OpenOffice 1.1.3 does not, so the current syntax above doesn't either. See below (level 3) for how this rule can be relaxed. Typical implementations will have many built-in functions, and most implementations also support one or more ways to create user-defined functions. *Rationale:* Excel uses the "," as the function parameter separator, but OpenFormula uses ";" instead. Many locales use "," as the decimal separator; using the semicolon as the parameter separator eliminates confusion and the risk of incorrect implementation. ParameterList ::= /* empty */ | Parameter ( Separator EmptyOrParameter )* | Separator EmptyOrParameter /* First param empty */ ( Separator EmptyOrParameter )* EmptyOrParameter ::= /* empty */ | Parameter Parameter ::= Expression Separator ::= ';' Functions may be given a parameter list. Note that given this definition, SIN() has a parameter list with 0 parameters, NOT one parameter that happens to be empty. TODO: If there is a need for passing 1 parameter that is empty, could define a function like EMPTY() for representing an empty parameter. Reference ::= '[' Source? RangeAddress ']' RangeAddress ::= CellAddress ( ':' CellAddress )? | SheetName "." ColumnLabel ':' SheetName "." ColumnLabel | SheetName "." RowLabel ':' SheetName "." RowLabel CellAddress ::= SheetName "." ColumnLabel RowLabel SheetName ::= /* empty */ | '$' ( [A-Za-z0-9]+ | ("'" ([^']|"''")+ "'" ) ) ColumnLabel ::= [A-Z]+ RowLabel ::= [1-9] [0-9]* References always begin with '[' in the storage format; this immediately disambiguates cell addresses from function names and named expressions. User interfaces need not (and often do not) display the surrounding [...]. Sheetnames include ' characters by doubling them. Column labels MUST be in uppercase; this makes it easier to distinguish between column labels and many named expressions. The syntax above supports whole-row and whole-column refernces, but they must always have a range marker. Note that TODO: Ideally we'd allow Sheetnames to not start with "$", but then if given ['..., is that a start of a SOURCE or a SHEETNAME? TODO: R1C1 and relative sheetnames for CellAddress? TODO: Errors. Note: See OpenDocument 1.0 section 8.3.1's definition of cellRangeAddress and cellAddress. The "$" markings note that the following component is absolute, otherwise it is relative; these markings have no effect on calculation (they are intended for use when copying or moving formulas). *Rationale:* Cell addresses in OpenFormula begin with "[" and end with a "]"; this makes parsing simpler, faster, and more reliable. Cell addresses are specified in A1 notation, not an R1C1 notation, as required by OpenDocument 1.0 section 8.1.3 subsection "Formula". Not using R1C1 notation can in some cases cause an increase in compressed file size (because copied formulas are shown differently and thus do not compress as well). However, A1 notation is much easier for humans to understand, so using A1 format is likely to increase reliability (because it is more likely to be correctly generated and interpreted). If this is a serious problem, Array formulas can usually be used to achieve the same results. Both R1C1 and A1 /could/ be allowed, but then different spreadsheets are likely to generate different characters for the same cell, and thus create "differences" that do not exist even in simple spreadsheets. Only uppercase characters are allowed, for the same reason. *Note:* Typical spreadsheet displays will often /not/ display or require input of the square brackets. OpenDocument 1.0 section 8.3.1 also defines a cellRangeAddressList (space-separated ranges) which is used by several OpenDocument constructs. OpenFormula does not use cellRangeAddressList; cell concatenation can do the same thing when this is necessary. *Rationale:* Supporting whole-rows and whole-columns is useful; Excel 2003 supports this. OpenOffice 2.0 does not directly support it, but it fakes it very well by translating a "whole row" or "whole column" entry into a specific range. As long as spreadsheets only go from A1 to IV65536, there is no difference, and OpenOffice.org 2.0's approach is probably fine for many people. However, spreadsheet implementations of the future are likely to support larger limits, at which point these spreadsheets will quietly fail. It would be wiser to directly support these selectors so that large spreadsheets of the future would not silently fail. NamedExpression ::= Source? Identifier Source ::= "'" URI "'" "#" A NamedExpression references another expression, possibly in a completely different spreadsheet. TODO: In Excel, NamedExpressions can be sheet-specific or global, but there's no syntax to reference one or the other (a potential problem for external references in particular). Should we do better? The URI in the source_location MUST support absolute URIs (URLs are URIs). Implementations SHOULD support relative URLs, which can be distinguished because they do not begin with [A-Za-z]+ ":". If a relative URL looks like an absolute URL, it MUST save relative URLs using a "./" prefix, e.g., './mystuff.xls'. Implementations MUST support they file:/// prefix; they MAY support others, but beware of the security ramifications (an attacker could use this ability to track someone when they download a file to extract data from it, or use this to send malicious data). Implementations SHOULD ask the user before loading data from any external data sources. TODO: Discuss IRIs? TODO: exceptions? Whitespace Whitespace (space, tab, newline, and carriage return) is ignored in the default formulas syntax, except in the contents of string constants. A line break is normally represented by a single ASCII character 10 (\n); when embedded in an XML document this is typically represented as " ". Implementations SHOULD retain whitespace entered by the original formula creator, and SHOULD NOT add additional whitespace unless directed to do so during the process of editing a formula. *Note:* In Microsoft Excel's display format, the space is used as the intersection operator. This is very confusing, so the OpenFormula exchange format uses "!" instead for intersection. [edit </index.php?title=Expression_Syntax&action=edit§ion=9>] Precedence The operator have the following precedence, from lowest to highest: Associativity Operator(s) Comments left <, =, >, <=, >=, <>, !=, == Less than, equal to, greater than, less than or equal to, greater than or equal to, not equal to, not equal to (alternative), equal to (alternative). left & Binary operation string concatenation. Note that unary (prefix) + and "#147; has a different priority. Note that "&" must be escaped when included in an XML document, typically as "&". left +,- Binary operations add, subtract. Note that unary (prefix) + and "#147; have a different priority. left *,/ Multiply, divide. Division does not truncate, so 1/2 is equal to 0.5. right ^ Power (2^3 is 8). left % Postfix unary operator % (divide by 100). Note that this is legal with expressions (e.g., [.B1]%) and it can be duplicated (1), and it does /not/ change the meaning of other operations such as "+". right +,- Prefix unary operators, e.g., -5 or -[.A1]. Note that these have a difference precedence than add and subtract. left ! Cell address intersection ([.A1:.C4]![.B1:.B5] is [.B1:.B4]). left ~ Reference union. TODO: Precedence of ":" (when not inside [...]). Precedence can be overridden by using parentheses, so "2+3*4" computes 14 while "(2+3)*4" computes 20. Implementations' user interfaces may display these operators differently or with a different precedence, but when exchanging formulas they /MUST/ use the precedence rules here. Implementations /SHOULD/ retain and regenerate "unnecessary" parentheses and white space, since these are added by people to improve readability. The precedence of unary prefix "-" compared to "^" has initiated more than a little research. In this specification prefix "-" has higher precedence than "^", because "-2^2" is "4" in Microsoft Excel, OpenOffice.org, Gnumeric, and probably many others. This is not universal; prefix "-" has a /lower/ precedence on Lotus 1-2-3, Quattro Pro, and Excel's own Visual Basic (Walkenbach, 2004, pg. 579), so these products will need to insert and remove parentheses when reading/writing expressions in OpenFormula where this matters. However, the /vast/ majority of spreadsheet implementations used today treat "-" as higher precedence, so as usual we identified and documented common conventions. Using a different precedence would create a high likelihood of error and additional work in implementations, for no good reason. Ideally OpenFormula could also be used for attribute draw:formula as defined in OpenDocument 1.0 section 9.5.5 and attribute anim:formula as defined in 13.3.2; neither of these attributes includes an exponentiation operator "^", so this issue of precedence causes no problem. (Note that they require "," as the function separator, and have other requirements, as discussed later in this document). Prefix "-" and "+" are right-associative, not non-associative, because "=--[.B3]" is legal (and it converts B3 to a number, so it can have an effect). Cell address intersection has an even higher precedence, so that a unary minus in front an intersection will work correctly. *Rationale:* Walkenbach gives Microsoft Excel 2003's precedence levels as (lowest to highest, note that the book gives the reverse order) comparison (such as "="), "&", "+" and "-", "*" and "/", "^", "%", and unary "-" (negation). (Walkenbach, 2004, pg 38). By having the precedence be the same as Excel, it simplifies transition. *Note:* this format is intentionally similar to traditional presentations of spreadsheet formulas, which reduces the likelihood of error or misunderstanding. Although the user interface is not specified here, the representation is intentionally chosen so that formulas can "round trip" to this format and back without loss using typical formula representations. It would be possible to replace some or all uses of "%" with "/100", and to replace "^" with POWER(), but this would cause the display format to change once it was saved and reloaded. MISCELLANEOUS TEXT - stuff yet to be dealt with... Future versions could handle OpenDocument's anim:formula and draw:formula. They will need to add variable syntax such as $0, $a, and $. In addition, the issue of using "," instead of ";" as a function parameter separator will need to be resolved to accept those formulas. Future versions may also accept constant array specifiers, e.g., "{" list "}" - and in that list, there must be a way to separate different values inside a row as well as a way to separato different rows (Excel uses "," to separate entries within a row, and ";" to separate entries in different rows). OpenOffice.org 2.0 always generates absolute Sheetnames if it's not the current sheet (i.e., '$' always prefixes a Sheetname). However, the OpenDocument 1.0 specification 8.3.1 clearly makes the "$" optional in front of a sheetname. *Discussion:* ??? Named values are actually named expressions in Excel, and can have references ("the cell row-2 and column+1 from here") and be reused in many different cells. Thus, there appears to be a real need to support an R[]C[]-style reference format, possibly at level 3. Probably should also support an argument format, so that they can be used as user-defined formulas, using the sytax compatible with the other formula-like constructs in OpenDocument (e.g., $1). However, once you start changing the fundamentals of how cell references are made, should we abandon the OOo storage format? After all, the current syntax already doesn't well-support arbitrary range operations. The Gnumeric folks argue for using their syntax instead; details of this proposal have not yet been seen. See mailing list. *TBD:* Need to specify how to save relative sheet names, and how to handle ' embedded in sheet names. *TBD:* Need to specify how to save "sheet deleted" ("#REF") sheet references. [edit </index.php?title=Expression_Syntax&action=edit§ion=8>] An implementation MUST write a cell range address ":" separator, and MUST NOT write the cell range extension operator, where it is possible to do so. That is, an implementation MUST write the expression "[.A1:.A3]" instead of "[.A1]:[.A3]" even though they are semantically identical. Where there is a choice of which cells to join inside a cell range address, an implmentation MUST choose the leftmost one. *Rationale:* This specification does not mandate a cell concatenation operator at level 2, since OpenOffice.org 1.1.3 and 2.0 do not include one. However, Microsoft Excel includes this operation. In Excel this is represented using the comma (",") character, the same symbol used as the parameter separator for function calls. This is a very poor choice with a number of unfortunate ramifications. One problem is that concatentating cells in a function parameter requires surrounding the cells with additional parentheses in Excel display syntax. For example, AREAS(A1:A3,B2:B4) is a function call with two parameters, while AREAS((A1:A3,B2:B4)) is a function call with one parameter. Another problem is that the comma interferes with the use of "," as a decimal separator (as it is used in many locales) when using traditional entry formats (which do not mark cell addresses with "[".."]"). Gnumeric uses "+" as the cell concatenation operator in its display, but this has its own problems: it interferes with the use of "+" as a matrix addition operator. There are many alternatives, e.g., other characters (such as "~", "|", and "\"), or requiring a function syntax for this purpose. The character "_" would be a poor choice because formula variables can also include this character in their name (complicating parsing when "[..]" are not used "#147; is B3_B2 a formula variable, or are B2 and B3 concatenated?). Because of these issues, this document proposes using "~" as the cell concatenation symbol. A special range-extension operator is needed to deal with cases such as more than one range extension, where one or more sides is a name, or where one or more sides is calculated as the result of an expression. But where possible it shouldn't be used, for maximum portability, since level 2 systems don't necessarily implement it. The leftmost rule is noted so that implementations are likely to generate the same text for the same formula, eliminating the appearance of change when none occurs. Issue: Subtables. "Level 3 implmentations MUST support addressing subtables (tables in tables), using this syntax: cellRangeAddress ::= cellRangeAddress.level2 | "[A-Z]+" ":" "[A-Z]+" | /* whole column(s) */ "[0-9]+" ":" "[0-9]+" /* whole row(s) */ cellAddress ::= "$"? sheet_name ( "." in_sheet_address )+ *Rationale:* OpenDocument 1.0 section 8.3.1 discusses subtables, and requires this syntax when addressing subtables. However, it doesn't require that subtables actually be supported in its syntax for cellAddress. [edit </index.php?title=Expression_Syntax&action=edit§ion=14>] Extension: R1C1 Notation Level 3 implementations MUST support R1C1 notation. cellRangeAddress ::= cellRangeAddress.level2 | "[A-Z]+" ":" "[A-Z]+" | /* whole column(s) */ "[0-9]+" ":" "[0-9]+" /* whole row(s) */ cellAddress ::= "$"? sheet_name ( "." in_sheet_address )+ in_sheet_address ::= ( "$"? [A-Z]+ "$"? [0-9]+ ) | ( [-+$] [0-9]+ [-+$] [0-9]+ ) The symbols + and - indicate a relative reference; "$" indicates an absolute reference. The initial "r" /MUST/ be in lowercase letters. *Rationale:* R1C1 notation tends to produce much better compression results, it supports large spreadsheets (larger than IV65536) more easily, and better supports textual XML diffs. In earlier versions this had a lowercase 'r' and 'c', but this was unnecessary. *TBD:* It's not clear this particular syntax is a good idea; discussion welcome. *Rationale:* OpenOffice.org 2.0 supports source locations for Cell addresses like A1, but not for named values, so this was moved to level 3. [edit </index.php?title=Expression_Syntax&action=edit§ion=16>] Extension: In-line arrays The syntax above does not include a specification for arrays specified in-line. Microsoft Excel includes a way to write arrays of constants without having to create a cell range for them. OpenOffice.org 1.1.3 does not support this. Excel's display format uses a comma as the separator between values in a row, and a semicolon for the separator between rows. Here semicolon is used as the separator between values in a row (again, so different locales will have a simpler time entering data when entering data), and the pipe symbol "|" is proposed as the symbol separating rows (with absolutely no precedent). array_spec ::= "{" matrix ( '_' matrix )* "}" matrix ::= ( matrix_row ( column_separator matrix_row )* ) ? matrix_row ::= expression ( ";" expression )* column_separator ::= "|" *TBD:* This is a proposed syntax, which in addition permits non-constant values in the array (something Excel does not permit) and concatenation (which Excel may not permits either). Gnumeric is adding the ability to handle matrixes and NOT forcing them into a fixed cell array; these are called dynamic matrices. Does this require syntactic extensions? Need to discuss in types and functions. Need to re-investigate in-line arrays and other array function syntax issues. Should we add arguments (e.g., "$1") so that anim:formula, etc. are easier to support AND so that named expressions are much more effective (they'd then allow the easy definition of new functions)? --- David A. Wheeler
[Date Prev] | [Thread Prev] | [Thread Next] | [Date Next] -- [Date Index] | [Thread Index] | [List Home]