OASIS Mailing List ArchivesView the OASIS mailing list archive below
or browse/search using MarkMail.

 


Help: OASIS Mailing Lists Help | MarkMail Help

office-formula message

[Date Prev] | [Thread Prev] | [Thread Next] | [Date Next] -- [Date Index] | [Thread Index] | [List Home]


Subject: Re: [office-formula] Grammar



Below is a new draft for formula syntax.  What do you think?
I created what's below by combining the
OpenFormula syntax specification and the grammar
posted by Eike Rathke.  Technically, the two syntaxes are very close
(expressions meeting one would typically meet the other), but
by combinining the features of each presentation, we'll hopefully
end up with a combination that has the advantages of both.
It also needs to be a really clean and clear presentation, since
if it's clear, it's less likely to have errors in it.

But hopefully, this will help us get there.  This is a little
awkward without a Wiki, but we'll do our best.

Stuff stolen from OpenFormula:
  * Depend on data typing, not syntax, to determine if certain operations
    illegal.  This makes the syntax much more flexible in terms of
    handling new types (it allows existing operations to handle new types).
    In particular, see "ReferenceOp" - by doing things this way, it's
    VERY easy to have functions return references and have them interact
    with reference constants, with extremely regular rules.
  * Intersection is ALWAYS '!', cell concatenation is always '~'.
    NLF is trivially handled by having NLF labels be NamedExpressions that
    happen to return references.
  * Define whitespace handling separately; this greatly simplifies the
    rules, and it's how an implementor is likely to do lexical analysis 
anyway.
  * Specify operator precedence
  * Don't try to use "(Expression - String)" to limit applications; it
    doesn't really work :-).
  * Empty parameters (to support Gnumeric, Excel, SheetToGo)
  * Include a syntax for whole-row, whole-column.
  * NamedExpression may have an external Source.

Stuff stolen from Eike Rathke's grammar:
  * Use these naming conventions and organization
  * Include namespace identification in grammar
  * Initial "=" for forced recalculation
  * Just merge into a single syntax (instead of separating by level)
  * Use CamelCase
  * Use spaces in the EBNF for readability.
  * Interleave syntax and discussion more freely.

Other stuff:
  * String definition changed - now allows all control chars except ASCII 0.
  * Functions _always_ take a ParameterList, which has 0 or more
    parameters.  In practice, functions don't ask "did I get a
    parameterlist"... they ask "how many parameters did I get"
    (with 0 being a possible answer for some functions).  This makes
    the syntax match actual practice.

The text below looks a lot like Eike's proposal, except in the
section on (inline) cell references... where it doesn't :-).


==========================================================
Formula Syntax

Any implementation MUST support the syntax as described below
(except for the noted exceptions).
Any reading implementation must be able to read /at least/ the syntax
defined in this section (it may accept various extensions), and any
writing implementation must generate this format when a user creates
data that can comply with it.

This syntax is defined using the BNF notation of XML version 1.1, as
described at http://www.w3.org/TR/2004/REC-xml11-20040204/#sec-notation.
Note that each syntax rule is defined using "::=".
Text is interspersed to explain these syntax rules and their implications.

Note that formulas are typically embedded inside an XML
document. When this occurs, various characters (such as "<", ">", '"',
and "&") must be escaped, as described in the XML specification.
In particular, "<"  becomes &lt;, '"' becomes &quot;, and "&"
becomes &amp;.

  FormulaContent          ::=     Namespace Formula

  Namespace               ::=     Namespace_in_XML ':'
  Namespace_in_XML        ::=     http://www.w3.org/TR/REC-xml-names

The namespace tells the reading application how to treat formula content
written by a specific application or an application conforming to
a certain dialect of this formula language. For OpenDocument files
written by OpenOffice.org versions prior to this specification (including
versions 2.0, 2.0.1, 2.0.2, and 2.0.3), it is
xmlns:oooc="http://openoffice.org/2004/calc";.

TODO: The namespace for this specification is ???.

  Formula                ::=     '=' ForceRecalc? Expression
  ForceRecalc            ::=     '='

NOTE: I changed "Expression Expression*" to just "Expression" here.
If you want a list of expressions, that should turn into a single
expression anyway.

If the second '=' is present, this is a "forced recalculate" hint.
Typical implementations optimize what is recalculated; they typically
only recalculate what is displayed and what those values depend on
(transitively).  A forced recalculate hints that this formula should be
recalculated whenever recalculation takes place, even if it isnt'
apparantly needed (e.g., due to side-effects of macros).

  Expression             ::=     Number |
                                 String |
                                 Array |
                                 PrefixOp Expression |
                                 Expression PostfixOp |
                                 Expression InfixOp Expression |
                                 '(' Expression ')' |
                                 FunctionName '(' ParameterList ')' |
                                 NamedExpression |
                                 Reference

  Number                  ::=     WrittenNumber |
                                  '.' [0-9]+ ([eE] [-+]? [0-9]+)?
  WrittenNumber           ::=     [0-9]+ ('.' [0-9]+)? ([eE] [-+]? [0-9]+)?

Numbers are written according to the "C" or en-US locale;
the '.' dot is used as the decimal separator and the group (AKA thousand)
separators are not written. Optional "E" or "e"
denote scientific notation.
Syntactically, negative numbers are simply numbers with a
prefix "-" operator; implementations MAY and typically do
optimize this operator by performing the negation and storing the
negative number directly.

Readers SHOULD be able to read a fraction that starts with '.'
(without a leading zero). Writers MUST write numbers with a leading digit
(the WrittenNumber format), and MAY NOT write numbers with a leading '.'.


  String                  ::=     '"' ([^"#x00] | '""')* '"'

A literal double-quote character (") as string content is escaped by
duplicating it.
Note that since a formula is normally stored as an XML attribute, all
double-quotes are written as their entity &quot;.

NOTE: Changed to allow all control characters except ASCII 0.

  Array                   ::= TODO, which separators?

  PrefixOp                ::=     '+' | '-'

These are the unary prefix operators, prefix plus (a no-op) and prefix minus
(negation).

  PostfixOp               ::=     '%'

This is the unary percentage operator,
dividing the preceding expression by 100.

  InfixOp                 ::=     ArithmeticOp | ComparisonOp | '&' |
                                  ReferenceOp

The '&' ampersand is the string concatenation operator.
Note that since a formula is typically stored as an XML attribute, an '&'
ampersand is written as the entity &amp;

  ArithmeticOp            ::=     '+' | '-' | '*' | '/' | '^'

These are Addition, Subtraction, Multiplication, Division, and
Exponentiation (power).

  ComparisonOp            ::=     '=' | '<>' | '<' | '>' | '<=' | '>='

These are EqualTo, UnequalTo, LessThan, GreaterThan, LessThanOrEqualTo,
GreaterThanOrEqualTo.

  ReferenceOp             ::=  IntersectionOp | CellConcatenationOp | 
RangeOp
  IntersectionOp          ::=  '!'
  CellConcatentationOp    ::=  '~'
  RangeOp                 ::=  ':'

These are operations on references: Intersection (which some user
interfaces display as a space), cell concatenation
(which some user interfaces display as the function parameter seperator),
and range extension (given two ends, creates a range that covers both ends).
Note that since they are defined as general operators, either or both
sides may be the output of a function.
The operations are defined on their own lines, in case this is helpful
in describing "alternative" syntaxes.

  FunctionName            ::=     Identifier

  Identifier              ::=     LetterXML (LetterXML | DigitXML |
                                    '_' | '.' | CombiningCharXML)*
  LetterXML               ::=     http://www.w3.org/TR/REC-xml/#NT-Letter
  DigitXML                ::=     http://www.w3.org/TR/REC-xml/#NT-Digit
  CombiningCharXML        ::= 
http://www.w3.org/TR/REC-xml/#NT-CombiningChar

Note that the function names defined in this specification use English text,
not a translated UI representation.
Thus predefined function names actually meet the stricter definition
[A-Za-z] [A-Za-z0-9_.]*
However, letter characters as defined for an Identifier are allowed, since
user-defined functions may use such characters.

Function names are case-insensitive, so SUM and Sum are the same function.
However, traditionally spreadsheet function names are written in
all upper case, so this tradition is followed in this specification.
Implementations SHOULD write function names in all upper case.

*Rationale:* Function names are often displayed in all upper case by
both implementations, showed this way in documentation, and saved in
uppercase, owing to the influence of the first spreadsheet program
(VisiCalc). By saying implementations SHOULD save in all uppercase, this
increases the likelihood that implementations can trivially write back
formulas and, if the formulas are unchanged, produce an identical
result. This can be helpful to systems that depend on detecting
differences, by minimizing the number of unnecessary differences.
Some functions always produce the same value, and are thus constants;
PI() and TRUE() are examples.

*Note:* Microsoft Excel accepts empty parameters in any position;
OpenOffice 1.1.3 does not, so the current syntax above doesn't either.
See below (level 3) for how this rule can be relaxed. Typical
implementations will have many built-in functions, and most
implementations also support one or more ways to create user-defined
functions.

*Rationale:* Excel uses the "," as the function parameter separator, but
OpenFormula uses ";" instead. Many locales use "," as the decimal
separator; using the semicolon as the parameter separator eliminates
confusion and the risk of incorrect implementation.


  ParameterList    ::= /* empty */ |
                       Parameter ( Separator EmptyOrParameter )* |
                       Separator EmptyOrParameter   /* First param empty */
                         ( Separator EmptyOrParameter )*
  EmptyOrParameter ::= /* empty */ | Parameter
  Parameter        ::= Expression
  Separator        ::= ';'

Functions may be given a parameter list.
Note that given this definition, SIN() has a parameter list with
0 parameters, NOT one parameter that happens to be empty.

TODO: If there is a need for passing 1 parameter that is empty, could
define a function like EMPTY() for representing an empty parameter.


  Reference ::= '[' Source? RangeAddress ']'
  RangeAddress ::=  CellAddress ( ':' CellAddress )? |
                    SheetName "." ColumnLabel ':' SheetName "." 
ColumnLabel |
                    SheetName "." RowLabel ':' SheetName "." RowLabel
  CellAddress ::= SheetName "." ColumnLabel RowLabel
  SheetName   ::= /* empty */ | '$' ( [A-Za-z0-9]+ | ("'" ([^']|"''")+ 
"'" ) )
  ColumnLabel             ::= [A-Z]+
  RowLabel                ::= [1-9] [0-9]*

References always begin with '[' in the storage format; this immediately
disambiguates cell addresses from function names and named expressions.
User interfaces need not (and often do not) display the surrounding [...].
Sheetnames include ' characters by doubling them.
Column labels MUST be in uppercase; this makes it easier to distinguish
between column labels and many named expressions.
The syntax above supports whole-row and whole-column refernces, but they
must always have a range marker.
Note that

TODO: Ideally we'd allow Sheetnames to not start with "$", but then
if given ['...,  is that a start of a SOURCE or a SHEETNAME?

TODO: R1C1 and relative sheetnames for CellAddress?

TODO: Errors.

Note:
See OpenDocument 1.0 section 8.3.1's definition of
cellRangeAddress and cellAddress.
The "$" markings note that the following component is
absolute, otherwise it is relative; these markings have no effect on
calculation (they are intended for use when copying or moving formulas).

*Rationale:* Cell addresses in OpenFormula begin with "[" and end with a
"]"; this makes parsing simpler, faster, and more reliable. Cell
addresses are specified in A1 notation, not an R1C1 notation, as
required by OpenDocument 1.0 section 8.1.3 subsection "Formula". Not
using R1C1 notation can in some cases cause an increase in compressed
file size (because copied formulas are shown differently and thus do not
compress as well). However, A1 notation is much easier for humans to
understand, so using A1 format is likely to increase reliability
(because it is more likely to be correctly generated and interpreted).
If this is a serious problem, Array formulas can usually be used to
achieve the same results. Both R1C1 and A1 /could/ be allowed, but then
different spreadsheets are likely to generate different characters for
the same cell, and thus create "differences" that do not exist even in
simple spreadsheets. Only uppercase characters are allowed, for the same
reason.

*Note:* Typical spreadsheet displays will often /not/ display or require
input of the square brackets. OpenDocument 1.0 section 8.3.1 also
defines a cellRangeAddressList (space-separated ranges) which is used by
several OpenDocument constructs. OpenFormula does not use
cellRangeAddressList; cell concatenation can do the same thing when this
is necessary.

*Rationale:* Supporting whole-rows and whole-columns is useful;
Excel 2003 supports this. OpenOffice 2.0 does not directly
support it, but it fakes it very well by translating a "whole row" or
"whole column" entry into a specific range. As long as spreadsheets only
go from A1 to IV65536, there is no difference, and OpenOffice.org 2.0's
approach is probably fine for many people. However, spreadsheet
implementations of the future are likely to support larger limits, at
which point these spreadsheets will quietly fail. It would be wiser to
directly support these selectors so that large spreadsheets of the
future would not silently fail.


  NamedExpression         ::=     Source? Identifier
  Source ::= "'" URI "'" "#"

A NamedExpression references another expression, possibly in a completely
different spreadsheet.

TODO: In Excel, NamedExpressions can be sheet-specific or global, but
there's no syntax to reference one or the other (a potential problem for
external references in particular).  Should we do better?

The URI in the source_location MUST support absolute URIs (URLs are
URIs). Implementations SHOULD support relative URLs, which can be
distinguished because they do not begin with [A-Za-z]+ ":". If a
relative URL looks like an absolute URL, it MUST save relative URLs
using a "./" prefix, e.g., './mystuff.xls'. Implementations MUST support
they file:/// prefix; they MAY support others, but beware of the
security ramifications (an attacker could use this ability to track
someone when they download a file to extract data from it, or use this
to send malicious data). Implementations SHOULD ask the user before
loading data from any external data sources.

TODO: Discuss IRIs?

TODO: exceptions?


Whitespace

Whitespace (space, tab, newline, and carriage return) is ignored in the
default formulas syntax, except in the contents of string constants. A
line break is normally represented by a single ASCII character 10 (\n);
when embedded in an XML document this is typically represented as
"&#10;". Implementations SHOULD retain whitespace entered by the
original formula creator, and SHOULD NOT add additional whitespace
unless directed to do so during the process of editing a formula.

*Note:* In Microsoft Excel's display format, the space is used as the
intersection operator. This is very confusing, so the OpenFormula
exchange format uses "!" instead for intersection.
[edit </index.php?title=Expression_Syntax&action=edit&section=9>]


Precedence

The operator have the following precedence, from lowest to highest:

Associativity	Operator(s)	Comments
left	<, =, >, <=, >=, <>, !=, ==
Less than, equal to, greater than, less than or equal to, greater than
or equal to, not equal to, not equal to (alternative), equal to
(alternative).

left	&	
Binary operation string concatenation. Note that unary (prefix) + and
"#147; has a different priority. Note that "&" must be escaped when
included in an XML document, typically as "&".

left	+,-	
Binary operations add, subtract. Note that unary (prefix) + and "#147;
have a different priority.

left	
*,/
Multiply, divide. Division does not truncate, so 1/2 is equal to 0.5.

right	
^
Power (2^3 is 8).

left	
%
Postfix unary operator % (divide by 100). Note that this is legal with
expressions (e.g., [.B1]%) and it can be duplicated (1), and it does
/not/ change the meaning of other operations such as "+".

right	
+,-
Prefix unary operators, e.g., -5 or -[.A1]. Note that these have a
difference precedence than add and subtract.

left	
!
Cell address intersection ([.A1:.C4]![.B1:.B5] is [.B1:.B4]).

left
~
Reference union.



TODO: Precedence of ":" (when not inside [...]).

Precedence can be overridden by using parentheses, so "2+3*4" computes
14 while "(2+3)*4" computes 20. Implementations' user interfaces may
display these operators differently or with a different precedence, but
when exchanging formulas they /MUST/ use the precedence rules here.
Implementations /SHOULD/ retain and regenerate "unnecessary" parentheses
and white space, since these are added by people to improve readability.

The precedence of unary prefix "-" compared to "^" has initiated more
than a little research. In this specification prefix "-" has higher
precedence than "^", because "-2^2" is "4" in Microsoft Excel,
OpenOffice.org, Gnumeric, and probably many others. This is not
universal; prefix "-" has a /lower/ precedence on Lotus 1-2-3, Quattro
Pro, and Excel's own Visual Basic (Walkenbach, 2004, pg. 579), so these
products will need to insert and remove parentheses when reading/writing
expressions in OpenFormula where this matters. However, the /vast/
majority of spreadsheet implementations used today treat "-" as higher
precedence, so as usual we identified and documented common conventions.
Using a different precedence would create a high likelihood of error and
additional work in implementations, for no good reason. Ideally
OpenFormula could also be used for attribute draw:formula as defined in
OpenDocument 1.0 section 9.5.5 and attribute anim:formula as defined in
13.3.2; neither of these attributes includes an exponentiation operator
"^", so this issue of precedence causes no problem. (Note that they
require "," as the function separator, and have other requirements, as
discussed later in this document).

Prefix "-" and "+" are right-associative, not non-associative, because
"=--[.B3]" is legal (and it converts B3 to a number, so it can have an
effect). Cell address intersection has an even higher precedence, so
that a unary minus in front an intersection will work correctly.

*Rationale:* Walkenbach gives Microsoft Excel 2003's precedence levels
as (lowest to highest, note that the book gives the reverse order)
comparison (such as "="), "&", "+" and "-", "*" and "/", "^", "%", and
unary "-" (negation). (Walkenbach, 2004, pg 38). By having the
precedence be the same as Excel, it simplifies transition.

*Note:* this format is intentionally similar to traditional
presentations of spreadsheet formulas, which reduces the likelihood of
error or misunderstanding. Although the user interface is not specified
here, the representation is intentionally chosen so that formulas can
"round trip" to this format and back without loss using typical formula
representations. It would be possible to replace some or all uses of "%"
with "/100", and to replace "^" with POWER(), but this would cause the
display format to change once it was saved and reloaded.






MISCELLANEOUS TEXT - stuff yet to be dealt with...


Future versions could handle OpenDocument's
anim:formula and draw:formula. They will need to add variable syntax
such as $0, $a, and $. In addition, the issue of using "," instead of
";" as a function parameter separator will need to be resolved to accept
those formulas. Future versions may also accept constant array
specifiers, e.g., "{" list "}" - and in that list, there must be a way
to separate different values inside a row as well as a way to separato
different rows (Excel uses "," to separate entries within a row, and ";"
to separate entries in different rows).

OpenOffice.org 2.0 always generates absolute Sheetnames if it's not the
current sheet (i.e., '$' always prefixes a Sheetname). However, the
OpenDocument 1.0 specification 8.3.1 clearly makes the "$" optional in
front of a sheetname.

*Discussion:* ??? Named values are actually named expressions in Excel,
and can have references ("the cell row-2 and column+1 from here") and be
reused in many different cells. Thus, there appears to be a real need to
support an R[]C[]-style reference format, possibly at level 3. Probably
should also support an argument format, so that they can be used as
user-defined formulas, using the sytax compatible with the other
formula-like constructs in OpenDocument (e.g., $1). However, once you
start changing the fundamentals of how cell references are made, should
we abandon the OOo storage format? After all, the current syntax already
doesn't well-support arbitrary range operations. The Gnumeric folks
argue for using their syntax instead; details of this proposal have not
yet been seen. See mailing list.

*TBD:* Need to specify how to save relative sheet names, and how to
handle ' embedded in sheet names.

*TBD:* Need to specify how to save "sheet deleted" ("#REF") sheet
references.
[edit </index.php?title=Expression_Syntax&action=edit&section=8>]


An implementation MUST write a cell range address ":" separator, and
MUST NOT write the cell range extension operator, where it is possible
to do so. That is, an implementation MUST write the expression
"[.A1:.A3]" instead of "[.A1]:[.A3]" even though they are semantically
identical. Where there is a choice of which cells to join inside a cell
range address, an implmentation MUST choose the leftmost one.

*Rationale:* This specification does not mandate a cell concatenation
operator at level 2, since OpenOffice.org 1.1.3 and 2.0 do not include
one. However, Microsoft Excel includes this operation. In Excel this is
represented using the comma (",") character, the same symbol used as the
parameter separator for function calls. This is a very poor choice with
a number of unfortunate ramifications. One problem is that
concatentating cells in a function parameter requires surrounding the
cells with additional parentheses in Excel display syntax. For example,
AREAS(A1:A3,B2:B4) is a function call with two parameters, while
AREAS((A1:A3,B2:B4)) is a function call with one parameter. Another
problem is that the comma interferes with the use of "," as a decimal
separator (as it is used in many locales) when using traditional entry
formats (which do not mark cell addresses with "[".."]"). Gnumeric uses
"+" as the cell concatenation operator in its display, but this has its
own problems: it interferes with the use of "+" as a matrix addition
operator. There are many alternatives, e.g., other characters (such as
"~", "|", and "\"), or requiring a function syntax for this purpose. The
character "_" would be a poor choice because formula variables can also
include this character in their name (complicating parsing when "[..]"
are not used "#147; is B3_B2 a formula variable, or are B2 and B3
concatenated?). Because of these issues, this document proposes using
"~" as the cell concatenation symbol.

A special range-extension operator is needed to deal with cases such as
more than one range extension, where one or more sides is a name, or
where one or more sides is calculated as the result of an expression.
But where possible it shouldn't be used, for maximum portability, since
level 2 systems don't necessarily implement it. The leftmost rule is
noted so that implementations are likely to generate the same text for
the same formula, eliminating the appearance of change when none occurs.


Issue: Subtables.
"Level 3 implmentations MUST support addressing subtables (tables in
tables), using this syntax:
    cellRangeAddress ::= cellRangeAddress.level2 |
                         "[A-Z]+" ":" "[A-Z]+" | /* whole column(s) */
                         "[0-9]+" ":" "[0-9]+"   /* whole row(s) */
  cellAddress ::= "$"? sheet_name ( "." in_sheet_address )+

*Rationale:* OpenDocument 1.0 section 8.3.1 discusses subtables, and
requires this syntax when addressing subtables. However, it doesn't
require that subtables actually be supported in its syntax for cellAddress.
[edit </index.php?title=Expression_Syntax&action=edit&section=14>]


         Extension: R1C1 Notation

Level 3 implementations MUST support R1C1 notation.

    cellRangeAddress ::= cellRangeAddress.level2 |
                         "[A-Z]+" ":" "[A-Z]+" | /* whole column(s) */
                         "[0-9]+" ":" "[0-9]+"   /* whole row(s) */
  cellAddress ::= "$"? sheet_name ( "." in_sheet_address )+
  in_sheet_address ::= ( "$"? [A-Z]+ "$"? [0-9]+ ) | 

                       ( [-+$] [0-9]+ [-+$] [0-9]+ )

The symbols + and - indicate a relative reference; "$" indicates an
absolute reference. The initial "r" /MUST/ be in lowercase letters.

*Rationale:* R1C1 notation tends to produce much better compression
results, it supports large spreadsheets (larger than IV65536) more
easily, and better supports textual XML diffs. In earlier versions this
had a lowercase 'r' and 'c', but this was unnecessary.

*TBD:* It's not clear this particular syntax is a good idea; discussion
welcome.


*Rationale:* OpenOffice.org 2.0 supports source locations for Cell
addresses like A1, but not for named values, so this was moved to level 3.
[edit </index.php?title=Expression_Syntax&action=edit&section=16>]


         Extension: In-line arrays

The syntax above does not include a specification for arrays specified
in-line. Microsoft Excel includes a way to write arrays of constants
without having to create a cell range for them. OpenOffice.org 1.1.3
does not support this. Excel's display format uses a comma as the
separator between values in a row, and a semicolon for the separator
between rows. Here semicolon is used as the separator between values in
a row (again, so different locales will have a simpler time entering
data when entering data), and the pipe symbol "|" is proposed as the
symbol separating rows (with absolutely no precedent).

    array_spec ::= "{" matrix ( '_' matrix )* "}"
    matrix ::= ( matrix_row ( column_separator matrix_row )* ) ?
    matrix_row ::= expression ( ";" expression )*
    column_separator ::= "|"

*TBD:* This is a proposed syntax, which in addition permits non-constant
values in the array (something Excel does not permit) and concatenation
(which Excel may not permits either).

Gnumeric is adding the ability to handle matrixes and NOT forcing them
into a fixed cell array; these are called dynamic matrices. Does this
require syntactic extensions? Need to discuss in types and functions.

Need to re-investigate in-line arrays and other array function syntax
issues.

Should we add arguments (e.g., "$1") so that anim:formula, etc. are
easier to support AND so that named expressions are much more effective
(they'd then allow the easy definition of new functions)?




--- David A. Wheeler



[Date Prev] | [Thread Prev] | [Thread Next] | [Date Next] -- [Date Index] | [Thread Index] | [List Home]