office-formula message

Subject: Re: [office-formula] Grammar
From: "David A. Wheeler" <dwheeler@dwheeler.com>
To: office-formula@lists.oasis-open.org
Date: Wed, 01 Mar 2006 16:12:59 -0500
Eike Rathke wrote:
> Hi,
>
> Here's some grammar I developed having the capabilities of Excel and
> OOoCalc in mind. Most things are similar to OpenFormula, some are
> a little bit different.
Thanks for posting this! Here are a few comments trying to compare this 
with OpenFormula. Hopefully we can combine the features of these 
grammars to create the world's best grammar :-).

>  If we had a wiki I'd placed it there..
>   
Understood :-(.  I've sent an email asking for status-of-the-Wiki.

> Definition of the Formula Attribute
>
> FormulaContent          ::=     Namespace Formula
>   

In OpenFormula the namespace was considered external to the syntax, but 
that's really a presentational nit; the result is identical.  Since ODF 
mandated this, that's not surprising :-).

Note that there are LOTS of other places where ODF uses formula-like 
expressions; I'd like to be able to expand and cover those too, or at 
least make it easy for implementations to reuse code for them.


>
> Formula                 ::=     '=' '='? S* Expression S* Expression*
>
> If a second '=' is present, the formula has to be recalculated whenever
> one of its predecessors changes value. This can be used to force formula
> cells to be recalculated that contain calls to macros or AddIns with
> side effects. If no second '=' is present, the cell can be recalculated at any time when needed.
>   
This second "=" is odd; I don't see the value of this capability.  Could 
you enlighten me?

In my mind, shouldn't implementations figure out when to recalculate, 
rather than trying to embed this in the syntax?  In most implementations 
I'm aware of, formulas are normally ALWAYS recalculated if their 
predecessors change (assuming automatic recalc is on).  But if you have 
manual recalc on, then this second "=" shouldn't have an effect anyway.  
For those few functions where figuring out the predecessor is "too hard" 
(aka the "volatile functions"), they'll always recalc, but that's a 
trade-off between implementation complexity and run-time effort that in 
my mind.  If an implementor WANTS to write more code to be clever about 
reducing recalculation time, they should be allowed to do so.

Contrast this with OpenFormula section 5.2:


"Implementations of OpenFormula recalculate formulas when its 
information is needed. Typical implementations will note what values a 
formula depends on, and when those dependent values are changed, it will 
re-execute the formulas that depend on them to produce the new results 
(choosing the formulas in the right order based on their dependencies). 
Implementations may recalculate when a value changes (this is termed 
/automatic recalculation/) or on user command (this is termed /manual 
recalculation/)."

"Some functions' dependencies are difficult to determine and/or should 
be recalculated more frequently. These include functions that return 
today's date or time, random number generator functions (such as RAND 
<http://www.openformula.org/RAND>), or ones that indirectly determine 
the cells to act on. Many implementations /always/ recalculate formulas 
including such functions whenever a recalculation occurs. Functions that 
are /always/ recalculated whenever a recalculation occurs are termed 
/volatile/ functions. Functions that are often volatile functions 
include AREAS 
<http://www.openformula.org/index.php?title=AREAS&action=edit>, CELL 
<http://www.openformula.org/index.php?title=CELL&action=edit>, COLUMNS 
<http://www.openformula.org/COLUMNS>, INDEX 
<http://www.openformula.org/INDEX>, INDIRECT 
<http://www.openformula.org/index.php?title=INDIRECT&action=edit>, NOW 
<http://www.openformula.org/NOW>, OFFSET 
<http://www.openformula.org/index.php?title=OFFSET&action=edit>, RAND 
<http://www.openformula.org/RAND>, ROWS 
<http://www.openformula.org/ROWS>, and TODAY 
<http://www.openformula.org/TODAY>."




>
> WhiteSpace (S)
> S                       ::=     #x20
>   
Many spreadsheet implementations, including Excel, also allow newline 
(\n).  Obviously in an XML attribute newlines are converted into their 
XML-escaped form.

>
> Expression              ::=     Number |
>                                 String |
>                                 Array |
>                                 PrefixOp S* (Expression - String) |
>                                 (Expression - String) S* PostfixOp |
>                                 Expression S* InfixOp S* Expression |
>                                 '(' S* Expression S* ')' |
>                                 FunctionName S* '(' S* ParameterList? S* ')' |
>                                 Reference |
>                                 NamedExpression
>   
OpenFormula's _presentation_ of spaces is different, but the result 
APPEARS to be the same.

I disagree with the use of (Expression-String) everywhere, though.  
OpenFormula doesn't do that.  Strings are expressions, and should be 
treated as expressions with data type string. If a particular OPERATOR 
doesn't like the string type, that's a data type error... not a syntax 
error.

The OpenFormula syntax uses the term "formula_variable" instead of 
"NamedExpression", and in comparison I think that's a weakness of 
OpenFormula's naming.   I think "NamedExpression" is a far more accurate 
name for that.


> Number                  ::=     [0-9]+ ('.' [0-9]+)? ([eE] [-+]? [0-9]+)?
>
> According to the "C" or en-US locale, the '.' dot is used as the decimal
> separator, group (AKA thousand) separators are not written. "E" or "e"
> denote scientific notation. It is advisable that readers are able to
> read a fraction that starts with '.' without a leading zero, as there
> may be implementations that don't write a leading zero on such numbers.
>   

OpenFormula has the same idea.  It requires the ability to READ 
leading-"." numbers, and includes the syntax.

>
> String                  ::=     '"' ([^"#x00-#x1f] | '""')* '"'
>
> A literal double-quote character (") as string content is escaped by
> duplicating it. All content is UTF-8 encoded.
> Note that since the formula is stored as an XML attribute, all
> double-quotes are written as their entity &quot;
>   
I don't think we want to mandate UTF-8 encoding... that's really the 
business of the enclosing XML.


> Array                   ::= TODO, which separators?
>   
OpenFormula does this: "Here semicolon is used as the separator between 
values in a row (again, so different locales will have a simpler time 
entering data when entering data), and the pipe symbol "|" is proposed 
as the symbol separating rows (with absolutely no precedent)."

It also supports multi-arrays with "_"... I don't remember that being IN 
there, so it probably should come out :-).


> PrefixOp                ::=     '+' | '-'
>
> Unary operators.
>   

OpenFormula calls these "unary_op"... they should probably be called 
"PrefixOp" since a postfix op is ALSO a unary op.

>
> PostfixOp               ::=     '%'
>
> Unary percentage operator, dividing the preceding expression by 100.
>   
Yup. Everyone seems to do it this way.

> InfixOp                 ::=     ArithmeticOp | ComparisonOp | '&'
>
> The '&' ampersand is the string concatenation operator.
> Note that since the formula is stored as an XML attribute, an '&'
> ampersand is written as the entity &amp;
>   
This treats cell intersection as NOT an infix operator, which I think is 
suboptimal.  By trying to separate the cell intersection operator 
syntactically, precedence is dealt with nonuniformly.

>
> ArithmeticOp            ::=     '+' | '-' | '*' | '/' | '^'
>
> Addition, Substraction, Multiplication, Division, Exponentiation.
>
>
> ComparisonOp            ::=     '=' | '<>' | '<' | '>' | '<=' | '>='
>
> EqualTo, UnequalTo, LessThan, GreaterThan, LessThanOrEqualTo,
> GreaterThanOrEqualTo.
>
>
> FunctionName            ::=     Identifier
>
> Note that in practice a FunctionName normally is stored using its
> English form and not the translated UI representation, thus conforms to
> [A-Za-z] [A-Za-z0-9_.]*
> However, in theory all letter characters as defined for an Identifier are allowed.
>
>
> Identifier              ::=     LetterXML (LetterXML | DigitXML | '_' | '.' |
>                                 CombiningCharXML)*
> LetterXML               ::=     http://www.w3.org/TR/REC-xml/#NT-Letter
> DigitXML                ::=     http://www.w3.org/TR/REC-xml/#NT-Digit
> CombiningCharXML        ::=     http://www.w3.org/TR/REC-xml/#NT-CombiningChar
>
> ParameterList           ::=     Parameter ( S* ';' S* Parameter )*
>
> Parameter               ::=     Expression | ReferenceList
>   
Why this distinction?  Shouldn't a "referenceList" be an expression?

> ReferenceList           ::=     '(' S* Reference ( S* ';' S* Reference )* S* ')'
>
> A ReferenceList as one argument is only accepted by spreadsheet
> functions that handle a cell range at this parameter place.
>   

Is this a "cell union"?  If so, the double-use of ";" to separate function
parameters AND to make a union is awkward.

>
> Reference               ::=     CellReference |
>                                 RangeReference |
>                                 Intersection |
>                                 ColumnLable |
>                                 RowLable
>
> Intersection            ::=     Reference S* '!' S* Reference |
>                                 ColumnLable S+ RowLable |
>                                 RowLable S+ ColumnLable
>   

Here the syntax treats "!" specially.  I think it'd be easier to keep the
syntax uniform if we just treat "!" as yet-another-operator that happens
to work on references.  Otherwise, it'll be easy to create a syntax that
duplicates all sorts of crazy things.

> ColumnLable             ::=     TODO
>
> RowLabel                ::=     TODO
>
> RangeReference          ::=     CellReference ':' CellReference |
>                                 '[' RangeAddress ']' |
>                                 NamedRangeReference
>         TODO: whitespace if range operator with name,
>         but no whitespace if with cell addresses.
>   

Interesting.  Like OpenFormula, this means both [.A1:.A2] and 
[.A1]:[.A2] are
legal.  I think the latter is not currently accepted by OOo2, but we 
need it to
support some capabilities.

> RangeAddress            ::=     CellAddress ':' CellAddress
>
> CellReference           ::=     '[' CellAddress ']' | NamedCellReference
>
> CellAddress             ::=     SheetName? '.' ColumnName RowNumber
>
> If a CellAddress points to a sheet or column or row that got deleted,
> the corresponding part of the address is set to '#REF' instead. For
> example #REF.A1 was referring a now deleted sheet.
>
>
> ColumnName              ::=     [a-zA-Z]+
>
> Column names are A..Z, AA..ZZ, AAA..ZZZ, ...
>   

OpenFormula _mandates_ uppercase; there's really no need to allow 
variance.  EVERYBODY sends column names in uppercase, and doing this 
allows detection of problems (like variable names accidentally 
considered to be cell addresses).


> NameIdentifier          ::=     Identifier - CellAddress - RangeAddress
>   

Too harsh.  QTR4 is a CellAddress, and I expect spreadsheets to continue 
to widen over the years.


> TODO: operator precedence
>   

OpenFormula proposed an operator precedence hierarchy.  This is easier 
to do if you consider operators like "!" and ":" to be simply infix 
operators.

> Data Types in Parameters and Return Values
>
> NumericValue            ::= Number
>
> DateSerial              ::= NumericValue
>   

We need these, but is there a need for a special syntactic 
representation here? I don't know of a reason to do so.


--- David A. Wheeler
Follow-Ups:
- Re: [office-formula] Grammar
  - From: Eike Rathke <erack@sun.com>
References:
- Grammar
  - From: Eike Rathke <erack@sun.com>