Subject: Re: [office-formula] Grammar
From: Eike Rathke <erack@sun.com>
To: office-formula@lists.oasis-open.org
Date: Thu, 2 Mar 2006 21:13:54 +0100

Hi David,

On Wed, Mar 01, 2006 at 16:12:59 -0500, David A. Wheeler wrote:


> Thanks for posting this! Here are a few comments trying to compare this 
> with OpenFormula. Hopefully we can combine the features of these 
> grammars to create the world's best grammar :-).

Sure, that's what it is meant for :)


> >Definition of the Formula Attribute
> >
> >FormulaContent          ::=     Namespace Formula
> >  
> 
> In OpenFormula the namespace was considered external to the syntax, but 
> that's really a presentational nit; the result is identical.  Since ODF 
> mandated this, that's not surprising :-).

The namespace is part of the attribute's content, therefore I thought it
should be part of the grammar here. After all, the grammar will define
one possible content using one specific namespace; other
contents/grammars not following the ODFF approach may have, and already
do have, different namespaces, for example 'oooc' and 'ooow'.

> Note that there are LOTS of other places where ODF uses formula-like 
> expressions; I'd like to be able to expand and cover those too, or at 
> least make it easy for implementations to reuse code for them.

We should concentrate on this one attribute. Once spreadsheet needs are
defined, other applications/usages MAY follow the grammar in other
formula-like attributes, if ODF allows it.


> >Formula                 ::=     '=' '='? S* Expression S* Expression*
> >
> >If a second '=' is present, the formula has to be recalculated whenever
> >one of its predecessors changes value. This can be used to force formula
> >cells to be recalculated that contain calls to macros or AddIns with
> >side effects. If no second '=' is present, the cell can be recalculated at 
> >any time when needed.
> >  
> This second "=" is odd; I don't see the value of this capability.  Could 
> you enlighten me?
> 
> In my mind, shouldn't implementations figure out when to recalculate, 
> rather than trying to embed this in the syntax?  In most implementations 
> I'm aware of, formulas are normally ALWAYS recalculated if their 
> predecessors change (assuming automatic recalc is on).

Like most other applications, OOo Calc only recalculates a formula cell
when its result is accessed, e.g. when the cell is displayed or the
result is needed by another cell that is being calculated. Parts of the
document may also get recalculated during idle phases. Until then,
dependent cells may be queued in a "dirty" state. The '==' forced
calculation ensures that specific cells are recalculated as soon as one
of their predecessors changes, not only when their result is needed, for
example if they contain a log-writing macro function call in a "real
time" environment.
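
A minimal sketch of the difference between lazy and forced ('==')
recalculation. This is not OOo Calc's implementation; the Cell class, the
set_value function and the "forced" flag are hypothetical names chosen
only for illustration.

```python
class Cell:
    """A cell that is recalculated lazily unless it is marked as forced."""

    def __init__(self, name, compute=None, inputs=(), forced=False, value=None):
        self.name = name
        self.compute = compute          # callable taking the input values
        self.inputs = list(inputs)      # predecessor cells
        self.forced = forced            # True models a '==' formula
        self.dependents = []
        self.dirty = compute is not None
        self.value = value
        self.runs = 0                   # how often the formula was evaluated
        for cell in self.inputs:
            cell.dependents.append(self)

    def get(self):
        """Return the result, recalculating only if the cell is dirty."""
        if self.dirty:
            self.value = self.compute(*[c.get() for c in self.inputs])
            self.runs += 1
            self.dirty = False
        return self.value


def set_value(cell, value):
    """Change a value cell, mark dependents dirty, recalculate forced ones."""
    cell.value = value
    forced, seen, stack = [], set(), list(cell.dependents)
    while stack:
        dep = stack.pop()
        if id(dep) in seen:
            continue
        seen.add(id(dep))
        dep.dirty = True                # lazy cells just wait in this state
        if dep.forced:
            forced.append(dep)
        stack.extend(dep.dependents)
    for dep in forced:                  # '==' cells are evaluated right away
        dep.get()


log = []
a1 = Cell("A1", value=1)
lazy = Cell("B1", lambda x: x * 2, [a1])                              # '='
logger = Cell("C1", lambda x: log.append(x) or x, [a1], forced=True)  # '=='
set_value(a1, 5)
print(lazy.runs, logger.runs, log)
```

After the change to A1, the lazy cell has not been evaluated at all, while
the forced cell has already written its log entry.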

> But if you have 
> manual recalc on, then this second "=" shouldn't have an effect anyway.  

Sure, then it doesn't matter.


> >WhiteSpace (S)
> >S                       ::=     #x20
> >  
> Many spreadsheet implementations, including Excel, also allow newline 
> (\n).  Obviously in an XML attribute newlines are converted into their 
> XML-escaped form.

Ah, true, forgot about that one. An application should be able to
preserve it as long as the formula isn't edited, but it shouldn't be
required to actually let it act like a newline.

Looking at OpenFormula 7.7 Whitespace, it also mentions carriage return
and tab as whitespace characters. Which applications allow them?


> >Expression              ::=     Number |
> >                                String |
> >                                Array |
> >                                PrefixOp S* (Expression - String) |
> >                                (Expression - String) S* PostfixOp |
> >                                Expression S* InfixOp S* Expression |
> >                                '(' S* Expression S* ')' |
> >                                FunctionName S* '(' S* ParameterList? S* 
> >                                ')' |
> >                                Reference |
> >                                NamedExpression
> >  
> OpenFormula's _presentation_ of spaces is different, but the result 
> APPEARS to be the same.

There's a subtle difference with Natural Language Formula (NLF)
intersections, where the space is the intersection operator. That is the
case in the UI, from which I derived the grammar. The stored operator
could of course be something else; however, it should differ from the
standard '!' intersection operator so that it remains distinguishable
for UI representation.


> I disagree with the use of (Expression-String) everywhere, though.  
> OpenFormula doesn't do that.  Strings are expressions, and should be 
> treated as expressions with data type string. If a particular OPERATOR 
> doesn't like the string type, that's a data type error... not a syntax 
> error.

Agreed.


> The OpenFormula syntax uses the term "formula_variable" instead of 
> "NamedExpression", and in comparison I think that's a weakness of 
> OpenFormula's naming.   I think "NamedExpression" is a far more accurate 
> name for that.

Thanks.


> >Number                  ::=     [0-9]+ ('.' [0-9]+)? ([eE] [-+]? [0-9]+)?
> >
> OpenFormula has the same idea.  It requires the ability to READ 
> leading-"." numbers, and includes the syntax.

Btw, for readability we should use spaces in the EBNF to separate
elements, and use single quotes for literals. I actually find my version
much easier to read than OF's [0-9]+(\.[0-9]+)?([eE][+-]?[0-9]+)?
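
For illustration, the Number production quoted above can be transcribed
into a regular expression, here with Python's re.VERBOSE flag so that the
elements stay separated by spaces as in the EBNF. This is only a sketch;
the names NUMBER and is_number are made up for the example.

```python
import re

# Number ::= [0-9]+ ('.' [0-9]+)? ([eE] [-+]? [0-9]+)?
NUMBER = re.compile(
    r"""
    [0-9]+                    # integer part, at least one digit
    (?: \. [0-9]+ )?          # optional fraction
    (?: [eE] [-+]? [0-9]+ )?  # optional exponent
    """,
    re.VERBOSE,
)


def is_number(text):
    """True if the whole text matches the Number production."""
    return NUMBER.fullmatch(text) is not None


for sample in ["42", "3.14", "1e-5", "6.02E+23", ".5", "1.", "-1"]:
    print(sample, is_number(sample))
```

Note that ".5" is rejected by this production as written, and that "-1" is
not a Number either, because the sign is a PrefixOp.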


> >String                  ::=     '"' ([^"#x00-#x1f] | '""')* '"'
> >
> >A literal double-quote character (") as string content is escaped by
> >duplicating it. All content is UTF-8 encoded.
> >Note that since the formula is stored as an XML attribute, all
> >double-quotes are written as their entity &quot;
> >  
> I don't think we want to mandate UTF-8 encoding... that's really the 
> business of the enclosing XML.

Hmm.. true.
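
To make the two escaping layers concrete, here is a small sketch: the
formula layer doubles a literal double-quote inside a String, and the XML
layer then writes '"' and '&' as entities. The function names are
hypothetical, and the sketch does not check the control character
restriction (#x00-#x1f) of the String production.

```python
def quote_string(text):
    """Build a formula String literal: double each embedded double-quote."""
    return '"' + text.replace('"', '""') + '"'


def unquote_string(literal):
    """Reverse quote_string; reject literals with an unpaired double-quote."""
    if len(literal) < 2 or literal[0] != '"' or literal[-1] != '"':
        raise ValueError("not a string literal")
    inner = literal[1:-1]
    # After removing all doubled pairs, no double-quote may be left over.
    if '"' in inner.replace('""', ''):
        raise ValueError("unescaped double-quote in string literal")
    return inner.replace('""', '"')


def to_attribute_value(formula):
    """Escape a formula for a double-quoted XML attribute ('&' goes first)."""
    return (formula.replace("&", "&amp;")
                   .replace("<", "&lt;")
                   .replace('"', "&quot;"))


literal = quote_string('say "hi"')
print(literal)
print(unquote_string(literal))
print(to_attribute_value('="a"&"b"'))
```

The '&' has to be replaced before the other characters, otherwise the
ampersands of the freshly written entities would be escaped a second time.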


> >Array                   ::= TODO, which separators?
> >  
> OpenFormula does this: "Here semicolon is used as the separator between 
> values in a row (again, so different locales will have a simpler time 
> entering data when entering data), and the pipe symbol "|" is proposed 
> as the symbol separating rows (with absolutely no precedent)."

Both sound reasonable. However, OF's "7.8.7 Extension: In-line arrays"
talks about matrix_row and column_separator. This doesn't hold. The
"row" is a one-dimensional vector instead, and whether it gets used as
a row or a column may depend on the function taking it as an argument.


> It also supports multi-arrays with "_"... I don't remember that being IN 
> there, so it probably should come out :-).

Nice idea though :)


> >PrefixOp                ::=     '+' | '-'
> >
> >Unary operators.
> >  
> 
> OpenFormula calls these "unary_op"... they should probably be called 
> "PrefixOp" since a postfix op is ALSO a unary op.

That's what I thought..


> >InfixOp                 ::=     ArithmeticOp | ComparisonOp | '&'
> >
> >The '&' ampersand is the string concatenation operator.
> >Note that since the formula is stored as an XML attribute, an '&'
> >ampersand is written as the entity &amp;
> >  
> This treats cell intersection as NOT an infix operator, which I think is 
> suboptimal.  By trying to separate the cell intersection operator 
> syntactically, precedence is dealt with nonuniformly.

By excluding the intersection I wanted to express the fact that Number
or String or Array expressions don't make sense as intersection
operands. This could of course also be left as a data type error again.
On the other hand, it makes things clear right from the start.


> >Parameter               ::=     Expression | ReferenceList
> >  
> Why this distinction?  Shouldn't a "referenceList" be an expression?

It can't be any expression, only a list of references, where a single
reference can of course be an expression that results in a reference.


> >ReferenceList           ::=     '(' S* Reference ( S* ';' S* Reference )*
> >                                S* ')'
> >
> >A ReferenceList as one argument is only accepted by spreadsheet
> >functions that handle a cell range at this parameter place.
> >  
> 
> Is this a "cell union"?

Yes. In fact the cell "union" is not a union, but a list instead.
A union would unify overlapping ranges, the list does not. (A1:A3;A2)
evaluates A2 twice.
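
The list-versus-union distinction can be sketched as follows. The helper
names are hypothetical, and the sketch assumes plain uppercase addresses
such as A1 without sheet names, '.' or '$', and ranges whose corners are
given top-left first.

```python
import re

CELL = re.compile(r"([A-Z]+)([0-9]+)")


def column_index(name):
    """A -> 1, Z -> 26, AA -> 27, ..."""
    index = 0
    for ch in name:
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index


def column_name(index):
    """Inverse of column_index: 1 -> A, 27 -> AA, ..."""
    name = ""
    while index > 0:
        index, rem = divmod(index - 1, 26)
        name = chr(ord("A") + rem) + name
    return name


def expand(ref):
    """Expand 'A1' or 'A1:B2' into its cell addresses, row by row."""
    first, _, last = ref.partition(":")
    c1, r1 = CELL.fullmatch(first).groups()
    c2, r2 = CELL.fullmatch(last or first).groups()
    cols = range(column_index(c1), column_index(c2) + 1)
    rows = range(int(r1), int(r2) + 1)
    return [f"{column_name(c)}{r}" for r in rows for c in cols]


def reference_list(text):
    """'(A1:A3;A2)' -> every referenced cell in order, duplicates kept."""
    text = text.strip()
    if not (text.startswith("(") and text.endswith(")")):
        raise ValueError("a ReferenceList must be enclosed in parentheses")
    cells = []
    for ref in text[1:-1].split(";"):
        cells.extend(expand(ref.strip()))
    return cells


cells = reference_list("(A1:A3;A2)")
print(cells)            # the list: A2 appears twice
print(sorted(set(cells)))  # a true union would contain A2 only once
```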

> If so, the double-use of ";" to separate function
> parameters AND to make a union is awkward.

The ReferenceList is always enclosed in '(' ')' parentheses with no
preceding identifier, to distinguish it from a list of function
parameters. Same as in Excel UI ...


> >Reference               ::=     CellReference |
> >                                RangeReference |
> >                                Intersection |
> >                                ColumnLable |
> >                                RowLable
> >
> >Intersection            ::=     Reference S* '!' S* Reference |
> >                                ColumnLable S+ RowLable |
> >                                RowLable S+ ColumnLable
> >  
> 
> Here the syntax treats "!" specially.  I think it'd be easier to keep the
> syntax uniform if we just treat "!" as yet-another-operator that happens
> to work on references.  Otherwise, it'll be easy to create a syntax that
> duplicates all sorts of crazy things.

Actually intersection _is_ a crazy thing, if you take NLFs with
ColumnLable and RowLable into account..


> >RangeReference          ::=     CellReference ':' CellReference |
> >                                '[' RangeAddress ']' |
> >                                NamedRangeReference
> >        TODO: whitespace if range operator with name,
> >        but no whitespace if with cell addresses.
> >  
> 
> Interesting.  Like OpenFormula, this means both [.A1:.A2] and
> [.A1]:[.A2] are legal.  I think the latter is not currently accepted
> by OOo2, but we need it to support some capabilities.

Yes, the ':' flexible range operator isn't supported by OOo yet, it's on
the TODO list.


> >ColumnName              ::=     [a-zA-Z]+
> >
> >Column names are A..Z, AA..ZZ, AAA..ZZZ, ...
> >  
> 
> OpenFormula _mandates_ uppercase; there's really no need to allow 
> variance.  EVERYBODY sends column names in uppercase, and doing this 
> allows detection of problems (like variable names accidentally 
> considered to be cell addresses).

I don't see how mandating uppercase could prevent a clash with
a NameIdentifier of the same name, which can contain both upper and
lower case. Note that although case is preserved, names of
NameIdentifiers are compared case-insensitively.


> >NameIdentifier          ::=     Identifier - CellAddress - RangeAddress
> >  
> 
> Too harsh.  QTR4 is a CellAddress, and I expect spreadsheets to continue 
> to widen over the years.

Ok, wishful thinking. Ensuring that a NameIdentifier doesn't clash with
a CellAddress is a runtime requirement. An application with ZZZ columns
reading a document stored by a ZZ columns application that uses a QTR4
NameIdentifier MUST make sure that QTR4 is taken as a NameIdentifier and
not a CellAddress. However, it should not allow the user to create
a NameIdentifier that could be a CellAddress. This results in
a requirement of an evaluation order during the formula compile process:
evaluate NameIdentifier before CellAddress.
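
A sketch of that evaluation order, using the QTR4 example. All names here
(resolve, may_define_name, max_column) are hypothetical, row limits are
ignored, and addresses are assumed to be plain ones like QTR4 without
'.' or '$'.

```python
import re

CELL_ADDRESS = re.compile(r"([A-Za-z]+)([1-9][0-9]*)")


def column_index(name):
    """A -> 1, Z -> 26, AA -> 27, ..., ZZ -> 702, ZZZ -> 18278."""
    index = 0
    for ch in name.upper():
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index


def looks_like_cell_address(token, max_column):
    """True if token is a cell address within this application's columns."""
    match = CELL_ADDRESS.fullmatch(token)
    if match is None:
        return False
    return column_index(match.group(1)) <= column_index(max_column)


def resolve(token, defined_names, max_column):
    """Check defined names first, and only then try a cell address."""
    if token.upper() in {name.upper() for name in defined_names}:
        return "NameIdentifier"        # names compare case-insensitively
    if looks_like_cell_address(token, max_column):
        return "CellAddress"
    return "unknown"


def may_define_name(token, max_column):
    """The UI should refuse new names that could be a cell address."""
    return not looks_like_cell_address(token, max_column)


# Document written by a ZZ-column application, read by a ZZZ-column one:
print(resolve("QTR4", {"QTR4"}, "ZZZ"))   # stays a NameIdentifier
print(resolve("QTR4", set(), "ZZZ"))      # without the name: a CellAddress
print(may_define_name("QTR4", "ZZ"), may_define_name("QTR4", "ZZZ"))
```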


> >TODO: operator precedence
> >  
> 
> OpenFormula proposed an operator precedence hierarchy.  This is easier 
> to do if you consider operators like "!" and ":" to be simply infix 
> operators.

Seen that. We'll have to add the NLF intersection operator.


> >Data Types in Parameters and Return Values
> >
> >NumericValue            ::= Number
> >
> >DateSerial              ::= NumericValue
> >  
> 
> We need these, but is there a need for a special syntactic 
> representation here? I don't know of a reason to do so.

Defining those once might ease the definitions of functions. Again, this
can be moved elsewhere, e.g. to the data type definitions, and doesn't
need to be in the basic syntax.


  Eike