office-comment message

Subject: Proposal: Improve definition of formula syntax

From: "David A. Wheeler" <dwheeler@dwheeler.com>
To: Michael Brauer - Sun Germany - ham02 - Hamburg <Michael.Brauer@Sun.COM>
Date: Tue, 04 Jan 2005 00:36:38 -0500

I'm delighted with much that I see in the
Committee Draft 2, 21 Dec 2004, of the
"Open Document Format for Office Applications (OpenDocument) 1.0".
It looks like much is being refined.

However, I'm still concerned that this specification fails
to include enough detail to permit interchange of even simple
spreadsheets with computed values.  Section 8.1.3 tries to
define the syntax of formulas, but while it gives some details,
it does not give enough syntax details to permit
interchange of trivial formulas between spreadsheet programs.
Too much is left undefined, such as what are the
set of operators and their representation (is "*" multiply?
What's exponentiation?), precedence, number format
(is "." or "," the decimal separator?), and so on.
If each spreadsheet program chooses different
incompatible representations, the result will be
unnecessary incompatibilities.

Thankfully, this is easily fixed, with a small addition of
material in the existing section defining formulas (section 8.1.3).

The attached file formula.sxw is about 2.5 pages of text that
I believe solves the problem.  You can see from the text
itself why I think the current text is inadequate.
Basically, there's a LOT of things that should be in
the spec but weren't. I had to figure out things by editing
Calc files, then unzipping them to see what they'd do.
Let's help make sure others don't need to do that!

For two programs to exchange spreadsheet functions, they
must both implement the functions used by the spreadsheet,
in the same way sufficient for that spreadsheet.
However, I believe a large number of spreadsheets don't need
complex support or many functions.  I've seen many
spreadsheets that could be handled quite nicely by a spreadsheet
that only implemented simple calculations (+, -, *, / between cells
with parentheses), the functions SUM(),
AVERAGE(), MIN(), MAX(), COUNT(), and IF(..) with comparators.
It'd be great if there was a standard for a large suite
of functions to support interoperability, but the first
step is to at least agree on the syntax of formulas and
define just a few functions sufficient for these mimimal
spreadsheets.  So, that's what I've supplied here.
It should be a simple cut & paste job, if you like it.

Hopefully this would be enough for at least simple
interoperation between OpenOffice.org, KOffice's spreadsheet
program, Gnumeric, etc... and Excel if they ever
implemented this.

I urge the committee to consider this, and I hope you'll
accept this change.

This is not a change in scope of the document; it's simply
a clarification of text ALREADY in the document.
One comment I received earlier was a concern that this would
define a "programming model", but I believe that that was a
misunderstanding.  This proposal does NOT define any particular
programming model.  This only defines a way to exchange formulas;
the functions that the formulas invoke may be programmable,
and the full formula language supported by an implementation
might be a fully programmable language,
but I'm treating all of that as out-of-scope.
The purpose of this text is to more clearly specify what's
already in the document, sufficient so you could actually exchange
spreadsheets in it.  The current text in 8.1.3 already
defines a specific, concrete syntax; it has to, to provide
any interoperability of formulas.  It just doesn't give
enough detail to actually meet its goal.

I submitted an older version of this on November 1, 2004,
but I didn't hear back any comments on it.  I'm guessing
that the comment was lost.  So, I'm resubmitting this
slightly modified version of that proposal.
The only change is that here I don't even TRY to give a
longer list of functions; I just list a very few functions
that should be implemented, and leave the idea of defining
a long list of functions to some future time (possibly a
separate project).

For your amusement, I've also attached a .y and .l file
that I created.  To create this text, I created programs in
yacc & lex to make sure that the syntax was defined precisely,
and then wrote the text to match... this makes the text far
more precise.  You don't need to DO anything with the .y and .l
files, I'm not suggesting that they be in the document... but I
thought you might find them of interest.

Please let me know if this was received, and/or if there
are objections I could respond to about it.

--- David A. Wheeler

formula.sxw

%{
#include <stdio.h>
/* Formal specification of minimum formula syntax, in bison/yacc form.
 by David A. Wheeler.
 Notes:
 + See OpenOffice specification 22 March 2004, pg 171, section 8.1.3.
 + Note that the formula syntax used for exchange is NOT necessarily the
   format displayed to users.  E.G., users might see "A1" displayed, meaning
   cell A1, but that'd be stored and exchanged as "[.A1]".
   This is perhaps unfortunate, since this means that the exchange format
   is always different than what is shown the user (beyond locale differences)
 + Ordinary precedence and parentheses are allowed; 3+2*5 is 13.
 + You can label cells or cell ranges with IDENTIFIERs, and then refer
   to them in the spreadsheet; simply use the IDENTIFIER.
 + Labels may refer to ranges of cells. Currently can't have label:label;
   that seems like an unnecessary weakness.
 + Function calls always include the parentheses, and
   can be given zero or more parameters.  If there are parameters,
   none can be empty.  RAND() is legal, and so is SUM(.A1;2;3).
   RAND is not a legal function call (need the parens).
 + Note: Parameter separators are semicolons, not commas.
   This is rather inconsistent with the rest of the world, because
   Commas are much more common separators for parameters.
   E.G., C, C++, C#, Java, Ada, Fortran.
   I presume that the UI uses semicolon so that users with locales that
   use "," as the decimal point won't get confused.
   This isn't strictly necessary here, since numbers are C locale, but
   there IS an advantage to having the internal syntax being similar
   to what's actually stored and received.  But shouldn't the READERS
   at least understand "," since it's a likely mistake?
 + Function call names appear to be transformed to all upper case
   before being transferred, but readers should probably accept either
   case and NOT treat case as significant (so "sum" and "SUM" are the same).
   At least, I _think_ that's true - need to confirm.
 + The logical operators are IF(condition;if-true;if-false), AND(...),
   OR(...), NOT(x), TRUE(), and FALSE().
 + OOo uses stronger typing than other spreadsheets, e.g.,
   OR(1,2) isn't TRUE, but an error. Is that really a good idea, since
   it will probably impede interoperability and user expectations?
 + This grammar treats logical operators and function calls as identical,
   since they have the same syntax.  However, I believe the
   logical operators IF(), AND(), and OR() must short-circuit,
   e.g., not do excess computational work.
   As long as there are no functions with side-effects it doesn't matter,
   but some user-defined functions in some languages COULD have side-effects.
   An implementation must operate as though it
   short-circuits, though it may actually compute in parallel as long as
   only side-effect-free operations are performed (it may not be possible
   to ensure that some user-defined functions are side-effect free).
 + Unary minus and plus are allowed, so "-[.B1]" is legal.
   Here they're handled by the grammar, not the lexer.
 + Comparison operations (x=y, etc.) return TRUE() or FALSE().
 + The operators +,-,*,/,^ convert booleans on either side to a number
   (false->0, true->1).  So (3>2)*5 is 5.

 */
void process(char *s);
void yyerror(char *s);
int yydebug=1;
%}

%token NUMBER STRING CELLRANGEADDRESSLIST IDENTIFIER

/* Define precedence and assocation direction */
%left GE LE EQ NE '>' '<' '='
%left '+' '-' '&'
%left '*' '/'
/* You could argue to reverse the next two, or change to %left UMINUS */
%right '^'
%nonassoc UMINUS

%start formula

%%

formula: '=' expr ;

parameter: expr {process("Function parameter");} ;

nonempty_expr_list: parameter
	   | nonempty_expr_list ';' parameter ;

expr_list: /* empty - zero parameters okay. */
           | nonempty_expr_list ;

/* It's not valid to say IDENTIFIER : IDENTIFIER, but I think it should be. */
/* end_range: | ':' IDENTIFIER ; */

expr:   NUMBER                  {process("Constant number");}
        | STRING                {process("Constant string");}
        | CELLRANGEADDRESSLIST  {process("Cell address(es)");}
        | IDENTIFIER '(' expr_list ')' {process("Function call");}
        | IDENTIFIER /* end_range */ {process("Labelled value");}
        | '-' expr %prec UMINUS {process("Unary minus");}
        | '+' expr %prec UMINUS {process("Unary plus");}
        | expr '+' expr         {process("add");}
        | expr '-' expr         {process("Subtract");}
        | expr '&' expr         {process("String concatenation");}
        | expr '*' expr         {process("Multiply");}
        | expr '/' expr         {process("Divide");}
        | expr '^' expr         {process("Power");}
        | expr '<' expr         {process("Less-than?");}
        | expr '>' expr         {process("Greater-than?");}
        | expr GE expr          {process("Greater-than-or-equal-to?");}
        | expr LE expr          {process("Less-than-or-equal-to?");}
        | expr NE expr          {process("Not-equal?");}
        | expr EQ expr          {process("Equal?");}
        | '(' expr ')'          {}
        ;

%%

void process(char *s) {
    fprintf(stdout, "%s\n", s);
}

void yyerror(char *s) {
    fprintf(stdout, "%s\n", s);
}

int main(void) {
    yyparse();
    return 0;
}

%{
/* Formula lexer by David A. Wheeler.
   Notes:
   + Based on info in sections 8.1.3 and 8.3.1
   + Numbers are stored in "C" locale ("." decimal separator, no thousands
     separator, etc.).  Remember handle them like this or equivalent:
       old_locale = strdup (setlocale (LC_NUMERIC, NULL));
       setlocale (LC_NUMERIC, "C");
   + Numbers can begin with a leading ".", e.g., ".01" is legal.
   + Numbers can end in %, which divides that number by 100.
     (Note that 2+10% is 2.1, not 2.2)
   + Not handled: bases other than base 10.  It might be nice to accept
     an Ada-like syntax, e.g. base#basednumber# where base is 2..36.
     Issue: if special codes ALSO begin with '#' that makes patterns
     slightly more interesting (though still doable).
     Alternatively, can use C's 0x (hex) and 0 (octal), but what about
     base 2?
   + Leading - and + signs are allowed (they're implemented here by grammar)
   + Constant Strings should be in UTF-8 (no need to encode as &#..., and
     DO NOT re-encode each byte).
   + Constant strings surrounded with double quotes.
   + In strings, to include a ", repeat " twice.  Note that in the XML,
     all double-quotes are escaped as &quot;
   + The "&" operator concatenates; in the XML, represented as &amp;
   + Need to handle special values (#ERR, #NAN, #NA, etc.)
   + This doesn't accept "==" for equal, nor "!=" for "not equal to"
     A _reader_ should probably accept those (and ** for ^).
   + Tables can't have ' in their name, but need ' to surround other
     characters (use &apos; in the XML).
   + Need to define error values and special values.
     SHOULD allow #NA (or #N/A), #NV, #NAN (not a number).
     SHOULD allow #TRUE and #FALSE (instead of only TRUE() and FALSE()).
   + cellrangeaddresslist - Whitespace
     is otherwise ignored outside string constants, so it seems
     inconsistent to suddenly make whitespace meaningful here.
   + If identifiers are case-insensitive, should writers be told
     to write uppercase, lowercase, or doesn't matter?
*/
#include <stdlib.h>
#include "y.tab.h"
// extern void yyerror(char *message);
%}

simple_number	[0-9]+(\.[0-9]+)?([eE][+-]?[0-9]+)?%?
number	        ((\.[0-9]+)|([0-9]+(\.[0-9]+)?([eE][+-]?[0-9]+)?))%?
string          \"([^"]|\"\")*\"

identifier	[A-Za-z][A-Za-z0-9_]*

celladdress     ($?([^\. ']+|'[^']+'))?\.$?[A-Z]+$?[0-9]+
cellrangeaddress {celladdress}(\:{celladdress})?
cellrangeaddresslist {cellrangeaddress}([ \t]+{cellrangeaddress})*

%%

{number}         return NUMBER;
{string}         return STRING;
{identifier}     return IDENTIFIER;
{cellrangeaddresslist}  return CELLRANGEADDRESSLIST;

[-+*/^()<>=.,@:\[\]\&;] return *yytext; /* Return the character */

">="             return GE;
"<="             return LE;
"<>"             return NE; /* Should != or == also be allowed? */

[ \t\n]+         ;       /* ignore whitespace */

.               yyerror("Unknown character");

%%
int yywrap(void) {
    return 1;  /* Halt processing at end of text */
}