[Date Prev] | [Thread Prev] | [Thread Next] | [Date Next] -- [Date Index] | [Thread Index] | [List Home]
Subject: Proposal: Improve definition of formula syntax
I'm delighted with much that I see in the Committee Draft 2, 21 Dec 2004, of the "Open Document Format for Office Applications (OpenDocument) 1.0". It looks like much is being refined. However, I'm still concerned that this specification fails to include enough detail to permit interchange of even simple spreadsheets with computed values. Section 8.1.3 tries to define the syntax of formulas, but while it gives some details, it does not give enough syntax details to permit interchange of trivial formulas between spreadsheet programs. Too much is left undefined, such as what are the set of operators and their representation (is "*" multiply? What's exponentiation?), precedence, number format (is "." or "," the decimal separator?), and so on. If each spreadsheet program chooses different incompatible representations, the result will be unnecessary incompatibilities. Thankfully, this is easily fixed, with a small addition of material in the existing section defining formulas (section 8.1.3). The attached file formula.sxw is about 2.5 pages of text that I believe solves the problem. You can see from the text itself why I think the current text is inadequate. Basically, there's a LOT of things that should be in the spec but weren't. I had to figure out things by editing Calc files, then unzipping them to see what they'd do. Let's help make sure others don't need to do that! For two programs to exchange spreadsheet functions, they must both implement the functions used by the spreadsheet, in the same way sufficient for that spreadsheet. However, I believe a large number of spreadsheets don't need complex support or many functions. I've seen many spreadsheets that could be handled quite nicely by a spreadsheet that only implemented simple calculations (+, -, *, / between cells with parentheses), the functions SUM(), AVERAGE(), MIN(), MAX(), COUNT(), and IF(..) with comparators. It'd be great if there was a standard for a large suite of functions to support interoperability, but the first step is to at least agree on the syntax of formulas and define just a few functions sufficient for these mimimal spreadsheets. So, that's what I've supplied here. It should be a simple cut & paste job, if you like it. Hopefully this would be enough for at least simple interoperation between OpenOffice.org, KOffice's spreadsheet program, Gnumeric, etc... and Excel if they ever implemented this. I urge the committee to consider this, and I hope you'll accept this change. This is not a change in scope of the document; it's simply a clarification of text ALREADY in the document. One comment I received earlier was a concern that this would define a "programming model", but I believe that that was a misunderstanding. This proposal does NOT define any particular programming model. This only defines a way to exchange formulas; the functions that the formulas invoke may be programmable, and the full formula language supported by an implementation might be a fully programmable language, but I'm treating all of that as out-of-scope. The purpose of this text is to more clearly specify what's already in the document, sufficient so you could actually exchange spreadsheets in it. The current text in 8.1.3 already defines a specific, concrete syntax; it has to, to provide any interoperability of formulas. It just doesn't give enough detail to actually meet its goal. I submitted an older version of this on November 1, 2004, but I didn't hear back any comments on it. I'm guessing that the comment was lost. So, I'm resubmitting this slightly modified version of that proposal. The only change is that here I don't even TRY to give a longer list of functions; I just list a very few functions that should be implemented, and leave the idea of defining a long list of functions to some future time (possibly a separate project). For your amusement, I've also attached a .y and .l file that I created. To create this text, I created programs in yacc & lex to make sure that the syntax was defined precisely, and then wrote the text to match... this makes the text far more precise. You don't need to DO anything with the .y and .l files, I'm not suggesting that they be in the document... but I thought you might find them of interest. Please let me know if this was received, and/or if there are objections I could respond to about it. --- David A. Wheeler
%{ #include <stdio.h> /* Formal specification of minimum formula syntax, in bison/yacc form. by David A. Wheeler. Notes: + See OpenOffice specification 22 March 2004, pg 171, section 8.1.3. + Note that the formula syntax used for exchange is NOT necessarily the format displayed to users. E.G., users might see "A1" displayed, meaning cell A1, but that'd be stored and exchanged as "[.A1]". This is perhaps unfortunate, since this means that the exchange format is always different than what is shown the user (beyond locale differences) + Ordinary precedence and parentheses are allowed; 3+2*5 is 13. + You can label cells or cell ranges with IDENTIFIERs, and then refer to them in the spreadsheet; simply use the IDENTIFIER. + Labels may refer to ranges of cells. Currently can't have label:label; that seems like an unnecessary weakness. + Function calls always include the parentheses, and can be given zero or more parameters. If there are parameters, none can be empty. RAND() is legal, and so is SUM(.A1;2;3). RAND is not a legal function call (need the parens). + Note: Parameter separators are semicolons, not commas. This is rather inconsistent with the rest of the world, because Commas are much more common separators for parameters. E.G., C, C++, C#, Java, Ada, Fortran. I presume that the UI uses semicolon so that users with locales that use "," as the decimal point won't get confused. This isn't strictly necessary here, since numbers are C locale, but there IS an advantage to having the internal syntax being similar to what's actually stored and received. But shouldn't the READERS at least understand "," since it's a likely mistake? + Function call names appear to be transformed to all upper case before being transferred, but readers should probably accept either case and NOT treat case as significant (so "sum" and "SUM" are the same). At least, I _think_ that's true - need to confirm. + The logical operators are IF(condition;if-true;if-false), AND(...), OR(...), NOT(x), TRUE(), and FALSE(). + OOo uses stronger typing than other spreadsheets, e.g., OR(1,2) isn't TRUE, but an error. Is that really a good idea, since it will probably impede interoperability and user expectations? + This grammar treats logical operators and function calls as identical, since they have the same syntax. However, I believe the logical operators IF(), AND(), and OR() must short-circuit, e.g., not do excess computational work. As long as there are no functions with side-effects it doesn't matter, but some user-defined functions in some languages COULD have side-effects. An implementation must operate as though it short-circuits, though it may actually compute in parallel as long as only side-effect-free operations are performed (it may not be possible to ensure that some user-defined functions are side-effect free). + Unary minus and plus are allowed, so "-[.B1]" is legal. Here they're handled by the grammar, not the lexer. + Comparison operations (x=y, etc.) return TRUE() or FALSE(). + The operators +,-,*,/,^ convert booleans on either side to a number (false->0, true->1). So (3>2)*5 is 5. */ void process(char *s); void yyerror(char *s); int yydebug=1; %} %token NUMBER STRING CELLRANGEADDRESSLIST IDENTIFIER /* Define precedence and assocation direction */ %left GE LE EQ NE '>' '<' '=' %left '+' '-' '&' %left '*' '/' /* You could argue to reverse the next two, or change to %left UMINUS */ %right '^' %nonassoc UMINUS %start formula %% formula: '=' expr ; parameter: expr {process("Function parameter");} ; nonempty_expr_list: parameter | nonempty_expr_list ';' parameter ; expr_list: /* empty - zero parameters okay. */ | nonempty_expr_list ; /* It's not valid to say IDENTIFIER : IDENTIFIER, but I think it should be. */ /* end_range: | ':' IDENTIFIER ; */ expr: NUMBER {process("Constant number");} | STRING {process("Constant string");} | CELLRANGEADDRESSLIST {process("Cell address(es)");} | IDENTIFIER '(' expr_list ')' {process("Function call");} | IDENTIFIER /* end_range */ {process("Labelled value");} | '-' expr %prec UMINUS {process("Unary minus");} | '+' expr %prec UMINUS {process("Unary plus");} | expr '+' expr {process("add");} | expr '-' expr {process("Subtract");} | expr '&' expr {process("String concatenation");} | expr '*' expr {process("Multiply");} | expr '/' expr {process("Divide");} | expr '^' expr {process("Power");} | expr '<' expr {process("Less-than?");} | expr '>' expr {process("Greater-than?");} | expr GE expr {process("Greater-than-or-equal-to?");} | expr LE expr {process("Less-than-or-equal-to?");} | expr NE expr {process("Not-equal?");} | expr EQ expr {process("Equal?");} | '(' expr ')' {} ; %% void process(char *s) { fprintf(stdout, "%s\n", s); } void yyerror(char *s) { fprintf(stdout, "%s\n", s); } int main(void) { yyparse(); return 0; }
%{ /* Formula lexer by David A. Wheeler. Notes: + Based on info in sections 8.1.3 and 8.3.1 + Numbers are stored in "C" locale ("." decimal separator, no thousands separator, etc.). Remember handle them like this or equivalent: old_locale = strdup (setlocale (LC_NUMERIC, NULL)); setlocale (LC_NUMERIC, "C"); + Numbers can begin with a leading ".", e.g., ".01" is legal. + Numbers can end in %, which divides that number by 100. (Note that 2+10% is 2.1, not 2.2) + Not handled: bases other than base 10. It might be nice to accept an Ada-like syntax, e.g. base#basednumber# where base is 2..36. Issue: if special codes ALSO begin with '#' that makes patterns slightly more interesting (though still doable). Alternatively, can use C's 0x (hex) and 0 (octal), but what about base 2? + Leading - and + signs are allowed (they're implemented here by grammar) + Constant Strings should be in UTF-8 (no need to encode as &#..., and DO NOT re-encode each byte). + Constant strings surrounded with double quotes. + In strings, to include a ", repeat " twice. Note that in the XML, all double-quotes are escaped as " + The "&" operator concatenates; in the XML, represented as & + Need to handle special values (#ERR, #NAN, #NA, etc.) + This doesn't accept "==" for equal, nor "!=" for "not equal to" A _reader_ should probably accept those (and ** for ^). + Tables can't have ' in their name, but need ' to surround other characters (use ' in the XML). + Need to define error values and special values. SHOULD allow #NA (or #N/A), #NV, #NAN (not a number). SHOULD allow #TRUE and #FALSE (instead of only TRUE() and FALSE()). + cellrangeaddresslist - Whitespace is otherwise ignored outside string constants, so it seems inconsistent to suddenly make whitespace meaningful here. + If identifiers are case-insensitive, should writers be told to write uppercase, lowercase, or doesn't matter? */ #include <stdlib.h> #include "y.tab.h" // extern void yyerror(char *message); %} simple_number [0-9]+(\.[0-9]+)?([eE][+-]?[0-9]+)?%? number ((\.[0-9]+)|([0-9]+(\.[0-9]+)?([eE][+-]?[0-9]+)?))%? string \"([^"]|\"\")*\" identifier [A-Za-z][A-Za-z0-9_]* celladdress ($?([^\. ']+|'[^']+'))?\.$?[A-Z]+$?[0-9]+ cellrangeaddress {celladdress}(\:{celladdress})? cellrangeaddresslist {cellrangeaddress}([ \t]+{cellrangeaddress})* %% {number} return NUMBER; {string} return STRING; {identifier} return IDENTIFIER; {cellrangeaddresslist} return CELLRANGEADDRESSLIST; [-+*/^()<>=.,@:\[\]\&;] return *yytext; /* Return the character */ ">=" return GE; "<=" return LE; "<>" return NE; /* Should != or == also be allowed? */ [ \t\n]+ ; /* ignore whitespace */ . yyerror("Unknown character"); %% int yywrap(void) { return 1; /* Halt processing at end of text */ }
[Date Prev] | [Thread Prev] | [Thread Next] | [Date Next] -- [Date Index] | [Thread Index] | [List Home]