office-formula message

Subject: Re: [office] Syntax Comments (Weir)
From: "David A. Wheeler" <dwheeler@dwheeler.com>
To: office-formula@lists.oasis-open.org
Date: Mon, 21 Aug 2006 19:16:16 -0400
Here are my responses to Robert Weir's very helpful comments on the
syntax section.  Please dive in with your comments, if any, so we have
the best possible spec.

robert_weir:
> 5.1 --   We seem to say first that the namespace is optional, but then 
> that an application should not include this namespace, as it is 
> unnecessary.  I probably don't disagree with this section, just a 
> little confused.
Generally, a leading "=" is all you need.  But since OTHER formats will
use a namespace, it seems inconsistent to not have a name for the
default formula format too, which is why it's worded that way.

Can you suggest a better wording?  Or do you think that the 
inconsistency is okay?

> Namespace_in_XML -- Are we specifically bringing in what the XML 
> Namespace grammar calls "PrefixedAttName"? 
>  (http://www.w3.org/TR/REC-xml-names/#NT-PrefixedAttName)
I need to check that.  Does someone else know that answer off the top of 
their head?
> 5.2 -- Forced Recalculation, can that be done in other ways, without 
> touching the syntax?  For example, some functions could be declared to 
> be transient, like RAND() giving different values on each recalc, 
> independently of any special marker.  If we can have some notion of 
> transient functions, then we could allow implementations to expose 
> that property to custom extensions functions as well, so these 
> functions could declare if they are free of side effects or not.
Many functions are volatile, but that's not the issue in view.
This solves a different problem, I believe. The problem as I understand it
is that some functions can pick off arbitrary cells without there
being an obvious dependency, so it's helpful to be able to force a recalc
that wouldn't be determined from the usual dependency map.

Eike: Since you're the original proposer of this, can you step in to explain
this better than I have?

> 5.3 Constant numbers -- the BNF does not seem to allow negative 
> numbers?  Is there a good reason for making implementations use the 
> prefix - operator to simulate this?
Yes, because this way makes the BNF very clean.  Many other language
definitions do this too (I know Ada95 does).  You risk messing up the
precedence rules if you try to handle "-" in the number lexical
processing; it's easy to get things wrong.
I know that Microsoft Works has some screwed-up processing because of
this; unary "-" has different precedence, depending on whether or not
it's in front of a constant number.  Info here:
 http://www.burns-stat.com/pages/Tutor/spreadsheet_addiction.html

Excel and OOo treat the precedence of unary "-" the same way whether it's
in front of a constant or not; this is easier to ensure if you don't try to
special-case it in the syntax.
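
To make the point concrete, here is a minimal sketch in Python (not
text from the spec).  The tokenizer only ever produces unsigned
numbers, and unary "-" is handled in exactly one grammar rule, so a
constant and a cell reference necessarily get the same treatment.  The
names tokenize and evaluate, the tiny cell-reference pattern, and the
choice to make unary "-" bind tighter than "^" are all assumptions made
for illustration only.

```python
import re

# Numbers are unsigned: the pattern has no leading "-" anywhere.
TOKEN = re.compile(r"\s*(\d+(?:\.\d+)?|[A-Za-z]+\d+|[-+*/^()])")

def tokenize(text):
    """Split a formula body into tokens; "-" is always its own token."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError("unexpected input at position %d" % pos)
        tokens.append(m.group(1))
        pos = m.end()
    return tokens

def evaluate(text, cells=None):
    """Recursive-descent evaluator; 'cells' maps references to values."""
    cells = cells or {}
    tokens = tokenize(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def primary():
        tok = take()
        if tok == "(":
            value = additive()
            take()  # consume the closing ")"
            return value
        if tok[0].isdigit():
            return float(tok)
        return cells[tok]  # assumed to be a cell reference

    def unary():
        # The ONLY place unary "-" is handled (assumed to bind tighter
        # than "^" in this sketch).
        if peek() == "-":
            take()
            return -unary()
        return primary()

    def power():
        value = unary()
        while peek() == "^":
            take()
            value = value ** unary()
        return value

    def multiplicative():
        value = power()
        while peek() in ("*", "/"):
            if take() == "*":
                value = value * power()
            else:
                value = value / power()
        return value

    def additive():
        value = multiplicative()
        while peek() in ("+", "-"):
            if take() == "+":
                value = value + multiplicative()
            else:
                value = value - multiplicative()
        return value

    return additive()
```

With this structure, evaluate("-2^2") and evaluate("-A1^2", {"A1": 2.0})
cannot disagree, because the lexer never sees a difference between the
two minus signs.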

> 5.4 - Constant strings -- Why do we escape characters by doubling? 
>  XML conventions are to use character entities, which lend themselves 
> better to XML tools.  I guess I'm wondering if embedded quotes is 
> really a format issue, or a application UI issue?  If it is just an 
> application UI issue, then I'd favor having the format just use 
> character entities like &quot;
They DO use character entities.  But that's not enough, because we have
an indirection that needs handling.  All of these values are stored in
attributes, so we already need &quot; etc. just to prevent "end of
attribute" problems.
Consider the case where the formula is:
   ="I am a string"
That is represented in the XML this way:
  table:formula="=&quot;I am a string&quot;"

Now I want to insert a double-quote in the string. We can't do this:
 table:formula="=&quot;I am a string with a " character &quot;"
because we'll prematurely end the XML attribute, and we can't do this:
 table:formula="=&quot;I am a string with a &quot; character &quot;"
because we'll prematurely end the string -
it's going to look at the word "character" and expect an operator there.

Here's what this syntax does instead - the formula is
 ="I am a string with a "" character "
which is then represented in XML as:
 table:formula="=&quot;I am a string with a &quot;&quot; character &quot;"

This is very regular and easy to parse.  It's also a pretty
common solution to the problem.  I believe this is what OOo and KSpread
do already, in fact, so it's consistent with current practice.
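
The two layers can be sketched in a few lines of Python.  This is only
an illustration of the layering described above, not code from any
implementation; the helper names formula_string_literal and
as_xml_attribute are made up for this example.  The first step doubles
quotes at the formula level, the second applies ordinary XML attribute
escaping on top.

```python
from xml.sax.saxutils import escape

def formula_string_literal(text):
    """Formula layer: wrap in quotes, doubling any embedded quote."""
    return '"' + text.replace('"', '""') + '"'

def as_xml_attribute(formula):
    """XML layer: escape &, <, > and " so the attribute stays intact."""
    return 'table:formula="' + escape(formula, {'"': "&quot;"}) + '"'

# The string value from the example above, with one embedded quote.
formula = "=" + formula_string_literal('I am a string with a " character ')
attribute = as_xml_attribute(formula)
```

Here formula is the text a formula parser sees, and attribute is what
gets written to the file; reading reverses the two steps in the opposite
order (XML unescaping first, then un-doubling inside the string token).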

> 5.5 - Operators -- "Multiply, divide. Division does not truncate, so 
> 1/2 is equal to 0.5." -- this statement belongs elsewhere, I think. 
>  Maybe an operator semantics section?
It's already in the operator definition of "/".  We could certainly
remove it here.

>   Also, the table text sometimes uses the term "priority".  Is this 
> the same as "precedence"?  If so, we might want to use "precedence" 
> throughout.
You're right.  Good point.

> Rather than saying that precedence can be overridden by using 
> parentheses, how about just assigning precedence to ()?  Call it the 
> "grouping operator", at highest precedence.  That is what C/C++ does
Either way works.  However, I don't think any spreadsheet documentation
refers to parentheses as "grouping operators" - all docs I've seen say
()s OVERRIDE precedence, not that they have the highest precedence.
So I think the way it's documented here is more consistent with the
expectations of spreadsheet users.  Also, if you use lex/yacc, you end
up creating expressions the way it's shown here.  So I have a mild
preference for the current format.
It certainly doesn't matter from a technical view.


> 5.6 -- Is there a reason why predefined function names are limited to 
> ASCII characters?  In practice this may be true, as a fact.  But do we 
> lose anything by removing that restriction?  I'm thinking especially 
> of UOF/ODF work going forward, where there may be predefined functions 
> in Chinese characters.
Well, it's nice to be consistent, but that's the only reason.

Unless someone objects, let's transform this into a non-normative
(and normally unprinted) note.

> 5.7 -- I'd call this section "Implementation Extension Functions" or 
> something like that.  The fact that this section is in the Standard 
> means it cannot be Nonstandard <g>.
Sure it can.  Many standards have requirements for how to include
non-standard extensions, which is what this is.  The IETF RFCs for
email describe the "x-" convention, for example.
>  The issue is not that we have a function using  a nonstandard name, 
> it is that the function is specified.  I would merely state that the 
> extensions functions must be namespace prefixed, and should be 
> globally unique, with a suggestion that they use the host-name 
> conventions indicated.  I don't think we want those naming conventions 
> as a "must".  If we say, "This prefix must begin with a domain name 
> owned by the definer" then we lock out those who do not own a domain, 
> or anonymously defined functions.

I think it'd better be at least a "should".  Here's the problem: If
someone includes a NON-prefixed function name that isn't standard, then
it's going to cause trouble when a later function is defined with that
name but does something different.
Every app that fails to include the prefix could cause trouble for all
the other apps.  And if they don't use DNS, they could accidentally
stomp on each other; only by using a consistent prefix convention can
we be assured that they won't stomp on each other.

Domain name registration is cheap and getting cheaper; I don't think
that's a big limitation.

> "Applications that do not support a function should compute its result 
> as some Error value other than NA() when calculating its result." -- I 
> think we want to be more specific than that, choosing a specific error 
> value and making it a "must".

There are no other specific error values, therefore there's nothing to be
more specific about unless we change that too.  Which brings us to...

> 5.11 - How do we avoid defining a mandated set of constant error values?
We already do.  Nothing in the specification mandates a list.

> Spreadsheets documents store not only the formulas, but also the 
> last-calculated value at the time of saving.
True, but that's not the point of this section.  This doesn't necessarily
have anything to do with the calculated result.
>  These calculated values are used by some tools, like light weight 
> viewers, full-text indexers, script to convert ODF to XHTML, etc., 
> that do not include a calculation engine with them.  So if we want to 
> support a full-text engine that can find all ODF spreadsheets in a 
> document repository that contain divide by zero errors (a reasonable 
> use case), then we will need to require specific constant error values.
No, I don't think that will do much good.
The syntax section on errors here has nothing to do
with the OUTPUT of errors... it's only about forced INPUT of errors.
The syntax section here only describes how to include in-line errors, e.g.,
where you want to FORCE an error value without using a function.
This is NOT a common case, but since there are documents that use it,
we need a representation for it.

For tools like what you want, you want to monitor the RESULTS (output)
of formulas, presumably as they get calculated.  Having a standard
representation for errors in text results won't help much, because
formulas that use ISERROR() or ISERR() trivially hide such things from
a naive text-only tool.
And mixing of errors is another issue; if you're searching for one
error occurring, seeing only the final result of a function may hide
the very error you're seeking.
I don't think for your use case you'd be able to trust the results with
a trivial text-only tool.


>  So, I would make the values in the table be "must" and call out in 
> the later function semantics exactly which error value is returned for 
> which error conditions.
I do not think we'll get unanimity on such a list; we haven't before.
It's not even clear it's desirable; apps like OOo Calc have a much
richer error set that more clearly identifies a problem.

Even if we can define a full list, I
think that'll cause a 2-3 year delay in producing the spec, because of the
lengthy arguments over every one of the hundreds of functions.
There are lots of edge cases where it's not clear which error should
result.  E.g., there are a vast number of "overlapping" situations
where different errors can result, and even different versions of the
same application change which error they produce for a given input.
Which one should you mandate?
If you end up saying just "some error", then why try to BE that specific?

And this is stuff that users don't care about.
Few spreadsheet documents actually care WHICH error is the problem;
if it's an error, it's an error.  I believe users typically try to
create spreadsheets WITHOUT the dreaded error markers, or at most use
ISERR or ISERROR.
It'd be easy to get into dancing-on-the-head-of-a-pin arguments
defining things further.
ERROR.TYPE is not exactly common, and even THAT doesn't imply that
this is the ONLY set of errors... only that errors can be MAPPED to a
short list.

If in the future, there's enough convergence that a single list can then
be specified, let's do it then.  If you think my concerns won't actually
happen, please, let me know why!


> 5.13 -- "whitespace may not separate a function name from its initial 
> opening parentheses" -- why the restriction?  Is there any ambiguity 
> from having whitespace there?
No, not to my knowledge, directly.

The issue is disambiguating between named expressions and function
calls, because until you get to the "(" or non-"(" you don't know what
you have.  Requiring this means that if you hand-write your analyzer,
you don't have to put in a whitespace walker there.  It's not a big
deal.  But I think Excel's display syntax would have a problem if this
rule were relaxed, and it might be tricky to store whitespace if the
display format couldn't handle it.
So this is a mild concession to a common display format.
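
As a rough illustration of why the rule simplifies a hand-written
analyzer, here is a sketch in Python.  The identifier pattern and the
name classify_identifier are assumptions for this example only; the
spec's actual identifier syntax is not reproduced here.  The point is
that one character of lookahead is enough, with no whitespace skipping.

```python
import re

# Assumed identifier shape, for illustration only.
IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_.]*")

def classify_identifier(text, pos=0):
    """Return ("function", name) or ("named_expression", name)."""
    m = IDENT.match(text, pos)
    if m is None:
        raise ValueError("no identifier at position %d" % pos)
    name, end = m.group(0), m.end()
    # With the no-whitespace rule, the very next character decides it.
    if end < len(text) and text[end] == "(":
        return ("function", name)
    return ("named_expression", name)
```

Under this rule, "SUM(1;2)" starts with a function call, while "SUM (1)"
would start with a named expression called SUM.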

> Comments throughout -- there is a lot of good implementation advice 
> given in the text, along the lines of "implementations may do X Y and 
> Z in the UI, so long as the format written out is as specified".  This 
> is good info, but it is not normative, and does break the flow of the 
> specification a bit, in my opinion.  In the end, a user interface can 
> do whatever they want.

That's fair; we could definitely remove that text.
>  This is a file format specification.   I wonder if this 
> implementation advice could be moved into a non-normative appendix, 
> where it can be consolidated and made even better for that purpose?

A lot of that non-normative stuff is at the end of the syntax section,
in a note.


Thanks for the comments!  Let's keep them coming!

--- David A. Wheeler
References:
- Syntax Comments
  - From: robert_weir@us.ibm.com