office-comment message

Subject: Re: [office-comment] Re: [office] Next set of public comments forreview (#18-#35)

From: "David A. Wheeler" <dwheeler@dwheeler.com>
To: office-comment@lists.oasis-open.org, discoleo@gmx.net
Date: Tue, 03 Jun 2008 16:35:33 -0400 (EDT)

Leonard Mada:
> this is a comment to one of the threads on the restricted 
> office@lists.oasis-open.org list.
> I believe the discussion is too important and the current implementation 
> is too dangerous to keep the current state of affairs.

Happy to discuss, but I don't at all agree that the current implementation
is "too dangerous".  People have used it, and accepted it,
in spreadsheets for more than 20 years.
 
> > I believe #24 is done. This involved handling references to empty 
> > cells; new text in section 4 (Types) clarifies what happens in these 
> > cases.
> >
> > Basically, there isn't a special type called the empty type.
> 
> This goes strongly against ALL concepts of professional statistical 
> applications, where such values are given a special type/identifier, 
> namely these are "missing values", or shortly NA. (e.g. R, 
> http://cran.R-project.org, and others)

First, R is not a spreadsheet application, it's a statistical package.
It's spectacular at that purpose, but OpenFormula is _primarily_ for
use in spreadsheet applications. We want OpenFormula to be
_usable_ in other circumstances, but those must take second fiddle
to its use in spreadsheets.

I quite agree that there is a need for indicating
"not available", but the standard way to do that in spreadsheets
(going back to at least Lotus 1-2-3) is to use #N/A, and NOT merely
an empty cell.  The value #N/A is a distinguished Error value
returned by function NA().  So the mechanism for identifying important
missing values already exists.  Using empty cells to mean #N/A
is a dangerous practice; spreadsheets typically contain many empty
cells, and a viewer couldn't easily tell which empty cells were
"important".

> > Instead, there is a type "Reference", which _may_ refer to empty cells 
> > or ranges that include empty cells. What happens to empty cells 
> > depends on the required types and implicit conversion rules, as 
> > defined in section 4.
> 
> I hate implicit conversion rules. Namely, the conversion to (int/float) 
> 0 is mostly unneeded and unwanted.

We actually have FEWER implicit conversion rules compared to, say, OOXML.

Excel and OOXML require that _ANY_ reference to text in a number
context be FORCEABLY converted to a number.  That particular conversion
is fraught with peril, because it can fail when you change locale.
(e.g., if the string is "5.5", it will convert find in en_US but will fail in fr
because the decimal separator there is "," not ".").  And many spreadsheet
programs don't do that particular implicit conversion (e.g., Lotus 1-2-3, OOo).
Thankfully, few spreadsheets actually require this particular conversion.

But we have to handle _existing_ spreadsheet applications, and be usable
to ordinary users.  They DO expect several implicit conversions, and build
documents that require them.

What's more, most spreadsheet authors are NOT skilled statisticians.
We need to make the language relatively easy-to-use for
these less-skilled people.

Most languages have automatic conversions, because it's too painful
to use those languages without them.  This is no different, and in any case,
we're specifying _existing_ behavior which is necessary to exchange
existing spreadsheet documents.

> There are basically 2 useful approaches:
> A.) ignore the cell completely
> -- without raising any error
> -- without interpreting the cell as 0 or empty string
> -- should be probably the default

Looks like we agree completely, at least in the case where the
function expects a NumberSequence or the like.  If a function
expects a NumberSequence, it will ignore empty cells completely,
without interpreting the cell as 0 or the empty string.
That's already spec'ed.

If a function expects a _scalar_ value, though, and gets a
reference to an empty cell, what do we do?  Your first two
criteria conflict; either we raise an error, or we interpret it somehow
as a scalar (e.g., as a 0 or "").  We've decided to follow standard
convention, and convert.  It's what existing documents require, and
what existing users expect.

> B.) Interpret as an NA-Error and propagate the error downstream
> -- probably only in a minority of cases
> [e.g. debugging spreadsheets]

Again, sounds like we agree that these are minority cases.
You can actually do this today by using ISBLANK or COUNTBLANK, e.g.:
IF(COUNBLANK(Range)>0; #N/A; dostuff(Range))

> As I (probably) mentioned in some previous message [I can't find it, 
> though], R has a more advanced mechanism to handle missing values, and I 
> would welcome any improvement in this area regarding ODF.
> 
> IF a value is missing, one can pass additional arguments to R-functions, 
> which specify how to handle those missing values.

No existing spreadsheet application includes a mechanism for
additional named arguments.  Obviously there are languages which have it
(R, Ada, and Python come to mind).  But adding a major new
unprecedented mechanism to handle a rare case seems extreme.

If you want values to be considered as NA, then put #N/A in that cell.
Problem solved.  It's also MUCH clearer what's going on.

> So, the default in R is *na.omit* and this is what spreadsheets should 
> be doing, too. Omit the whole cell, don't include it in any 
> calculations, do NOT even evaluate the empty cell! Definitely, do not 
> evaluate it as "0"!

That's already the case when a NumberSequence or other Sequence is
the required type, and as I noted earlier, there really isn't a reasonable
alternative default when a scalar is required.

--- David A. Wheeler

Follow-Ups:
- Re: [office-comment] Re: [office] Next set of public comments forreview (#18-#35)
  - From: Leonard Mada <discoleo@gmx.net>
- Implicit conversion and medical data
  - From: Patrick Durusau <patrick@durusau.net>

References:
- Re: [office] Next set of public comments for review (#18-#35)
  - From: Leonard Mada <discoleo@gmx.net>