office-comment message

Subject: Re: [office-comment] Re: [office] Next set of public comments forreview (#18-#35)
From: Leonard Mada <discoleo@gmx.net>
To: office-comment@lists.oasis-open.org
Date: Wed, 04 Jun 2008 01:42:33 +0300
Dear David, dear list members,

I need to refine slightly my first post and clarify one aspect.

I said that I do NOT like implicit conversions. That is indeed in 
general true - but a more accurate statement is:
I do NOT like implicit conversions when *I cannot control* them.

I.e. I do not like when Excel or another spreadsheet program decides 
what is best for me, instead of letting me decide this. I want to have 
the option to control when and how this conversion is done.

The S+ language allows exactly this. You can omit the missing value, or 
you can raise an error. It is the user who decides, NOT a predefined 
function implementation.


David A. Wheeler wrote:
> Leonard Mada:
>   
> [...]
>> This goes strongly against ALL concepts of professional statistical 
>> applications, where such values are given a special type/identifier, 
>> namely these are "missing values", or shortly NA. (e.g. R, 
>> http://cran.R-project.org, and others)
>>     
>
> First, R is not a spreadsheet application, it's a statistical package.
>   

It is a statistical environment. ;-)

The point is, S+ is a mature language, while spreadsheets are not. I 
found too many errors during my audits and most errors are mainly due to 
spreadsheet flaws (and the spaghetti code these flaws encourage). I 
wrote some time back a small lecture/presentation on mistake proofing 
and I invite everyone to read it. You may have a laugh, too. [Some bits 
are indeed funny.]

The first part is a very short introduction, while the 2nd part 
describes some aspects of mistake proofing.

INTRO
http://ux.openoffice.org/servlets/ReadMsg?list=discuss&msgNo=1379

PRESENTATION PROPER
http://ux.openoffice.org/servlets/ReadMsg?list=discuss&msgNo=1380

[I did not have time to continue the series.]

So, I would encourage everyone to learn from such mature languages (S+, 
Matlab, ...) and apply this knowledge to spreadsheets, too. After 20 
years of stagnation, there is a desperate need to improve the 
spreadsheets and bring real benefit to end users.

I do not believe that OpenFormula brings any major benefits. Does it 
solve users problems?
No!

Specific functions might contribute to solving a more complex problem - 
in many intermediate steps - but the burden is still on the users. I 
really recommend auditing real users with real life problems. It opens 
ones eyes. I might be sometimes wrong, especially my solutions might not 
work as I intended, but I always point to true problems.

What I do want to emphasize is the need to address the major problems of 
spreadsheets, and not some minor ones. Otherwise, history will be 
unforgiving. I already experienced custom solutions replacing 
spreadsheets (with implementation costs in the millions of US $) - and 
there is no turning back.

> It's spectacular at that purpose, but OpenFormula is _primarily_ for
> use in spreadsheet applications. We want OpenFormula to be
> _usable_ in other circumstances, but those must take second fiddle
> to its use in spreadsheets.
>
> I quite agree that there is a need for indicating
> "not available", but the standard way to do that in spreadsheets
> (going back to at least Lotus 1-2-3) is to use #N/A, and NOT merely
> an empty cell.

I do agree here, but this was never put to discussion. What I said was, 
I do NOT want an empty cell to be interpreted as something it is NOT (as 
0), IF I do not want it to be interpreted as such.

IF, however, I want to detect errors in spreadsheets, I might want to 
detect formulas that reference empty cells, so I want the formula to 
evaluate to #NA. Referencing the empty cell could have been an error in 
the first place, because I copy/paste-ed the formula and forgot to 
correct the references (which happens fairly often even when I design a 
spreadsheet).

> The value #N/A is a distinguished Error value
> returned by function NA().  So the mechanism for identifying important
> missing values already exists.  Using empty cells to mean #N/A
> is a dangerous practice;

Well, but most data in academic research that was missing (measurement 
not done) was of type"empty cell". Sometimes "NA" or "ND" as strings, 
but most often empty cells. So, this situation exists, and it must be 
handled flexible.

> spreadsheets typically contain many empty
> cells, and a viewer couldn't easily tell which empty cells were
> "important".
>   

This was never put to discussion. I discussed the option to instruct the 
formula that evaluates an empty cell, to either omit the empty cell, or 
report an error. When I instruct it to report an error, I will know that 
something went wrong, when there is an error. That is the whole point of 
NA in the first place. [Other uses are much less common.]

>>> Instead, there is a type "Reference", which _may_ refer to empty cells 
>>> or ranges that include empty cells. What happens to empty cells 
>>> depends on the required types and implicit conversion rules, as 
>>> defined in section 4.
>>>       
>> I hate implicit conversion rules. Namely, the conversion to (int/float) 
>> 0 is mostly unneeded and unwanted.
>>     
>
> We actually have FEWER implicit conversion rules compared to, say, OOXML.
>
> Excel and OOXML require that _ANY_ reference to text in a number
> context be FORCEABLY converted to a number.

This is something I strongly disagree and consider that Excel does it 
better. It is true, whatever conversion is considered, it will be 
plagued with potential for errors. [Even IF you explicitly state that 
"," is the decimal separator, how do you know that the primary user, who 
wrote the number as a string, meant the "," as the decimal separator and 
not as something else (e.g. an enumeration)?!]

The true problem is on a completely different level - and unfortunately 
overlooked, as I already pointed out on the OOo mailing lists.

The real question is:
Why did the initial user enter numbers as strings?

Answering this question and solving the primary problem will completely 
remove the need for this dangerous conversion. IF this problem is not 
removed, than do it like Excel is even preferable to ignoring text (as 
currently happens in OOo). Raising explicitly an error is a different 
alternative; having a user option to choose between these 2 alternatives 
is still better.

Sincerely,

Leonard

> That particular conversion
> is fraught with peril, because it can fail when you change locale.
> (e.g., if the string is "5.5", it will convert find in en_US but will fail in fr
> because the decimal separator there is "," not ".").  And many spreadsheet
> programs don't do that particular implicit conversion (e.g., Lotus 1-2-3, OOo).
> Thankfully, few spreadsheets actually require this particular conversion.
>
> But we have to handle _existing_ spreadsheet applications, and be usable
> to ordinary users.  They DO expect several implicit conversions, and build
> documents that require them.
>
> What's more, most spreadsheet authors are NOT skilled statisticians.
> We need to make the language relatively easy-to-use for
> these less-skilled people.
>
> Most languages have automatic conversions, because it's too painful
> to use those languages without them.  This is no different, and in any case,
> we're specifying _existing_ behavior which is necessary to exchange
> existing spreadsheet documents.
>
>   
>> There are basically 2 useful approaches:
>> A.) ignore the cell completely
>> -- without raising any error
>> -- without interpreting the cell as 0 or empty string
>> -- should be probably the default
>>     
>
> Looks like we agree completely, at least in the case where the
> function expects a NumberSequence or the like.  If a function
> expects a NumberSequence, it will ignore empty cells completely,
> without interpreting the cell as 0 or the empty string.
> That's already spec'ed.
>
> If a function expects a _scalar_ value, though, and gets a
> reference to an empty cell, what do we do?  Your first two
> criteria conflict; either we raise an error, or we interpret it somehow
> as a scalar (e.g., as a 0 or "").  We've decided to follow standard
> convention, and convert.  It's what existing documents require, and
> what existing users expect.
>
>   
>> B.) Interpret as an NA-Error and propagate the error downstream
>> -- probably only in a minority of cases
>> [e.g. debugging spreadsheets]
>>     
>
> Again, sounds like we agree that these are minority cases.
> You can actually do this today by using ISBLANK or COUNTBLANK, e.g.:
> IF(COUNBLANK(Range)>0; #N/A; dostuff(Range))
>
>   
>> As I (probably) mentioned in some previous message [I can't find it, 
>> though], R has a more advanced mechanism to handle missing values, and I 
>> would welcome any improvement in this area regarding ODF.
>>
>> IF a value is missing, one can pass additional arguments to R-functions, 
>> which specify how to handle those missing values.
>>     
>
> No existing spreadsheet application includes a mechanism for
> additional named arguments.  Obviously there are languages which have it
> (R, Ada, and Python come to mind).  But adding a major new
> unprecedented mechanism to handle a rare case seems extreme.
>
> If you want values to be considered as NA, then put #N/A in that cell.
> Problem solved.  It's also MUCH clearer what's going on.
>
>   
>> So, the default in R is *na.omit* and this is what spreadsheets should 
>> be doing, too. Omit the whole cell, don't include it in any 
>> calculations, do NOT even evaluate the empty cell! Definitely, do not 
>> evaluate it as "0"!
>>     
>
> That's already the case when a NumberSequence or other Sequence is
> the required type, and as I noted earlier, there really isn't a reasonable
> alternative default when a scalar is required.
>
> --- David A. Wheeler
>
>
References:
- Re: [office] Next set of public comments for review (#18-#35)
  - From: Leonard Mada <discoleo@gmx.net>
- Re: [office-comment] Re: [office] Next set of public comments forreview (#18-#35)
  - From: "David A. Wheeler" <dwheeler@dwheeler.com>