office-formula message

Subject: Semantics

From: "David A. Wheeler" <dwheeler@dwheeler.com>
To: office-formula@lists.oasis-open.org
Date: Mon, 27 Feb 2006 20:15:17 -0500

Okay, here's an email specifically about semantics; please reply to this 
email
to discuss semantics.  That way, those of us with threading email clients
can follow the discussion. (Those of you without a threading client... 
good luck :-) ).

Tomas Mecir wrote:
>> * Semantics.  How strongly should we constrain semantics, and how should we determine them?
>>     
> Constraints should be high - that's what this is about. If things are
> too loose, there's no interoperability, because results differ.
>   

I guess I'm thinking we should "constrain semantics as much as we can... 
and no more".
The real test should be, "will typical spreadsheets with formulas 
interchange correctly?"
There may be some areas where we have to leave "wiggle room" in the 
standard, because we
can't reach consensus, but as long as USERS can exchange their 
spreadsheets, that's
not ideal but okay.  But those cases should be as rare as we can make them.

Generally spreadsheets agree on nearly all semantics anyway.
If you SUM a range, EVERYBODY skips the text values in the range.
If you add two numbers, you unsurprisingly get the total of the two 
numbers ;-).
And so on .

> Lowering constraints as low levels is fine, but the highest level
> should be as strict as reasonably possible.
>   
Yes, I think that makes sense.  Where this gets awkward is in
type handling and type conversions, and some examples will help:
* Is there a DISTINGUISHED logical type?  Excel, Gnumeric, and SheetToGo
   treat True/False as NOT a number, and thus SKIP them in SUM, etc.
   (SUM takes a list of NumberSequences, and NumberSequences skip 
non-numbers...
   since Logical is NOT a number, they are SKIPPED).
   OOo2, Lotus 1-2-3, Quattro Pro treat them as just another number, and
   INCLUDE them in SUM.
* What happens if Number is needed, but Text or Reference-to-Text
   is provided?  E.G., operator + takes two Number arguments.
   If given a non-number, everyone does something different (and here are
   my comments on their side-effects):
   + Excel/Gnumeric/SheetToGo:  Text and reference-to-text is taken, and it
      tries to convert to a number. It returns the converted number, 
else an Error.
      Basically, if given text or a reference to text, but the 
operator/function requires
      type Number, they automatically call the function VALUE() on the text.
      So "3"+3 is 6, and if A1 has "3", then A1+3 is 6.  "ab"+3 is an Error.
   + Lotus 1-2-3 and Quattro Pro: Text and reference-to-text is ALWAYS 0.
      So "3"+3 is 3, as is A1+3 and "ab"+3.  This is VERY consistent; 
it's also
      consistent when you compare it to SUM's semantics. But it's
      confusing to naive users who don't understand that "3" and 3 are 
different, and
      it's lethal to those who still type "l" for 1 and "O" for 0 (it 
causes silent errors).
      I'm not sure it's REALLY consistent with NumberSequence semantics; 
I suspect
      that AVERAGE shows that NumberSequence SKIPS the text, instead of
      considering text to be 0.  So the average of "5l" isn't the same 
as "51".
   + OpenOffice.org 2: Inline text is taken and a conversion is 
attempted, returning the
      number (if successful) or Error (if not).  If it's a REFERENCE to 
a text value,
      then 0 is ALWAYS returned.  So "3"+3 is 6, but A1+3 is 3 if A1 
contains "3" (!).
      This inconsistency between "3"+3 and A1+3 is very odd to me.  The 
OOo2 folks'
      argument is that this is more consistent with SUM.
   + KSpread 1.4: Always an Error.  The problem here is that naive users
      don't understand this.. naive users generally are clueless about 
data types
      (data types are NOT documented well by ANY implementation).
   + KSpread 1.5-beta: Text and reference-to-text is taken, and conversion
      attempted; if success, returns converted number, else returns 0.
      The risk of this semantic is that again, conversions "seem" okay, but
      failure go unnoticed.  The value "510" works fine, but "5l0" is 
silently 0
      (remember, many older typists were TRAINED to type l for 1 and
      O for 0, and you REALLY don't want to be surprised by their results).

(VALUE() is the portable string-to-number function... EVERYONE supports
that function to convert a Text value into a Number value.)

The problem with MANDATING these details at lower levels is that they
typically don't matter. Most people use Numbers where they plan to do 
calculations,
so handling type mixing simply doesn't occur often in practice.  But in the
rare cases it DOES occur, there's a risk of silent problems.

OpenFormula allowed many of these at lower levels, and picked a specific
semantic (the most common one, Excel/Gnumeric/SheetToGo) at level 3.
That way, we didn't impose a requirement where it appeared to be
unnecessary for most, but for those who wanted semantics to be
very specific, that was available.
At the lower levels, if you're worried about text-to-number conversions, 
you're
required to manually invoke VALUE() yourself (which DOES work portably,
everywhere)... which isn't ideal, but is probably rare in practice.

Oh, why that particular semantic at level 3?
One advantage of this particular semantic is that naive users either get
what they were probably expecting, OR they get a warning message
that there's a problem (instead of silently getting the wrong answer).

There's no easy way to force SUM to directly convert text values
in ANY implementation today, and the spec simply reflects that.
Yes, you could abuse IMSUM to do this, but that'd be locale-dependent
due to comma-vs-period issues, not everyone has IMSUM, and it's abuse 
anyway.
If you REALLY want that, create a column/row that uses
VALUE() to do the conversion... or even better, just create
numbers in the first place if you want to use data as numbers.
Generally it's better to make values numbers if you want to use them
as numbers, so there seems to be no need to create a function
to sum ranges of text.

> There should also be formal definitions of what the functions compute,
> and a reference algorithm should be provided, if possible.
>   
A formal definition in terms of a mathematical expression would be good.
I don't think we should mandate a particular algorithm, but mandating 
particular
test case results are fair game, and including NON-normative references to
"how to implement this well" would probably REALLY help implementors 
"get it right".
The OpenFormula project did that with STDEV, for example.. the "obvious" way
turns out to lead to square roots of negative numbers (due to numerical 
effects), so
a reference to the "better way to do it" seemed in order.

--- David A. Wheeler

Follow-Ups:
- Re: [office-formula] Semantics
  - From: "Tomas Mecir" <mecirt@gmail.com>
- Re: [office-formula] Semantics
  - From: Daniel Carrera <daniel.carrera@zmsl.com>

References:
- RE: [office-formula] Proposal: Work ONLY electronically after initial telecomm, except when we decide otherwise
  - From: Mary McRae <marypmcrae@gmail.com>
- Re: [office-formula] Proposal: Work ONLY electronically after initialtelecomm, except when we decide otherwise
  - From: Daniel Carrera <daniel.carrera@zmsl.com>
- Re: [office-formula] Proposal: Work ONLY electronically afterinitial telecomm, except when we decide otherwise
  - From: "David A. Wheeler" <dwheeler@dwheeler.com>
- Key Issues
  - From: "David A. Wheeler" <dwheeler@dwheeler.com>
- Re: [office-formula] Key Issues
  - From: "Tomas Mecir" <mecirt@gmail.com>