office-formula message

Subject: Re: [office-formula] Our next adventure: Types and conversions
From: "David A. Wheeler" <dwheeler@dwheeler.com>
To: office-formula@lists.oasis-open.org
Date: Sat, 01 Jul 2006 22:03:09 -0400
(Forgive me if you see this twice; I posted this earlier, but
for some reason it didn't seem to get sent out.)

==============================================

A while back I posted a request to discuss typing issues.
Below are the various public comments; let me know if you
want your private emails to me exposed.  BTW, although I _take_
private email, I'd really prefer that people posted publicly...
people cannot publicly reply, and nothing will change in the
spec if it doesn't come out publicly :-).

The "logical" type seems to be sorted out fairly easily -
option "C" seems the one people can agree on. That means that
in the spec we can describe a logical type, but clarify that
(1) it can be IMPLEMENTED using simply the numeric type,
instead of REQUIRING a distinct logical type,
(2) PERMIT a distinct logical type, and
(3) auto-convert... if you expect a single number, and get a
distinct logical type, auto-convert to 0 and 1.  If you expect
a logical and get a number, auto-convert as well.

The text/numeric issue was known to be thorny, and
look... it is!   We do NOT have consensus on any particular
approach.  I think I understand WHY people have their
different viewpoints.  But it is quite possible that we
will never come TO a consensus, and we need to make sure that
this does not hold up the rest of the work (more than it has).
Those who primarily value Excel compatibility want to do
automatic text-to-number conversion.  Yet that is almost
insane when trying to share spreadsheets between locales... and
most expect cross-locale sharing to increase, not decrease.
Most implementations convert to 0, which is at least locale-
insensitive, but it also can silently propogate errors.
"Error" is the safest value for an implementation, but I don't
hear many wanting to MANDATE that either!  And there's more,
see below for some of the comments.

Clearly it would be better to have an agreement on what
happens when text and numbers are combined, because then
naive users who do that can be assured that their spreadsheets
"work everywhere".  But that assumes that a single answer
can be agreed on.  It is worthless to write a spec that people
will NOT implement, and given the passion of some viewpoints,
I think that is a significant risk.  And we certainly don't
want to mandate an approach that in retrospect is found "wrong".

So the obvious question to me is, "can this spec
be useful at all without
choosing any particular approach?"   I believe in this case
the answer is "yes", at least for the first version of the spec.
People can simply always use numbers as input to other numbers;
this is what most people do, and that ALWAYS works.   Creating a
"VALUEL(text [; locale])" function would address other cases and
is probably a good idea, though it'd be a
committee invention and those always worry me.

We need to MOVE ON.  So here's what I suggest...

For NOW, we won't require any particular text/number conversion
algorithm, and we'll be clear that we won't.  That doesn't mean
the final spec will allow all options; anyone can still
raise the issue here in this forum, let's continue discussion.
If we CAN find a consensus that is more specific, GREAT!!!
But if not, users can still create spreadsheets that exchange
easily, they just need to not mix text/numbers. And remember, this
is only round one -- many specs add specificity as time goes
on (and as certain areas that were once controversial gain clarity).

But we need to work on other things too.
Given a lack of consensus on any
particular algorithm/approach, I think we should simply document
that a range of options are allowed in this version of the spec,
until we can agree on something better.  If we can agree on
something better before we deliver the document to the
ODF committee, great!!  But let's not hold up the
rest of things trying to gain agreement, which we may never get.



David A. Wheeler wrote:
> We've discussed syntax.  The next steps I see are:
> (1) Discuss types & the rules for their interaction,
> (2) Determine how to identify subsets
> (3) Define the functions
> 
> So, we need to discuss types & the rules for their interaction.
> Below are some options & my opinions, but my opinions are
> just that: opinions.  I want to know what the group consensus
> is, not just my opinion.  We've discussed many of these issues
> before, so hopefully what's below is a summary of that.
> 
> First: Which types?
> All seem to agree that implementations may add new types,
> but that certain types are so common that specifying them
> is valuable for interoperability.  All seem to agree that
> Text (String) and Number are two of those types.
> Many implementations have a "Logical" type that is
> _distinguishable_ from Number, while in others, a logical
> value is simple a Number that is 0 or 1 (with nonzero treated
> as true).  References are also another type.  There are more, but
> let's start there.  I see several options:
> A. Logical is REQUIRED to be distinct from Number.
>     Inconsistent with OOo, Lotus 1-2-3, & many others.
> B. Logical is REQUIRED to be the same as Number:
>     Inconsistent with Excel, Gnumeric.
> C. "Logical" used as a notional type (so that we can easily
>    identify functions that take/return logicals), but we
>    explicitly permit EITHER of the above.
> I recommend "C".  If implementations can co-exist
> with this variance, I think we can too.


Carrera: I would be happy with either A or C. I would not be happy with
B. A lot of standard programming languages do C or something like it.

Metcalf: I think "C" is OK.  Given the current diversity of
applications, it may be the only realistic choice.

Weir:  How about A, but allow automatic conversion to Number type?  I
believe this is different from C, in that the conversion from Number to
Logical would not be automatic.

(I clarified): I think that would be one of the sub-options of A
(A1: auto-convert, A2: do not auto-convert).  The current
implementations that have distinct logical types (Excel, Gnumeric,
SheetToGo, and probably others) all do auto-conversion
if the expected type is Number.

Kernick: I don't like leaving things so open, but I'm not sure I have a
better solution than C, I want to give it more thought.



> 
> Next up: how do they interact? In particular: if a function
> expects a Number but gets a Text value, then what?
> A. Text auto-converted to Number.  Excel & Gnumeric
>     do this.  There is a BIG PROBLEM here involving
>     locale... a conversion can work in one locale, & fail
>     in another.
>     A1. Conversions usually work using the
>     "current locale" this is a problem -- the SAME spreadsheet
>     would work in one place and fail in another.  Yes, many
>     implementations (e.g., Excel) really do this.
>     A2. An alternative would be to save the
>     locale of a spreadsheet on creation, and ALWAYS use that
>     "sheet locale" on an automatic conversion. But note that
>     nobody does that, so we have the risk of invention
>     without experience.
> B. Text converted to 0.  Lotus 1-2-3 does this.
> C. Text converted to 0 if via reference, and auto-converted
>     to number if in-line.  OOo 2.0 does this.
> D. Text converted to error.  Eike thinks this would be safest,
>     since it would force spreadsheet authors to notice.
>     I know of no one who does this.
> E. Allow some set of the above.  Spreadsheet users could
>     add VALUE() calls where they wanted conversions.
>     One problem: VALUE() doesn't have a fixed locale, it
>     uses the current locale (usually).  We could modify the
>     definition of VALUE, or create VALUEL(v;locale).
>     (Creating a new function as a standards invention
>     seems less evil to me, though it's still not ideal.)
> I recommend (E) with "all of the above except A2".
> I'm not sure we want to create VALUEL(), though
> it'd be nice to have.


Carrera: I don't like A. I like B, D and E. If I were designing a
spreadsheet for myself I'd pick B or D but I think that compatibility
with existing products demands E. So my vote would be for E.

== Metcalf ==
A2 might be OK if the locale is not allowed to vary and the formula
representation in the file would always use some particular locale (C, I
suppose).  After reading the file, the application would convert to the
user's locale, but the formula is always stored in the same way
everywhere.  That would require a conversion on both output and input,
though.

My primary concern is that my spreadsheets be readable anwhere.  It is
very important that valid implementations get the same answer.   D would
not work if a spreadsheet works for me, but fails with an error for
someone else.   Is that possible?

E might work, but would that require care on the part of the author to
use the VALUEL function to guarantee that a spreadsheet is portable?  Or
could that be automated?

I don't like B or C, though in the end we might be stuck with supporting
something like that, give that some applications already do this.


== Weir ==

Generally I'd rather give an error than to do something which has a good
chance of being wrong.  We've come a long way since spreadsheets first
came about.  Back then we programmed in assembler, and maybe the old C
before prototypes, when everything looked like an int.  Now we've
generally learned that strict typechecking is often a good thing.  But
this view is not universally held.  What makes sense for a spreadsheet
which is used by non-programmers?  We want them to avoid the common
mistakes, but we don't want to force them to work hard at it.

So, auto-conversion to text in the general case is dangerous, and I
don't mind forcing the user to invoke it specifically where needed,
perhaps specifying a fixed locale, or defaulting to the current runtime
locale.

So, I think I'm suggesting E, without A1 and without A2.  So, an error
for a naked use of text in the context of a number,  but allow
Value(text;locale) and value(text) to avoid the error.

== Kernick ==
I don't accept that it is only geeks who end up with =A1+3 (where
A1="3") and expect 6. I my organisation deal with a lot of spreadsheet
imported from text files spat out by customers' systems. These often end
up with conversion issues. Computers should just work! They shouldn't
put people through unnecessary pain or wearyingly steep learning curves.
Equally, they should ensure the user knows what is going on to ensure
they are getting the results they are seeking. This is a difficult path
to tread.

It is important to always remember that we are talking about the file
format not the UI.
If we insist on type conversion in the file format that doesn't preclude
implementations from doing implicit type conversions (A) or requiring
explicit conversions; but they MUST save with explicit conversions. So
file format has =VALUEL("3.5"; "GB")+3 and the UI can display to the
user ="3.5"+3, because it does A. Another implementation may display
=VALUE("3.5")+3, because it doesn't. But they must save back to
=VALUEL("3.5"; "GB")+3.

I don't think we should ignore locale (and I don't think locale can be
sheet wide - A2). The spreadsheet should calculate correctly, or tell
the user that it may not. This is why some argue for D - at least that
is deterministic.
Implementations can either support the VALUEL(v;locale) function or,
when loading the sheet and finding VALUEL functions for a locale other
than the machine locale, warn the user that the sheet contains
conversions from other locales. It could offer the user the choice to
convert the formulae to the machine locale with the caveat that the
results may differ; or it could lock the VALUEL and precedent cells so
that the user can only change those parts of the sheet it can deal with.
These are implementation issues. Ultimately implementations would
support VALUEL and allow a mix of locales in one sheet - so as the sheet
is passed around the globe it always computes correctly.

Some solution along these lines, IMO, brings great flexibility, is
forward looking and doesn't ignore the problem. Or have I missed something?

Can this be applied to the use of the logical type? I haven't thought
that through yet.

== Wheeler reply ==

Hmm, this is yet another approach, one I hadn't accounted for
(or thought of). You're right, you could completely hide this via
functions, which might not be displayed on some systems (due to their
semantics).

One challenge is that you may have ranges of numbers, and all functions
that take ranges must specify the difference too.  The traditional way has
been to append an "A" at the end of some function names; that won't work
for all.

I would add that you probably want a "VALUE-like function, but don't show it
to me", since people will want a VALUE() that shows, too.

== Kernick ==

I'll try. I'm not proposing a hidden function - so I don't think we're
thinking on exactly the same lines yet.

In the most simple form I'm proposing that we implement the VALUE
function with a locale parameter viz VALUE(string, locale).

My reasons why this will work are:
Implementations may choose to hide the locale from the users, and where
the locale matches the machine locale that makes sense (this is not part
of the spec). What implementations do with functions containing another
locale is up to them. To be completely compliant they should perform the
conversion with the specified locale settings. As part of the spec I
suggest we allow implementations to support value in the machine locale
only provided they warn the user before switching locale during file open.

Therefore, I don't think we have a VALUE and VALUEA function, we simply
have a VALUE function. The specification may call it VALUEL and not
actually support the VALUE function at all, in order to distinguish
between the common function in use and the specification. In that sense
the function may be hidden.

== Eike Rathke ==

>Mmm, our long conversations with Eike are going to start again, it seems.

  :-)

Based on my experience with OOo's bug tracking system this indeed is one
of the main differences that keep people mourning about "my spreadsheet
calculates different", just because they don't notice that text doesn't
calculate, respectively calculates as 0 in some cases. This is something
we'll have to address.

>> So, I think I'm suggesting E, without A1 and without A2.  So, an 
error for
>> a naked use of text in the context of a number,  but allow
>> Value(text;locale) and value(text) to avoid the error.

Again we made the same suggestion.



> I don't consider D being basically the same as B. B goes unnoticed,
> whereas D clearly indicates an error the user either has to correct or
> won't get a result at all.

== Mecir ==

Yet they both do restrict functionality. I think this is where our
opinions disagree, you're thinking about Joe the user, whereas I think
about the power users and such. So, the question at hand would be,
whether we want to make things less confusing for group 1, or more
powerful for group 2.
And you know my stand on this :) My experience shows that Joe the user
will be confused about things no matter what, so I'm more inclined
towards making things powerful, rather than limited in functionality
to avoid confusion.



--- David A. Wheeler