office-formula message

Subject: Re: [office-formula] List of functions in OpenFormula, dealingwith LEGACY.*

From: "David A. Wheeler" <dwheeler@dwheeler.com>
To: office-formula@lists.oasis-open.org
Date: Wed, 11 Jun 2008 10:33:35 -0400 (EDT)

(Regarding CHIDIST / CHIDISTR, and LEGACY.*):

Andreas J Guelzow
> When we discussed the LEGACY naming a long time ago, the point was that
> we wanted to use a name that clearly indicated that these functions are
> only included for compatibilities sake and really should not be used in
> new spreadsheets. So the fact that they sort far away from their regular
> position is in fact desirable.

Agreed.  But I don't see any particular _problem_ with using a function
that returns the right-tailed value. It's well-defined, and there
are reasons that people would WANT a right tail.
So, I don't see any particular reason to retire the function.

The problem is not that it's useless; the problem is that
the function's name is inconsistent with all other DIST functions
(which return the LEFT tail).

So, let's make sure that the name of a DIST function that returns a
RIGHT tail is, in fact, clearly different in the spec.  An "R" or
"RIGHT" suffix (or something similar) would do the trick.
I'm increasingly thinking that "RIGHT" would be a better suffix than "R";
suffixes (unlike prefixes) are easy to visually miss, and they change the results.
So, I like the idea of naming them "CHISQDIST" and "CHISQDISTRIGHT".
That way, the names are consistent, and if an app lazily
writes out "CHIDIST", the reader can tell that they did NOT mean CHISQDIST.

I don't even think we need "right tail"
functions for the other distributions, since this is the only function where
spreadsheets have typically implemented the right tail.
But we should use a naming convention that we COULD
extend to the other DIST functions, if that turns out to be important.

Does that sound okay?

> > I presume, then, that:
> > Excel NORMSDIST(x) = NORMDIST(x;0;1;TRUE()) ?
> > Is that right?
> 
> yes

Okay. So we _could_ retain NORMSDIST(x) as a synonym for
NORMDIST(x; 0; 1; TRUE()).  By retaining it, we would enable people to
round-trip a formula and retain a distinctive name (if they want to).

So again, NORMSDIST is a well-defined function, so I don't think we
need to mark it as "LEGACY".  It has perfectly clear (and reasonable)
semantics.

Sound reasonable?

> > As far as "have them optional" in NORMDIST, we could certainly do that,
> > just by declaring default values for the later parameters, presumably:
> >  Number Mean = 0; Number StandardDeviation = 0; Logical Cumulative = TRUE()
> 
> Number StandardDeviation = 1  (0 would never be used, but I guess that
> was a typo)

Whups, you're right, I meant 1.

> > If we do that, then I think we need to define defaults for the other
> > similar DIST functions.
> 
> well for the "cumulative" argument. THere is usually not a natural
> default for the other arguments.

Okay.  But I think Cumulative = TRUE() is reasonable for all DIST
functions, yes?  We should provide helpful defaults to users
where they make sense, it makes functions easier to use.

> > > CHITEST really doesn't make any mathematical sense.
> > 
> > I think that needs to be explained in the normative text, if that's really
> > the case.  I don't understand _why_ it doesn't make any sense -
> > can you help me understand that?  Preferably in text that we can
> > include, to explain to users why they shouldn't use it?
> 
> Which test does it perform if the data areas are rectangular? The
> implementation of CHITEST in Excel is a horrible mix of various tests.
> 
> I doubt that anybody who is using this really knows what they are
> calculating.
> 
> If r=1 or c=1 then it does calculate a Goodness of Fit Test. I have no
> idea what it calculates if they are both larger than 1. The degree of
> freedom looks like that it may be a test for independence or a test for
> homogeneity, but in that case how are the observed values used?

Well, let's ask Microsoft.  This may be a prime example of a function that
SHOULD be a "LEGACY." function, because its functionality is just wrong /
screwed up.

But what surprises me is that there's no corresponding "CHITEST" or
"CHISQTEST" that implements the "correct" behavior in our spec.
I understand creating "LEGACY." functions that implement
screwed-up semantics for the purposes of interoperability, but if
want people to stop using them, we need to provide them an alternative.

Can you help develop the definition of what they SHOULD be doing, soon?
Again, if the semantics are different from traditional spreadsheets, it
might be wise to name it slightly differently - I would suggest
CHISQTEST instead of CHITEST.

As you know, spreadsheets have long gotten a bad rap about handling statistics
(e.g., http://www.cs.uiowa.edu/~jcryer/JSMTalk2001.pdf).  For some things,
the best approach is to use a specialized tool (like R), but I think
we can make things better.

--- David A. Wheeler

Follow-Ups:
- Re: [office-formula] List of functions in OpenFormula, dealingwith LEGACY.*
  - From: "David A. Wheeler" <dwheeler@dwheeler.com>

References:
- List of functions in OpenFormula, dealing with LEGACY.*
  - From: "David A. Wheeler" <dwheeler@dwheeler.com>
- Re: [office-formula] List of functions in OpenFormula, dealingwith LEGACY.*
  - From: Andreas J Guelzow <aguelzow@math.concordia.ab.ca>
- Re: [office-formula] List of functions in OpenFormula, dealingwith LEGACY.*
  - From: "David A. Wheeler" <dwheeler@dwheeler.com>
- Re: [office-formula] List of functions in OpenFormula, dealingwith LEGACY.*
  - From: Andreas J Guelzow <aguelzow@math.concordia.ab.ca>