office-comment message

Subject: Re: [office-comment] RSQ - Unstable Algorithm

From: Eike Rathke <erack@sun.com>
To: office-comment@lists.oasis-open.org
Date: Wed, 13 Jun 2007 19:06:57 +0200

Hi Leonard,

On Friday, 2007-06-08 21:58:26 +0300, Leonard Mada wrote:

>  RSQ(),
>  STDEV(), STDEVA(), STDDEVP(), STDEVPA(),
>  STEYX(), and
>  VAR(), VARA(),  VARP(), VARPA()
> suffer from the same shortcoming. They all mention the naive algorithm 
> (algorithms based on [n*sum(x[i]^2) - sum(x[i])^2] are inherently 
> unstable,  see paragraphs 6.16.69-73, and 6.16.79-82)
> 
> The *Note* to some of these functions mentions that naive algorithms 
> *run into trouble*, but I do not understand the reason for mentioning 
> the naive algorithm in the ODF specification then. [NO such note for the 
> RSQ(), and VAR...() functions!]
> 
> 
> B. STABLE ALGORITHM
> For the STDEV...() and VAR...() functions:
> using [ sum(x[i] - MEAN(x))^2 ] instead of [n*sum(x[i]^2) - sum(x[i])^2] 
> is a safe alternative.

Actually the OOoCalc implementation does the two-pass approach using
the mean value for the STDEV*() and VAR*() functions. We could give both
formulas, the "naive" mathematically correct one (because that is the
one you'll usually find in books) and the one interesting for better
numerical stability.


> Unfortunately, this stable algorithm implies passing *two times* through 
> the data, which bears a significant time penalty for large data sets or 
> data sets on a network drive!!!

It seems to me the two passes nowadays don't add that much penalty
anymore, as long as you build up a memory array and don't try to read
data twice from the very source, e.g. a network drive..

> There are alternative *one-pass* algorithms that are more stable than 
> the notoriously unstable algorithm (see on wikipedia, 
> http://en.wikipedia.org/wiki/Correlation#Computing_correlation_accurately_in_a_single_pass).

Well, yes, but that's about correlation.


> D. RSQ()
> [...]
> Though, I would think that rewriting the formula to use
>  [ sum(x[i] - MEAN(x))^2 ] instead of [n*sum(x[i]^2) - sum(x[i])^2]
> is a fairly safer alternative. Also, the numerator should be rewritten as:
>  sum[ (MEAN(Y) - Ycalc) * (Ycalc + MEAN(Y) - 2*Y[i]) ]
> (or else the same risk exists, that the computed r^2  is negative OR 
> bigger then 1)

I haven't verified, are these really valid transformations of the former?

  Eike

-- 
 OpenOffice.org Engineering at Sun: http://blogs.sun.com/GullFOSS

References:
- RSQ - Unstable Algorithm
  - From: Leonard Mada <discoleo@gmx.net>
- Re: [office-comment] RSQ - Unstable Algorithm
  - From: Eike Rathke <erack@sun.com>
- Re: [office-comment] RSQ - Unstable Algorithm
  - From: Leonard Mada <discoleo@gmx.net>