OASIS Mailing List ArchivesView the OASIS mailing list archive below
or browse/search using MarkMail.

# office-comment message

[Date Prev] | [Thread Prev] | [Thread Next] | [Date Next] -- [Date Index] | [Thread Index] | [List Home]

Subject: Re: [office-comment] RSQ - Unstable Algorithm

• From: Eike Rathke <erack@sun.com>
• To: office-comment@lists.oasis-open.org
• Date: Wed, 13 Jun 2007 19:06:57 +0200

```Hi Leonard,

On Friday, 2007-06-08 21:58:26 +0300, Leonard Mada wrote:

>  RSQ(),
>  STDEV(), STDEVA(), STDDEVP(), STDEVPA(),
>  STEYX(), and
>  VAR(), VARA(),  VARP(), VARPA()
> suffer from the same shortcoming. They all mention the naive algorithm
> (algorithms based on [n*sum(x[i]^2) - sum(x[i])^2] are inherently
> unstable,  see paragraphs 6.16.69-73, and 6.16.79-82)
>
> The *Note* to some of these functions mentions that naive algorithms
> *run into trouble*, but I do not understand the reason for mentioning
> the naive algorithm in the ODF specification then. [NO such note for the
> RSQ(), and VAR...() functions!]
>
>
> B. STABLE ALGORITHM
> For the STDEV...() and VAR...() functions:
> using [ sum(x[i] - MEAN(x))^2 ] instead of [n*sum(x[i]^2) - sum(x[i])^2]
> is a safe alternative.

Actually the OOoCalc implementation does the two-pass approach using
the mean value for the STDEV*() and VAR*() functions. We could give both
formulas, the "naive" mathematically correct one (because that is the
one you'll usually find in books) and the one interesting for better
numerical stability.

> Unfortunately, this stable algorithm implies passing *two times* through
> the data, which bears a significant time penalty for large data sets or
> data sets on a network drive!!!

It seems to me the two passes nowadays don't add that much penalty
anymore, as long as you build up a memory array and don't try to read
data twice from the very source, e.g. a network drive..

> There are alternative *one-pass* algorithms that are more stable than
> the notoriously unstable algorithm (see on wikipedia,
> http://en.wikipedia.org/wiki/Correlation#Computing_correlation_accurately_in_a_single_pass).

Well, yes, but that's about correlation.

> D. RSQ()
> [...]
> Though, I would think that rewriting the formula to use
>  [ sum(x[i] - MEAN(x))^2 ] instead of [n*sum(x[i]^2) - sum(x[i])^2]
> is a fairly safer alternative. Also, the numerator should be rewritten as:
>  sum[ (MEAN(Y) - Ycalc) * (Ycalc + MEAN(Y) - 2*Y[i]) ]
> (or else the same risk exists, that the computed r^2  is negative OR
> bigger then 1)

I haven't verified, are these really valid transformations of the former?

Eike

--
OpenOffice.org Engineering at Sun: http://blogs.sun.com/GullFOSS
```

[Date Prev] | [Thread Prev] | [Thread Next] | [Date Next] -- [Date Index] | [Thread Index] | [List Home]