OASIS Mailing List ArchivesView the OASIS mailing list archive below
or browse/search using MarkMail.


Help: OASIS Mailing Lists Help | MarkMail Help

office-comment message

[Date Prev] | [Thread Prev] | [Thread Next] | [Date Next] -- [Date Index] | [Thread Index] | [List Home]

Subject: Re: [office-comment] RSQ - Unstable Algorithm

Hi Leonard,

On Friday, 2007-06-08 21:58:26 +0300, Leonard Mada wrote:

>  RSQ(),
>  STEYX(), and
>  VAR(), VARA(),  VARP(), VARPA()
> suffer from the same shortcoming. They all mention the naive algorithm 
> (algorithms based on [n*sum(x[i]^2) - sum(x[i])^2] are inherently 
> unstable,  see paragraphs 6.16.69-73, and 6.16.79-82)
> The *Note* to some of these functions mentions that naive algorithms 
> *run into trouble*, but I do not understand the reason for mentioning 
> the naive algorithm in the ODF specification then. [NO such note for the 
> RSQ(), and VAR...() functions!]
> For the STDEV...() and VAR...() functions:
> using [ sum(x[i] - MEAN(x))^2 ] instead of [n*sum(x[i]^2) - sum(x[i])^2] 
> is a safe alternative.

Actually the OOoCalc implementation does the two-pass approach using
the mean value for the STDEV*() and VAR*() functions. We could give both
formulas, the "naive" mathematically correct one (because that is the
one you'll usually find in books) and the one interesting for better
numerical stability.

> Unfortunately, this stable algorithm implies passing *two times* through 
> the data, which bears a significant time penalty for large data sets or 
> data sets on a network drive!!!

It seems to me the two passes nowadays don't add that much penalty
anymore, as long as you build up a memory array and don't try to read
data twice from the very source, e.g. a network drive..

> There are alternative *one-pass* algorithms that are more stable than 
> the notoriously unstable algorithm (see on wikipedia, 
> http://en.wikipedia.org/wiki/Correlation#Computing_correlation_accurately_in_a_single_pass).

Well, yes, but that's about correlation.

> D. RSQ()
> [...]
> Though, I would think that rewriting the formula to use
>  [ sum(x[i] - MEAN(x))^2 ] instead of [n*sum(x[i]^2) - sum(x[i])^2]
> is a fairly safer alternative. Also, the numerator should be rewritten as:
>  sum[ (MEAN(Y) - Ycalc) * (Ycalc + MEAN(Y) - 2*Y[i]) ]
> (or else the same risk exists, that the computed r^2  is negative OR 
> bigger then 1)

I haven't verified, are these really valid transformations of the former?


 OpenOffice.org Engineering at Sun: http://blogs.sun.com/GullFOSS

[Date Prev] | [Thread Prev] | [Thread Next] | [Date Next] -- [Date Index] | [Thread Index] | [List Home]