[Date Prev] | [Thread Prev] | [Thread Next] | [Date Next] -- [Date Index] | [Thread Index] | [List Home]

*Subject*: **Re: [office-comment] RSQ - Unstable Algorithm**

*From*:**Eike Rathke <erack@sun.com>***To*: office-comment@lists.oasis-open.org*Date*: Wed, 13 Jun 2007 19:06:57 +0200

Hi Leonard, On Friday, 2007-06-08 21:58:26 +0300, Leonard Mada wrote: > RSQ(), > STDEV(), STDEVA(), STDDEVP(), STDEVPA(), > STEYX(), and > VAR(), VARA(), VARP(), VARPA() > suffer from the same shortcoming. They all mention the naive algorithm > (algorithms based on [n*sum(x[i]^2) - sum(x[i])^2] are inherently > unstable, see paragraphs 6.16.69-73, and 6.16.79-82) > > The *Note* to some of these functions mentions that naive algorithms > *run into trouble*, but I do not understand the reason for mentioning > the naive algorithm in the ODF specification then. [NO such note for the > RSQ(), and VAR...() functions!] > > > B. STABLE ALGORITHM > For the STDEV...() and VAR...() functions: > using [ sum(x[i] - MEAN(x))^2 ] instead of [n*sum(x[i]^2) - sum(x[i])^2] > is a safe alternative. Actually the OOoCalc implementation does the two-pass approach using the mean value for the STDEV*() and VAR*() functions. We could give both formulas, the "naive" mathematically correct one (because that is the one you'll usually find in books) and the one interesting for better numerical stability. > Unfortunately, this stable algorithm implies passing *two times* through > the data, which bears a significant time penalty for large data sets or > data sets on a network drive!!! It seems to me the two passes nowadays don't add that much penalty anymore, as long as you build up a memory array and don't try to read data twice from the very source, e.g. a network drive.. > There are alternative *one-pass* algorithms that are more stable than > the notoriously unstable algorithm (see on wikipedia, > http://en.wikipedia.org/wiki/Correlation#Computing_correlation_accurately_in_a_single_pass). Well, yes, but that's about correlation. > D. RSQ() > [...] > Though, I would think that rewriting the formula to use > [ sum(x[i] - MEAN(x))^2 ] instead of [n*sum(x[i]^2) - sum(x[i])^2] > is a fairly safer alternative. Also, the numerator should be rewritten as: > sum[ (MEAN(Y) - Ycalc) * (Ycalc + MEAN(Y) - 2*Y[i]) ] > (or else the same risk exists, that the computed r^2 is negative OR > bigger then 1) I haven't verified, are these really valid transformations of the former? Eike -- OpenOffice.org Engineering at Sun: http://blogs.sun.com/GullFOSS

**References**:**RSQ - Unstable Algorithm***From:*Leonard Mada <discoleo@gmx.net>

**Re: [office-comment] RSQ - Unstable Algorithm***From:*Eike Rathke <erack@sun.com>

**Re: [office-comment] RSQ - Unstable Algorithm***From:*Leonard Mada <discoleo@gmx.net>

[Date Prev] | [Thread Prev] | [Thread Next] | [Date Next] -- [Date Index] | [Thread Index] | [List Home]