[Date Prev] | [Thread Prev] | [Thread Next] | [Date Next] -- [Date Index] | [Thread Index] | [List Home]
Subject: Re: [office-comment] RSQ - Unstable Algorithm
Hi, I would like to discuss the following topics: A. FUNCTIONS AFFECTED: STDEV...(), VAR...(), STEYX() and RSQ() B. STABLE ALGORITHM: use 2-pass C. ALTERNATIVE FUNCTIONS: implement alternative functions using fast algorithms D. RSQ(): special problem: three-pass algorithm <!-- @page { size: 21.59cm 27.94cm; margin: 2cm } H1 { margin-top: 0.85cm; margin-bottom: 0.21cm; border-top: 1px solid #000000; border-bottom: none; border-left: none; border-right: none; padding-top: 0.21cm; padding-bottom: 0cm; padding-left: 0cm; padding-right: 0cm; color: #000099; page-break-before: always } H1.western { font-family: "Times New Roman", serif; font-size: 18pt } H1.cjk { font-family: "Lucida Sans Unicode"; font-size: 18pt } H1.ctl { font-family: "Arial", sans-serif; font-size: 16pt } P { margin-bottom: 0cm } H2 { margin-bottom: 0.21cm; border: none; padding: 0cm; color: #000099; page-break-before: auto; page-break-after: auto } H2.western { font-family: "Times New Roman", serif; font-size: 14pt } H2.cjk { font-family: "Lucida Sans Unicode"; font-size: 14pt } H2.ctl { font-family: "Arial", sans-serif; font-size: 14pt; font-style: italic; font-weight: medium } H3 { margin-bottom: 0.21cm; border: none; padding: 0cm; color: #000099; page-break-before: auto; page-break-after: auto } H3.western { font-family: "Times New Roman", serif; font-size: 13pt } H3.cjk { font-family: "Lucida Sans Unicode"; font-size: 13pt } H3.ctl { font-family: "Arial", sans-serif; font-size: 13pt; font-style: italic } --> A. FUNCTIONS AFFECTED It seems that: RSQ(), STDEV(), STDEVA(), STDDEVP(), STDEVPA(), STEYX(), and VAR(), VARA(), VARP(), VARPA() suffer from the same shortcoming. They all mention the naive algorithm (algorithms based on [n*sum(x[i]^2) - sum(x[i])^2] are inherently unstable, see paragraphs 6.16.69-73, and 6.16.79-82) The *Note* to some of these functions mentions that naive algorithms *run into trouble*, but I do not understand the reason for mentioning the naive algorithm in the ODF specification then. [NO such note for the RSQ(), and VAR...() functions!] B. STABLE ALGORITHM For the STDEV...() and VAR...() functions: using [ sum(x[i] - MEAN(x))^2 ] instead of [n*sum(x[i]^2) - sum(x[i])^2] is a safe alternative. Unfortunately, this stable algorithm implies passing *two times* through the data, which bears a significant time penalty for large data sets or data sets on a network drive!!! There are alternative *one-pass* algorithms that are more stable than the notoriously unstable algorithm (see on wikipedia, http://en.wikipedia.org/wiki/Correlation#Computing_correlation_accurately_in_a_single_pass). C. ALTERNATIVE FUNCTIONS I am very supportive of the idea to implement *two sets* of functions: 1.) the stable two-pass version (as currently named) 2.) a *fast implementation* for every function mentioned above: - should use a fast one-pass algorithm - BUT more stable than the naive algorithm - should be named: STDEV...FAST() D. RSQ() As mentioned earlier, the RSQ() formula uses two (three) times the naive algorithm (AND uses both times the computed values as divisors). Without doing a formal calculation of the error terms, this algorithm feels *very, very unstable*. Unfortunately, calculating the error of the coding algorithm even for something as simple as the standard deviation is NOT trivial, and is certainly ugly for the RSQ() function. I also lost any contact with mathematics in 1995 and took a very different career. I would need to dig up old notes and search through books to be able to formally calculate the error term. Though, I would think that rewriting the formula to use [ sum(x[i] - MEAN(x))^2 ] instead of [n*sum(x[i]^2) - sum(x[i])^2] is a fairly safer alternative. Also, the numerator should be rewritten as: sum[ (MEAN(Y) - Ycalc) * (Ycalc + MEAN(Y) - 2*Y[i]) ] (or else the same risk exists, that the computed r^2 is negative OR bigger then 1) However, this implies a *significant time penalty*! While computing the STDEV() and VAR() functions using this approach implies a two-pass algorithm, the RSQ() will need a *three pass* algorithm: 1.) first compute MEAN(X) and MEAN(Y) 2.) then compute 'a' and 'b' 3.) and ONLY then one is able to compute r^2 (as Ycalc is needed in this last step, AND Ycalc depends on 'a' and 'b') This is really ugly. I would definitely recommend here a faster implementation, e.g. an RSQFAST() function, though I do NOT know the performance and stability of such an algorithm. I will search around to see IF I find a better alternative. Sincerely, Leonard Mada Eike Rathke wrote: > Hi Leonard, > > On Thursday, 2007-06-07 23:04:10 +0300, Leonard Mada wrote: > > >> Unfortunately, the formulas/algorithms given for 'a' and 'b' seem to be >> computationally unstable >> - they both use terms of the form n*sum(x^2) - (sum(x))^2 which are >> notoriously unstable >> >> Although I did NOT do a formal calculation of the error term, having >> *two* occurrences of this unstable calculation, and both times as >> *divisors* makes the formula very unreliable. >> > > Thanks for pointing out. Did you try it out in OOoCalc and noticed > shortcomings? > > >> SUGGESTIONS: >> - remove formula from specification >> > > I think that is not a good option because it would leave the definition > without a mathematical description. Even if the given formula may > numerically be instable at some point it is mathematically correct. > However, we could add a note saying so. > > >> - consult statistician knowledgeable of stable coding algorithms >> > > That of course would be favourable. As I also know from the OOo mailing > lists that you're interested in that area, could you help us or are you > in contact with one? > > Eike
[Date Prev] | [Thread Prev] | [Thread Next] | [Date Next] -- [Date Index] | [Thread Index] | [List Home]