[Date Prev] | [Thread Prev] | [Thread Next] | [Date Next] -- [Date Index] | [Thread Index] | [List Home]

*Subject*: **Re: [office-comment] RSQ - Unstable Algorithm**

*From*:**Leonard Mada <discoleo@gmx.net>***To*: office-comment@lists.oasis-open.org*Date*: Thu, 14 Jun 2007 22:58:55 +0300

Hi, I refined the RSQ() algorithm. For real case examples (including the majestic failure of the old algorithm) AND for recalculations using a stable algorithm (see 2nd spreadsheet), see the following issue on the OOo bugs page: http://www.openoffice.org/issues/show_bug.cgi?id=78250 Eike Rathke wrote: > Hi Leonard, > > On Friday, 2007-06-08 21:58:26 +0300, Leonard Mada wrote: > >> RSQ(), >> STDEV(), STDEVA(), STDDEVP(), STDEVPA(), >> STEYX(), and >> VAR(), VARA(), VARP(), VARPA() >> suffer from the same shortcoming. They all mention the naive >> algorithm (algorithms based on [n*sum(x[i]2) - sum(x[i])2] are >> inherently unstable, see paragraphs 6.16.69-73, and 6.16.79-82) >> >> ... >> > > Actually the OOoCalc implementation does the two-pass approach using > the mean value for the STDEV*() and VAR*() functions. We could give both > formulas, the "naive" mathematically correct one (because that is the > one you'll usually find in books) and the one interesting for better > numerical stability. > > >> Unfortunately, this stable algorithm implies passing *two times* >> through the data, which bears a significant time penalty for large >> data sets or data sets on a network drive!!! >> > > It seems to me the two passes nowadays don't add that much penalty > anymore, as long as you build up a memory array and don't try to read > data twice from the very source, e.g. a network drive. > One month ago I had a very unpleasant surprise: I changed the formatting of a single column from TEXT to NUMBERS on a medium sized spreadsheet (~4,000 - 5,000 rows) on a 2 GHz machine in a competitors program and had to wait more than one hour for the recalculations to complete. I would be very cautious about how fast modern computers are. Programs are usually NOT and they can break even the most advanced computer. >> There are alternative *one-pass* algorithms that are more stable than >> the notoriously unstable algorithm (see on wikipedia, >> http://en.wikipedia.org/wiki/Correlation#Computing_correlation_accurately_in_a_single_pass). >> >> > > Well, yes, but that's about correlation. > This one could be adapted to all other situations (CORREL() is by the way broken, too). >> D. RSQ() >> [...] >> Though, I would think that rewriting the formula to use >> [ sum(x[i] - MEAN(x))2 ] instead of [n*sum(x[i]2) - sum(x[i])2] >> is a fairly safer alternative. Also, the numerator should be >> rewritten as: >> sum[ (MEAN(Y) - Ycalc) * (Ycalc + MEAN(Y) - 2*Y[i]) ] >> (or else the same risk exists, that the computed r2 is negative OR >> bigger then 1) >> > > I haven't verified, are these really valid transformations of the former? > > Eike > I have improved this algorithm even further. Basically, the algorithm goes like this: 1.) N*sum(Xi2) - (sum(Xi))2 is replaced by: N * sum(Xi - Xm)2 , where Xm = the MEAN of X the N will cancel later, so we do NOT need to multiply with N 2.) N*sum(Xi*Yi) - (sum(Xi))*(sum(Yi)) is equivalent to: N*sum[(Xi-Xm)*(Yi-Ym)] 3.) dividing the previous 2 quantities and canceling N gives: b = sum[(Xi-Xm)*(Yi-Ym)]/sum(Xi-Xm)2 4.) Ycalc[i] = a + b * Xi adding every Ycalc gives: sum(Yc[i]) = sum(a) + b*sum(Xi) however sum(Yc[i]) = sum(Yi) = N*Ym, therefore N*Ym = N*a + N*b*Xm => canceling N Ym = a + b * Xm => a = Ym - b * Xm [I actually can show mathematically that *this is indeed TRUE*.] 5.) and final step, calculating R2 first calculate: Yfirst = Yi - Ym Ysecond = Yi - Ycalc[i] , then compute Yfirst - Ysecond, and Yfirst + Ysecond and finally: R2 = sum[(Yf - Ys)*(Yf + Ys)]/sum(Yi - Ym)2 As you can see, this algorithm works even for the very difficult test case I provided. (see the second spreadsheet on the OOo bug tracking site) For steps 1-4 I am pretty confident that they are the best method to compute "a" and "b". I am still unsure IF step 5 is the best way to get R2, though this is surely more stable than the one provided in the original formula. There could exist even more stable algorithms (by decomposing sum[(Yf - Ys)*(Yf + Ys)], BUT I have NO idea how to decompose it, yet). Hope this will help. Sincerely, Leonard Mada

[Date Prev] | [Thread Prev] | [Thread Next] | [Date Next] -- [Date Index] | [Thread Index] | [List Home]