[Date Prev] | [Thread Prev] | [Thread Next] | [Date Next] -- [Date Index] | [Thread Index] | [List Home]
Subject: Re: [office-comment] RSQ - Unstable Algorithm
Hi,
I refined the RSQ() algorithm. For real case examples (including the
majestic failure of the old algorithm) AND for recalculations using a
stable algorithm (see 2nd spreadsheet), see the following issue on the
OOo bugs page:
http://www.openoffice.org/issues/show_bug.cgi?id=78250
Eike Rathke wrote:
> Hi Leonard,
>
> On Friday, 2007-06-08 21:58:26 +0300, Leonard Mada wrote:
>
>> RSQ(),
>> STDEV(), STDEVA(), STDDEVP(), STDEVPA(),
>> STEYX(), and
>> VAR(), VARA(), VARP(), VARPA()
>> suffer from the same shortcoming. They all mention the naive
>> algorithm (algorithms based on [n*sum(x[i]2) - sum(x[i])2] are
>> inherently unstable, see paragraphs 6.16.69-73, and 6.16.79-82)
>>
>> ...
>>
>
> Actually the OOoCalc implementation does the two-pass approach using
> the mean value for the STDEV*() and VAR*() functions. We could give both
> formulas, the "naive" mathematically correct one (because that is the
> one you'll usually find in books) and the one interesting for better
> numerical stability.
>
>
>> Unfortunately, this stable algorithm implies passing *two times*
>> through the data, which bears a significant time penalty for large
>> data sets or data sets on a network drive!!!
>>
>
> It seems to me the two passes nowadays don't add that much penalty
> anymore, as long as you build up a memory array and don't try to read
> data twice from the very source, e.g. a network drive.
>
One month ago I had a very unpleasant surprise: I changed the formatting
of a single column from TEXT to NUMBERS on a medium sized spreadsheet
(~4,000 - 5,000 rows) on a 2 GHz machine in a competitors program and
had to wait more than one hour for the recalculations to complete. I
would be very cautious about how fast modern computers are. Programs
are usually NOT and they can break even the most advanced computer.
>> There are alternative *one-pass* algorithms that are more stable than
>> the notoriously unstable algorithm (see on wikipedia,
>> http://en.wikipedia.org/wiki/Correlation#Computing_correlation_accurately_in_a_single_pass).
>>
>>
>
> Well, yes, but that's about correlation.
>
This one could be adapted to all other situations (CORREL() is by the
way broken, too).
>> D. RSQ()
>> [...]
>> Though, I would think that rewriting the formula to use
>> [ sum(x[i] - MEAN(x))2 ] instead of [n*sum(x[i]2) - sum(x[i])2]
>> is a fairly safer alternative. Also, the numerator should be
>> rewritten as:
>> sum[ (MEAN(Y) - Ycalc) * (Ycalc + MEAN(Y) - 2*Y[i]) ]
>> (or else the same risk exists, that the computed r2 is negative OR
>> bigger then 1)
>>
>
> I haven't verified, are these really valid transformations of the former?
>
> Eike
>
I have improved this algorithm even further. Basically, the algorithm
goes like this:
1.) N*sum(Xi2) - (sum(Xi))2 is replaced by:
N * sum(Xi - Xm)2 , where Xm = the MEAN of X
the N will cancel later, so we do NOT need to multiply with N
2.) N*sum(Xi*Yi) - (sum(Xi))*(sum(Yi)) is equivalent to:
N*sum[(Xi-Xm)*(Yi-Ym)]
3.) dividing the previous 2 quantities and canceling N gives:
b = sum[(Xi-Xm)*(Yi-Ym)]/sum(Xi-Xm)2
4.) Ycalc[i] = a + b * Xi
adding every Ycalc gives:
sum(Yc[i]) = sum(a) + b*sum(Xi)
however sum(Yc[i]) = sum(Yi) = N*Ym, therefore
N*Ym = N*a + N*b*Xm => canceling N
Ym = a + b * Xm =>
a = Ym - b * Xm
[I actually can show mathematically that *this is indeed TRUE*.]
5.) and final step, calculating R2
first calculate:
Yfirst = Yi - Ym
Ysecond = Yi - Ycalc[i] , then compute
Yfirst - Ysecond, and
Yfirst + Ysecond
and finally:
R2 = sum[(Yf - Ys)*(Yf + Ys)]/sum(Yi - Ym)2
As you can see, this algorithm works even for the very difficult test
case I provided. (see the second spreadsheet on the OOo bug tracking site)
For steps 1-4 I am pretty confident that they are the best method to
compute "a" and "b". I am still unsure IF step 5 is the best way to get
R2, though this is surely more stable than the one provided in the
original formula. There could exist even more stable algorithms (by
decomposing sum[(Yf - Ys)*(Yf + Ys)], BUT I have NO idea how to
decompose it, yet).
Hope this will help.
Sincerely,
Leonard Mada
[Date Prev] | [Thread Prev] | [Thread Next] | [Date Next] -- [Date Index] | [Thread Index] | [List Home]