office-comment message

Subject: Re: [office-comment] RSQ - Unstable Algorithm
From: Leonard Mada <discoleo@gmx.net>
To: office-comment@lists.oasis-open.org
Date: Thu, 14 Jun 2007 22:58:55 +0300
Hi,

I refined the RSQ() algorithm. For real case examples (including the 
majestic failure of the old algorithm) AND for recalculations using a 
stable algorithm (see 2nd spreadsheet), see the following issue on the 
OOo bugs page:
http://www.openoffice.org/issues/show_bug.cgi?id=78250



Eike Rathke wrote:
> Hi Leonard,
>
> On Friday, 2007-06-08 21:58:26 +0300, Leonard Mada wrote:
>  
>>  RSQ(),
>>  STDEV(), STDEVA(), STDDEVP(), STDEVPA(),
>>  STEYX(), and
>>  VAR(), VARA(),  VARP(), VARPA()
>> suffer from the same shortcoming. They all mention the naive 
>> algorithm (algorithms based on [n*sum(x[i]2) - sum(x[i])2] are 
>> inherently unstable,  see paragraphs 6.16.69-73, and 6.16.79-82)
>>
>> ...
>>     
>
> Actually the OOoCalc implementation does the two-pass approach using
> the mean value for the STDEV*() and VAR*() functions. We could give both
> formulas, the "naive" mathematically correct one (because that is the
> one you'll usually find in books) and the one interesting for better
> numerical stability.
>
>  
>> Unfortunately, this stable algorithm implies passing *two times* 
>> through the data, which bears a significant time penalty for large 
>> data sets or data sets on a network drive!!!
>>     
>
> It seems to me the two passes nowadays don't add that much penalty
> anymore, as long as you build up a memory array and don't try to read
> data twice from the very source, e.g. a network drive.
>   

One month ago I had a very unpleasant surprise: I changed the formatting 
of a single column from TEXT to NUMBERS on a medium sized spreadsheet 
(~4,000 - 5,000 rows) on a 2 GHz machine in a competitors program and 
had to wait more than one hour for the recalculations to complete. I 
would be very cautious about  how fast modern computers are. Programs 
are usually NOT and they can break even the most advanced computer.

>> There are alternative *one-pass* algorithms that are more stable than 
>> the notoriously unstable algorithm (see on wikipedia, 
>> http://en.wikipedia.org/wiki/Correlation#Computing_correlation_accurately_in_a_single_pass). 
>>
>>     
>
> Well, yes, but that's about correlation.
>   

This one could be adapted to all other situations (CORREL() is by the 
way broken, too).

>> D. RSQ()
>> [...]
>> Though, I would think that rewriting the formula to use
>>  [ sum(x[i] - MEAN(x))2 ] instead of [n*sum(x[i]2) - sum(x[i])2]
>> is a fairly safer alternative. Also, the numerator should be 
>> rewritten as:
>>  sum[ (MEAN(Y) - Ycalc) * (Ycalc + MEAN(Y) - 2*Y[i]) ]
>> (or else the same risk exists, that the computed r2  is negative OR 
>> bigger then 1)
>>     
>
> I haven't verified, are these really valid transformations of the former?
>
>   Eike
>   

I have improved this algorithm even further. Basically, the algorithm 
goes like this:
1.) N*sum(Xi2) - (sum(Xi))2  is replaced by:
     N * sum(Xi - Xm)2 , where Xm = the MEAN of X
     the N will cancel later, so we do NOT need to multiply with N

2.) N*sum(Xi*Yi) - (sum(Xi))*(sum(Yi))  is equivalent to:
     N*sum[(Xi-Xm)*(Yi-Ym)]

3.) dividing the previous 2 quantities and canceling N gives:
     b = sum[(Xi-Xm)*(Yi-Ym)]/sum(Xi-Xm)2

4.) Ycalc[i] = a + b * Xi
    adding every Ycalc gives:
    sum(Yc[i]) = sum(a) + b*sum(Xi)
    however sum(Yc[i]) = sum(Yi) = N*Ym, therefore
    N*Ym = N*a + N*b*Xm => canceling N
    Ym = a + b * Xm =>
    a = Ym - b * Xm
    [I actually can show mathematically that *this is indeed TRUE*.]

5.) and final step, calculating R2
    first calculate:
    Yfirst      = Yi - Ym
    Ysecond = Yi - Ycalc[i] , then compute
    Yfirst - Ysecond, and
    Yfirst + Ysecond
    and finally:
    R2 = sum[(Yf - Ys)*(Yf + Ys)]/sum(Yi - Ym)2

As you can see, this algorithm works even for the very difficult test 
case I provided. (see the second spreadsheet on the OOo bug tracking site)

For steps 1-4 I am pretty confident that they are the best method to 
compute "a" and "b". I am still unsure IF step 5 is the best way to get 
R2, though this is surely more stable than the one provided in the 
original formula. There could exist even more stable algorithms (by 
decomposing sum[(Yf - Ys)*(Yf + Ys)], BUT I have NO idea how to 
decompose it, yet).

Hope this will help.

Sincerely,

Leonard Mada