OASIS Mailing List ArchivesView the OASIS mailing list archive below
or browse/search using MarkMail.


Help: OASIS Mailing Lists Help | MarkMail Help

office-comment message

[Date Prev] | [Thread Prev] | [Thread Next] | [Date Next] -- [Date Index] | [Thread Index] | [List Home]

Subject: Re: [office-comment] RSQ - Unstable Algorithm


I would like to discuss the following topics:
C. ALTERNATIVE FUNCTIONS: implement alternative functions using fast 
D. RSQ(): special problem: three-pass algorithm

<!-- @page { size: 21.59cm 27.94cm; margin: 2cm } H1 { margin-top: 
0.85cm; margin-bottom: 0.21cm; border-top: 1px solid #000000; 
border-bottom: none; border-left: none; border-right: none; padding-top: 
0.21cm; padding-bottom: 0cm; padding-left: 0cm; padding-right: 0cm; 
color: #000099; page-break-before: always } H1.western { font-family: 
"Times New Roman", serif; font-size: 18pt } H1.cjk { font-family: 
"Lucida Sans Unicode"; font-size: 18pt } H1.ctl { font-family: "Arial", 
sans-serif; font-size: 16pt } P { margin-bottom: 0cm } H2 { 
margin-bottom: 0.21cm; border: none; padding: 0cm; color: #000099; 
page-break-before: auto; page-break-after: auto } H2.western { 
font-family: "Times New Roman", serif; font-size: 14pt } H2.cjk { 
font-family: "Lucida Sans Unicode"; font-size: 14pt } H2.ctl { 
font-family: "Arial", sans-serif; font-size: 14pt; font-style: italic; 
font-weight: medium } H3 { margin-bottom: 0.21cm; border: none; padding: 
0cm; color: #000099; page-break-before: auto; page-break-after: auto } 
H3.western { font-family: "Times New Roman", serif; font-size: 13pt } 
H3.cjk { font-family: "Lucida Sans Unicode"; font-size: 13pt } H3.ctl { 
font-family: "Arial", sans-serif; font-size: 13pt; font-style: italic } 
It seems that:
  STEYX(), and
  VAR(), VARA(),  VARP(), VARPA()
suffer from the same shortcoming. They all mention the naive algorithm 
(algorithms based on [n*sum(x[i]^2) - sum(x[i])^2] are inherently 
unstable,  see paragraphs 6.16.69-73, and 6.16.79-82)

The *Note* to some of these functions mentions that naive algorithms 
*run into trouble*, but I do not understand the reason for mentioning 
the naive algorithm in the ODF specification then. [NO such note for the 
RSQ(), and VAR...() functions!]

For the STDEV...() and VAR...() functions:
using [ sum(x[i] - MEAN(x))^2 ] instead of [n*sum(x[i]^2) - sum(x[i])^2] 
is a safe alternative.

Unfortunately, this stable algorithm implies passing *two times* through 
the data, which bears a significant time penalty for large data sets or 
data sets on a network drive!!!

There are alternative *one-pass* algorithms that are more stable than 
the notoriously unstable algorithm (see on wikipedia, 

I am very supportive of the idea to implement *two sets* of functions:
 1.) the stable two-pass version (as currently named)
 2.) a *fast implementation* for every function mentioned above:
      - should use a fast one-pass algorithm
      - BUT more stable than the naive algorithm
      - should be named: STDEV...FAST()

D. RSQ()
As mentioned earlier, the RSQ() formula uses two (three) times the naive 
algorithm (AND uses both times the computed values as divisors). Without 
doing a formal calculation of the error terms, this algorithm feels 
*very, very unstable*.

Unfortunately, calculating the error of the coding algorithm even for 
something as simple as the standard deviation is NOT trivial, and is 
certainly ugly for the RSQ() function. I also  lost any contact with 
mathematics in 1995 and took a very different career. I would need to 
dig up old notes and search through books to be able to formally 
calculate the error term.

Though, I would think that rewriting the formula to use
  [ sum(x[i] - MEAN(x))^2 ] instead of [n*sum(x[i]^2) - sum(x[i])^2]
is a fairly safer alternative. Also, the numerator should be rewritten as:
  sum[ (MEAN(Y) - Ycalc) * (Ycalc + MEAN(Y) - 2*Y[i]) ]
(or else the same risk exists, that the computed r^2  is negative OR 
bigger then 1)

However, this implies a *significant time penalty*!

While computing the STDEV() and VAR() functions using this  approach 
implies a two-pass algorithm, the RSQ() will need a *three pass* algorithm:
 1.) first compute MEAN(X) and MEAN(Y)
 2.) then compute 'a' and 'b'
 3.) and ONLY then one is able to compute r^2 (as Ycalc is needed in 
this last step, AND Ycalc depends on 'a' and 'b')

This is really ugly. I would definitely recommend here a faster 
implementation, e.g. an RSQFAST() function, though I do NOT know the 
performance and stability of such an algorithm.

I will search around to see IF I find a better alternative.


Leonard Mada

Eike Rathke wrote:
> Hi Leonard,
> On Thursday, 2007-06-07 23:04:10 +0300, Leonard Mada wrote:
>> Unfortunately, the formulas/algorithms given for 'a' and 'b' seem to be 
>> computationally unstable
>> - they both use terms of the form n*sum(x^2) - (sum(x))^2 which are 
>> notoriously unstable
>> Although I did NOT do a formal calculation of the error term, having 
>> *two* occurrences of  this unstable calculation, and both times as 
>> *divisors* makes the formula very unreliable.
> Thanks for pointing out. Did you try it out in OOoCalc and noticed
> shortcomings?
>> - remove formula from specification
> I think that is not a good option because it would leave the definition
> without a mathematical description. Even if the given formula may
> numerically be instable at some point it is mathematically correct.
> However, we could add a note saying so.
>> - consult statistician knowledgeable of stable coding algorithms
> That of course would be favourable. As I also know from the OOo mailing
> lists that you're interested in that area, could you help us or are you
> in contact with one?
>   Eike

[Date Prev] | [Thread Prev] | [Thread Next] | [Date Next] -- [Date Index] | [Thread Index] | [List Home]