OASIS Mailing List ArchivesView the OASIS mailing list archive below
or browse/search using MarkMail.

# office-comment message

[Date Prev] | [Thread Prev] | [Thread Next] | [Date Next] -- [Date Index] | [Thread Index] | [List Home]

Subject: Re: [office-comment] RSQ - Unstable Algorithm

• To: office-comment@lists.oasis-open.org
• Date: Fri, 08 Jun 2007 21:58:26 +0300

```Hi,

I would like to discuss the following topics:
A. FUNCTIONS AFFECTED: STDEV...(), VAR...(), STEYX() and RSQ()
B. STABLE ALGORITHM: use 2-pass
C. ALTERNATIVE FUNCTIONS: implement alternative functions using fast
algorithms
D. RSQ(): special problem: three-pass algorithm

<!-- @page { size: 21.59cm 27.94cm; margin: 2cm } H1 { margin-top:
0.85cm; margin-bottom: 0.21cm; border-top: 1px solid #000000;
border-bottom: none; border-left: none; border-right: none; padding-top:
color: #000099; page-break-before: always } H1.western { font-family:
"Times New Roman", serif; font-size: 18pt } H1.cjk { font-family:
"Lucida Sans Unicode"; font-size: 18pt } H1.ctl { font-family: "Arial",
sans-serif; font-size: 16pt } P { margin-bottom: 0cm } H2 {
margin-bottom: 0.21cm; border: none; padding: 0cm; color: #000099;
page-break-before: auto; page-break-after: auto } H2.western {
font-family: "Times New Roman", serif; font-size: 14pt } H2.cjk {
font-family: "Lucida Sans Unicode"; font-size: 14pt } H2.ctl {
font-family: "Arial", sans-serif; font-size: 14pt; font-style: italic;
font-weight: medium } H3 { margin-bottom: 0.21cm; border: none; padding:
0cm; color: #000099; page-break-before: auto; page-break-after: auto }
H3.western { font-family: "Times New Roman", serif; font-size: 13pt }
H3.cjk { font-family: "Lucida Sans Unicode"; font-size: 13pt } H3.ctl {
font-family: "Arial", sans-serif; font-size: 13pt; font-style: italic }
--> A. FUNCTIONS AFFECTED
It seems that:
RSQ(),
STDEV(), STDEVA(), STDDEVP(), STDEVPA(),
STEYX(), and
VAR(), VARA(),  VARP(), VARPA()
suffer from the same shortcoming. They all mention the naive algorithm
(algorithms based on [n*sum(x[i]^2) - sum(x[i])^2] are inherently
unstable,  see paragraphs 6.16.69-73, and 6.16.79-82)

The *Note* to some of these functions mentions that naive algorithms
*run into trouble*, but I do not understand the reason for mentioning
the naive algorithm in the ODF specification then. [NO such note for the
RSQ(), and VAR...() functions!]

B. STABLE ALGORITHM
For the STDEV...() and VAR...() functions:
using [ sum(x[i] - MEAN(x))^2 ] instead of [n*sum(x[i]^2) - sum(x[i])^2]
is a safe alternative.

Unfortunately, this stable algorithm implies passing *two times* through
the data, which bears a significant time penalty for large data sets or
data sets on a network drive!!!

There are alternative *one-pass* algorithms that are more stable than
the notoriously unstable algorithm (see on wikipedia,
http://en.wikipedia.org/wiki/Correlation#Computing_correlation_accurately_in_a_single_pass).

C. ALTERNATIVE FUNCTIONS
I am very supportive of the idea to implement *two sets* of functions:
1.) the stable two-pass version (as currently named)
2.) a *fast implementation* for every function mentioned above:
- should use a fast one-pass algorithm
- BUT more stable than the naive algorithm
- should be named: STDEV...FAST()

D. RSQ()
As mentioned earlier, the RSQ() formula uses two (three) times the naive
algorithm (AND uses both times the computed values as divisors). Without
doing a formal calculation of the error terms, this algorithm feels
*very, very unstable*.

Unfortunately, calculating the error of the coding algorithm even for
something as simple as the standard deviation is NOT trivial, and is
certainly ugly for the RSQ() function. I also  lost any contact with
mathematics in 1995 and took a very different career. I would need to
dig up old notes and search through books to be able to formally
calculate the error term.

Though, I would think that rewriting the formula to use
[ sum(x[i] - MEAN(x))^2 ] instead of [n*sum(x[i]^2) - sum(x[i])^2]
is a fairly safer alternative. Also, the numerator should be rewritten as:
sum[ (MEAN(Y) - Ycalc) * (Ycalc + MEAN(Y) - 2*Y[i]) ]
(or else the same risk exists, that the computed r^2  is negative OR
bigger then 1)

However, this implies a *significant time penalty*!

While computing the STDEV() and VAR() functions using this  approach
implies a two-pass algorithm, the RSQ() will need a *three pass* algorithm:
1.) first compute MEAN(X) and MEAN(Y)
2.) then compute 'a' and 'b'
3.) and ONLY then one is able to compute r^2 (as Ycalc is needed in
this last step, AND Ycalc depends on 'a' and 'b')

This is really ugly. I would definitely recommend here a faster
implementation, e.g. an RSQFAST() function, though I do NOT know the
performance and stability of such an algorithm.

I will search around to see IF I find a better alternative.

Sincerely,

Eike Rathke wrote:
> Hi Leonard,
>
> On Thursday, 2007-06-07 23:04:10 +0300, Leonard Mada wrote:
>
>
>> Unfortunately, the formulas/algorithms given for 'a' and 'b' seem to be
>> computationally unstable
>> - they both use terms of the form n*sum(x^2) - (sum(x))^2 which are
>> notoriously unstable
>>
>> Although I did NOT do a formal calculation of the error term, having
>> *two* occurrences of  this unstable calculation, and both times as
>> *divisors* makes the formula very unreliable.
>>
>
> Thanks for pointing out. Did you try it out in OOoCalc and noticed
> shortcomings?
>
>
>> SUGGESTIONS:
>> - remove formula from specification
>>
>
> I think that is not a good option because it would leave the definition
> without a mathematical description. Even if the given formula may
> numerically be instable at some point it is mathematically correct.
> However, we could add a note saying so.
>
>
>> - consult statistician knowledgeable of stable coding algorithms
>>
>
> That of course would be favourable. As I also know from the OOo mailing
> lists that you're interested in that area, could you help us or are you
> in contact with one?
>
>   Eike
```

[Date Prev] | [Thread Prev] | [Thread Next] | [Date Next] -- [Date Index] | [Thread Index] | [List Home]