office-comment message

Subject: Re: [office-comment] RSQ - Unstable Algorithm
From: robert_weir@us.ibm.com
To: Leonard Mada <discoleo@gmx.net>
Date: Fri, 8 Jun 2007 15:44:35 -0400
In the end we should be defining the what the result of a function should 
be, not mandating an algorithm.  We need to give the flexibility for 
multiprocessor or multi-core systems to optimize for their unique 
capabilities.

So, we could say, this function shall give the same answer as this 
mathematical expression evaluates to, within some precision level.  We 
specify something testable.

But point is well taken about the need for algorithms that preserve 
precision.  We should get info like this into an implementor's guide.


Thanks as well for the comments on the string processing.   The formula 
language you're looking at now is a distillation of what and how 
spreadsheets do today, which we derived by looking and the several 
commercial and open source examples available today.  This is what users 
of spreadsheets are familiar with.

At some point I'd like to do some experimentation with an entirely new 
spreadsheet formula language, designed from the ground up to be more 
powerful, with features like metrical units (1 meter plus 10 centimeters = 
1.1 meters),  formal type checking, more vectorization operations, etc. 
I'd do this in its own namespace, so as not to cause confusion with the 
official ODF formula language.  But if enough vendors were interested, 
this could be added to some future version of ODF as an alternative 
formula language.

-Rob
___________________________

Rob Weir
Software Architect
Workplace, Portal and Collaboration Software
IBM Software Group

email: robert_weir@us.ibm.com
phone: 1-978-399-7122
blog: http://www.robweir.com/blog/

Leonard Mada <discoleo@gmx.net> wrote on 06/08/2007 02:58:26 PM:

> Hi,
> 
> I would like to discuss the following topics:
> A. FUNCTIONS AFFECTED: STDEV...(), VAR...(), STEYX() and RSQ()
> B. STABLE ALGORITHM: use 2-pass
> C. ALTERNATIVE FUNCTIONS: implement alternative functions using fast 
> algorithms
> D. RSQ(): special problem: three-pass algorithm
> 
> 
> <!-- @page { size: 21.59cm 27.94cm; margin: 2cm } H1 { margin-top: 

> 0.85cm; margin-bottom: 0.21cm; border-top: 1px solid #000000; 

> border-bottom: none; border-left: none; border-right: none; padding-
> top: 
0.21cm; padding-bottom: 0cm; padding-left: 0cm; padding-right:
> 0cm; 
color: #000099; page-break-before: always } H1.western { font-
> family: 
"Times New Roman", serif; font-size: 18pt } H1.cjk { font-
> family: 
"Lucida Sans Unicode"; font-size: 18pt } H1.ctl { font-
> family: "Arial", 
sans-serif; font-size: 16pt } P { margin-bottom: 
> 0cm } H2 { 
margin-bottom: 0.21cm; border: none; padding: 0cm; 
> color: #000099; 
page-break-before: auto; page-break-after: auto } 
> H2.western { 
font-family: "Times New Roman", serif; font-size: 14pt
> } H2.cjk { 
font-family: "Lucida Sans Unicode"; font-size: 14pt } 
> H2.ctl { 
font-family: "Arial", sans-serif; font-size: 14pt; font-
> style: italic; 
font-weight: medium } H3 { margin-bottom: 0.21cm; 
> border: none; padding: 
0cm; color: #000099; page-break-before: 
> auto; page-break-after: auto } 
H3.western { font-family: "Times New
> Roman", serif; font-size: 13pt } 
H3.cjk { font-family: "Lucida Sans
> Unicode"; font-size: 13pt } H3.ctl { 
font-family: "Arial", sans-
> serif; font-size: 13pt; font-style: italic } 
--> A. FUNCTIONS AFFECTED
> It seems that:
>   RSQ(),
>   STDEV(), STDEVA(), STDDEVP(), STDEVPA(),
>   STEYX(), and
>   VAR(), VARA(),  VARP(), VARPA()
> suffer from the same shortcoming. They all mention the naive algorithm 
> (algorithms based on [n*sum(x[i]^2) - sum(x[i])^2] are inherently 
> unstable,  see paragraphs 6.16.69-73, and 6.16.79-82)
> 
> The *Note* to some of these functions mentions that naive algorithms 
> *run into trouble*, but I do not understand the reason for mentioning 
> the naive algorithm in the ODF specification then. [NO such note for the 

> RSQ(), and VAR...() functions!]
> 
> 
> B. STABLE ALGORITHM
> For the STDEV...() and VAR...() functions:
> using [ sum(x[i] - MEAN(x))^2 ] instead of [n*sum(x[i]^2) - sum(x[i])^2] 

> is a safe alternative.
> 
> Unfortunately, this stable algorithm implies passing *two times* through 

> the data, which bears a significant time penalty for large data sets or 
> data sets on a network drive!!!
> 
> There are alternative *one-pass* algorithms that are more stable than 
> the notoriously unstable algorithm (see on wikipedia, 
> http://en.wikipedia.
> org/wiki/Correlation#Computing_correlation_accurately_in_a_single_pass).
> 
> 
> C. ALTERNATIVE FUNCTIONS
> I am very supportive of the idea to implement *two sets* of functions:
>  1.) the stable two-pass version (as currently named)
>  2.) a *fast implementation* for every function mentioned above:
>       - should use a fast one-pass algorithm
>       - BUT more stable than the naive algorithm
>       - should be named: STDEV...FAST()
> 
> 
> D. RSQ()
> As mentioned earlier, the RSQ() formula uses two (three) times the naive 

> algorithm (AND uses both times the computed values as divisors). Without 

> doing a formal calculation of the error terms, this algorithm feels 
> *very, very unstable*.
> 
> Unfortunately, calculating the error of the coding algorithm even for 
> something as simple as the standard deviation is NOT trivial, and is 
> certainly ugly for the RSQ() function. I also  lost any contact with 
> mathematics in 1995 and took a very different career. I would need to 
> dig up old notes and search through books to be able to formally 
> calculate the error term.
> 
> Though, I would think that rewriting the formula to use
>   [ sum(x[i] - MEAN(x))^2 ] instead of [n*sum(x[i]^2) - sum(x[i])^2]
> is a fairly safer alternative. Also, the numerator should be rewritten 
as:
>   sum[ (MEAN(Y) - Ycalc) * (Ycalc + MEAN(Y) - 2*Y[i]) ]
> (or else the same risk exists, that the computed r^2  is negative OR 
> bigger then 1)
> 
> However, this implies a *significant time penalty*!
> 
> While computing the STDEV() and VAR() functions using this  approach 
> implies a two-pass algorithm, the RSQ() will need a *three pass* 
algorithm:
>  1.) first compute MEAN(X) and MEAN(Y)
>  2.) then compute 'a' and 'b'
>  3.) and ONLY then one is able to compute r^2 (as Ycalc is needed in 
> this last step, AND Ycalc depends on 'a' and 'b')
> 
> This is really ugly. I would definitely recommend here a faster 
> implementation, e.g. an RSQFAST() function, though I do NOT know the 
> performance and stability of such an algorithm.
> 
> I will search around to see IF I find a better alternative.
> 
> Sincerely,
> 
> Leonard Mada
> 
> 
> 
> Eike Rathke wrote:
> > Hi Leonard,
> >
> > On Thursday, 2007-06-07 23:04:10 +0300, Leonard Mada wrote:
> >
> > 
> >> Unfortunately, the formulas/algorithms given for 'a' and 'b' seem to 
be 
> >> computationally unstable
> >> - they both use terms of the form n*sum(x^2) - (sum(x))^2 which are 
> >> notoriously unstable
> >>
> >> Although I did NOT do a formal calculation of the error term, having 
> >> *two* occurrences of  this unstable calculation, and both times as 
> >> *divisors* makes the formula very unreliable.
> >> 
> >
> > Thanks for pointing out. Did you try it out in OOoCalc and noticed
> > shortcomings?
> >
> > 
> >> SUGGESTIONS:
> >> - remove formula from specification
> >> 
> >
> > I think that is not a good option because it would leave the 
definition
> > without a mathematical description. Even if the given formula may
> > numerically be instable at some point it is mathematically correct.
> > However, we could add a note saying so.
> >
> > 
> >> - consult statistician knowledgeable of stable coding algorithms
> >> 
> >
> > That of course would be favourable. As I also know from the OOo 
mailing
> > lists that you're interested in that area, could you help us or are 
you
> > in contact with one?
> >
> >   Eike
> 
> This publicly archived list offers a means to provide input to the
> OASIS Open Document Format for Office Applications (OpenDocument) TC.
> 
> In order to verify user consent to the Feedback License terms and
> to minimize spam in the list archive, subscription is required
> before posting.
> 
> Subscribe: office-comment-subscribe@lists.oasis-open.org
> Unsubscribe: office-comment-unsubscribe@lists.oasis-open.org
> List help: office-comment-help@lists.oasis-open.org
> List archive: http://lists.oasis-open.org/archives/office-comment/
> Feedback License: http://www.oasis-open.org/who/ipr/feedback_license.pdf
> List Guidelines: http://www.oasis-open.org/maillists/guidelines.php
> Committee: 
http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=office
>
References:
- Re: [office-comment] RSQ - Unstable Algorithm
  - From: Leonard Mada <discoleo@gmx.net>