[Date Prev] | [Thread Prev] | [Thread Next] | [Date Next] -- [Date Index] | [Thread Index] | [List Home]

*Subject*: **Re: [office-comment] RSQ - Unstable Algorithm**

*From*:**robert_weir@us.ibm.com***To*: Leonard Mada <discoleo@gmx.net>*Date*: Fri, 8 Jun 2007 15:44:35 -0400

In the end we should be defining the what the result of a function should be, not mandating an algorithm. We need to give the flexibility for multiprocessor or multi-core systems to optimize for their unique capabilities. So, we could say, this function shall give the same answer as this mathematical expression evaluates to, within some precision level. We specify something testable. But point is well taken about the need for algorithms that preserve precision. We should get info like this into an implementor's guide. Thanks as well for the comments on the string processing. The formula language you're looking at now is a distillation of what and how spreadsheets do today, which we derived by looking and the several commercial and open source examples available today. This is what users of spreadsheets are familiar with. At some point I'd like to do some experimentation with an entirely new spreadsheet formula language, designed from the ground up to be more powerful, with features like metrical units (1 meter plus 10 centimeters = 1.1 meters), formal type checking, more vectorization operations, etc. I'd do this in its own namespace, so as not to cause confusion with the official ODF formula language. But if enough vendors were interested, this could be added to some future version of ODF as an alternative formula language. -Rob ___________________________ Rob Weir Software Architect Workplace, Portal and Collaboration Software IBM Software Group email: robert_weir@us.ibm.com phone: 1-978-399-7122 blog: http://www.robweir.com/blog/ Leonard Mada <discoleo@gmx.net> wrote on 06/08/2007 02:58:26 PM: > Hi, > > I would like to discuss the following topics: > A. FUNCTIONS AFFECTED: STDEV...(), VAR...(), STEYX() and RSQ() > B. STABLE ALGORITHM: use 2-pass > C. ALTERNATIVE FUNCTIONS: implement alternative functions using fast > algorithms > D. RSQ(): special problem: three-pass algorithm > > > <!-- @page { size: 21.59cm 27.94cm; margin: 2cm } H1 { margin-top: > 0.85cm; margin-bottom: 0.21cm; border-top: 1px solid #000000; > border-bottom: none; border-left: none; border-right: none; padding- > top: 0.21cm; padding-bottom: 0cm; padding-left: 0cm; padding-right: > 0cm; color: #000099; page-break-before: always } H1.western { font- > family: "Times New Roman", serif; font-size: 18pt } H1.cjk { font- > family: "Lucida Sans Unicode"; font-size: 18pt } H1.ctl { font- > family: "Arial", sans-serif; font-size: 16pt } P { margin-bottom: > 0cm } H2 { margin-bottom: 0.21cm; border: none; padding: 0cm; > color: #000099; page-break-before: auto; page-break-after: auto } > H2.western { font-family: "Times New Roman", serif; font-size: 14pt > } H2.cjk { font-family: "Lucida Sans Unicode"; font-size: 14pt } > H2.ctl { font-family: "Arial", sans-serif; font-size: 14pt; font- > style: italic; font-weight: medium } H3 { margin-bottom: 0.21cm; > border: none; padding: 0cm; color: #000099; page-break-before: > auto; page-break-after: auto } H3.western { font-family: "Times New > Roman", serif; font-size: 13pt } H3.cjk { font-family: "Lucida Sans > Unicode"; font-size: 13pt } H3.ctl { font-family: "Arial", sans- > serif; font-size: 13pt; font-style: italic } --> A. FUNCTIONS AFFECTED > It seems that: > RSQ(), > STDEV(), STDEVA(), STDDEVP(), STDEVPA(), > STEYX(), and > VAR(), VARA(), VARP(), VARPA() > suffer from the same shortcoming. They all mention the naive algorithm > (algorithms based on [n*sum(x[i]^2) - sum(x[i])^2] are inherently > unstable, see paragraphs 6.16.69-73, and 6.16.79-82) > > The *Note* to some of these functions mentions that naive algorithms > *run into trouble*, but I do not understand the reason for mentioning > the naive algorithm in the ODF specification then. [NO such note for the > RSQ(), and VAR...() functions!] > > > B. STABLE ALGORITHM > For the STDEV...() and VAR...() functions: > using [ sum(x[i] - MEAN(x))^2 ] instead of [n*sum(x[i]^2) - sum(x[i])^2] > is a safe alternative. > > Unfortunately, this stable algorithm implies passing *two times* through > the data, which bears a significant time penalty for large data sets or > data sets on a network drive!!! > > There are alternative *one-pass* algorithms that are more stable than > the notoriously unstable algorithm (see on wikipedia, > http://en.wikipedia. > org/wiki/Correlation#Computing_correlation_accurately_in_a_single_pass). > > > C. ALTERNATIVE FUNCTIONS > I am very supportive of the idea to implement *two sets* of functions: > 1.) the stable two-pass version (as currently named) > 2.) a *fast implementation* for every function mentioned above: > - should use a fast one-pass algorithm > - BUT more stable than the naive algorithm > - should be named: STDEV...FAST() > > > D. RSQ() > As mentioned earlier, the RSQ() formula uses two (three) times the naive > algorithm (AND uses both times the computed values as divisors). Without > doing a formal calculation of the error terms, this algorithm feels > *very, very unstable*. > > Unfortunately, calculating the error of the coding algorithm even for > something as simple as the standard deviation is NOT trivial, and is > certainly ugly for the RSQ() function. I also lost any contact with > mathematics in 1995 and took a very different career. I would need to > dig up old notes and search through books to be able to formally > calculate the error term. > > Though, I would think that rewriting the formula to use > [ sum(x[i] - MEAN(x))^2 ] instead of [n*sum(x[i]^2) - sum(x[i])^2] > is a fairly safer alternative. Also, the numerator should be rewritten as: > sum[ (MEAN(Y) - Ycalc) * (Ycalc + MEAN(Y) - 2*Y[i]) ] > (or else the same risk exists, that the computed r^2 is negative OR > bigger then 1) > > However, this implies a *significant time penalty*! > > While computing the STDEV() and VAR() functions using this approach > implies a two-pass algorithm, the RSQ() will need a *three pass* algorithm: > 1.) first compute MEAN(X) and MEAN(Y) > 2.) then compute 'a' and 'b' > 3.) and ONLY then one is able to compute r^2 (as Ycalc is needed in > this last step, AND Ycalc depends on 'a' and 'b') > > This is really ugly. I would definitely recommend here a faster > implementation, e.g. an RSQFAST() function, though I do NOT know the > performance and stability of such an algorithm. > > I will search around to see IF I find a better alternative. > > Sincerely, > > Leonard Mada > > > > Eike Rathke wrote: > > Hi Leonard, > > > > On Thursday, 2007-06-07 23:04:10 +0300, Leonard Mada wrote: > > > > > >> Unfortunately, the formulas/algorithms given for 'a' and 'b' seem to be > >> computationally unstable > >> - they both use terms of the form n*sum(x^2) - (sum(x))^2 which are > >> notoriously unstable > >> > >> Although I did NOT do a formal calculation of the error term, having > >> *two* occurrences of this unstable calculation, and both times as > >> *divisors* makes the formula very unreliable. > >> > > > > Thanks for pointing out. Did you try it out in OOoCalc and noticed > > shortcomings? > > > > > >> SUGGESTIONS: > >> - remove formula from specification > >> > > > > I think that is not a good option because it would leave the definition > > without a mathematical description. Even if the given formula may > > numerically be instable at some point it is mathematically correct. > > However, we could add a note saying so. > > > > > >> - consult statistician knowledgeable of stable coding algorithms > >> > > > > That of course would be favourable. As I also know from the OOo mailing > > lists that you're interested in that area, could you help us or are you > > in contact with one? > > > > Eike > > This publicly archived list offers a means to provide input to the > OASIS Open Document Format for Office Applications (OpenDocument) TC. > > In order to verify user consent to the Feedback License terms and > to minimize spam in the list archive, subscription is required > before posting. > > Subscribe: office-comment-subscribe@lists.oasis-open.org > Unsubscribe: office-comment-unsubscribe@lists.oasis-open.org > List help: office-comment-help@lists.oasis-open.org > List archive: http://lists.oasis-open.org/archives/office-comment/ > Feedback License: http://www.oasis-open.org/who/ipr/feedback_license.pdf > List Guidelines: http://www.oasis-open.org/maillists/guidelines.php > Committee: http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=office >

**References**:**Re: [office-comment] RSQ - Unstable Algorithm***From:*Leonard Mada <discoleo@gmx.net>

[Date Prev] | [Thread Prev] | [Thread Next] | [Date Next] -- [Date Index] | [Thread Index] | [List Home]