
*Subject*: **Re: [office-comment] Demand for modification of ODF file format about regression curve in spreadsheet**

*From*: **Leonard Mada <discoleo@gmx.net>**

*To*: office-comment@lists.oasis-open.org

*Date*: Tue, 09 Dec 2008 00:08:02 +0200

Dear Patrick,

I indeed strongly suggest using a dedicated program for every regression. This means 2 things:

1.) use the dedicated language [maybe xml-ized for ODF-compatibility]
2.) build a bridge/interface to the program itself

I will explain briefly why I find both issues relevant.

1.) Use the language

The S+ language is a mature language. There might exist alternatives like in Octave, Mathematica, (...), but I am most accustomed to R (S+), and this language also seems most suitable for spreadsheet users. The other languages are more mathematically oriented. [Though ODF 3+ might implement multiple language support. ;-) ]

TeX already allows embedding foreign code (to my knowledge), and there is even an R plugin to process R-code embedded in an ODF-stream (I think implemented at Novartis). This is naturally a hack, in the sense that an ODF-aware application won't process that code; instead, R itself will "process" the ODF-stream and replace the code with the output (actually skipping the true ODF).

However, I believe that true open source means reusing existing facilities. In this respect, ODF should reuse the S+ language when useful within ODF. And I certainly think that S+ is more powerful than ODF in every domain related to statistics (including regressions).

[IMPLEMENTATION DETAILS - there might exist better alternatives]

What is needed is either to xml-ize the R-code and store it in the ODF, or indeed store it as a specific code-block within the ODF. I actually think that the R-code (S+ commands) can be easily xml-ized and stored as true XML within the ODF stream (especially some parts like formulas for regressions, which are actually mathematical functions).

One of my previous messages might also be interesting (although in a different context):

**Interface META-Functions**
http://lists.oasis-open.org/archives/office-comment/200706/msg00019.html

2.) Use R instead of internal regression engine

This has 2 major advantages:

a.) R is more accurate and versatile, offering many more possibilities and finer control
b.) R offers much more information

a.) Versatile: This does not really need to be discussed. I hope my previous example already showed the effects. Regarding accuracy, well, peeking through the code you will easily see that numerical analysts have been at work (please read this short message: http://sc.openoffice.org/servlets/ReadMsg?listName=dev&msgNo=3161).

b.) A nice consequence of my previous example was the fact that I initially used a slightly wrong formula:

    a * x / (x*x + b * x + c)

But then, beyond the coefficients a, b and c, R also calculated some statistics, which allowed me to drop the term (b * x) from the model and recompute the coefficients.

Actually, a mathematical engine will (almost) always compute something. The real problem is IF that something makes sense. You do NOT get this information using a spreadsheet. [This is one of the main reasons I do not use regressions in spreadsheets.]

[On a side note, IF the coefficient 'a' was very close to 0, R would have marked 'a' as statistically insignificant, but 'b' as significant. Since 'a' cannot be insignificant, that would mean that the model is flawed and is NOT well represented by that equation - independent of the actually computed value for 'a', e.g.:

    x <- rnorm(1000)   # x as in my previous example
    y <- rnorm(1000) + 3*10^-16 * x / (x*x+1)
    x.nls<-nls(y~a*x / (x*x+b*x+c), start=list(a=1,b=0,c=1))
    summary(x.nls)

> Formula: y ~ a * x/(x * x + b * x + c)
>
> Parameters:
>     Estimate  Std. Error t value Pr(>|t|)
> a -0.0006002   0.0007155  -0.839    0.402
> b  2.0704560   0.0305020  67.879   <2e-16 ***
> c  1.0724880   0.0314753  34.074   <2e-16 ***
> ---
> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The 'actual' results depend much on the random numbers, but such a case is strongly suggestive that our model is completely bogus, although the mathematical engine has computed a value for a, b and c. The variability in y is only insignificantly influenced by a mathematical term of the form a*x / (x*x + b * x + c), being largely given by the variability in the random numbers - rnorm(1000).]

I hope this helps to explain some of my requests and previous comments.

Sincerely,

Leonard

Patrick Durusau wrote:
> Leonard,
>
> When you say:
>
>> The idea is:
>> I want a mechanism to specify the formula used in the regression.
>> Instead of storing a formula name, it would be wiser to store the
>> formula itself. This way, one can easily build *complex models* and
>> *multivariate models* (more than one variable). This is currently
>> not-possible and ODF lags behind professional packages in every
>> respect (well, Excel fares poor in this respect, too, but then you
>> shouldn't look at Excel when doing regressions).
> Would you suggest that we use R or something similar as the language
> for such models? (I have utterly no position one way or the other but
> would like to see us avoid having to define a language for such
> purposes and then seek implementers for it.)
>
> What would that mean in your experience for interchange?
>
> I know of R by the name but don't know its history or the level of
> support for various versions.
>
> Would this be a situation where the results of a model would be stored
> in case the document was processed by an application that lacked R
> support (assuming we chose that as the language)?
>
> Hope you are having a great day!
>
> Patrick
>
> Leonard Mada wrote:
>> Dear Laurent,
>>
>> I miss some frequently encountered regression types.
>>
>> The most frequent regression type on binary outcome variables is a
>> logistic regression. I therefore miss this one.
>>
>> However, what wonders me most, is the number of regression types
>> used. Well, to state it differently, there is a specific name for
>> every new regression type.
>>
>> There is a better alternative, and this alternative is already
>> implemented in the S+ language and in the open source R program. It
>> basically allows the user to specify the formula for the regression.
>>
>> There are basically 3 regression models:
>>
>> A.) Linear regression
>>   - formulas of type: y = intercept + a1 * X1 + a2 * X2 + a3 * X3 + ...
>>   - as seen, ODF doesn't permit a multivariate formula either,
>>     i.e. X1, X2, X3, ... are different variables
>>
>> B.) Generalized linear models
>>   - formulas differ slightly, but in the case of a logistic regression:
>>     p(y) = 1 / (1 + 1/exp(intercept + a1 * X1 + a2 * X2 + a3 * X3 + ...) )
>>     where y is a binary variable and p(y) the probability of y
>>
>> C.) Non-Linear models
>>   - this is the most interesting
>>   - it allows specifying the formula for the regression
>>   - e.g. lets say we want to determine the coefficients a & b for:
>>     a * x / (x*x + b)
>>     in R, this looks like:
>>     model.nls <- nls( y ~ a*x / (x*x + b), start=list(a=1, b=1))
>>     where y is the outcome and x is the variable
>>
>> As a practical example:
>> [You can copy / paste this in R]
>> x <- rnorm(1000) # generate 1,000 random numbers
>> y <- rnorm(1000) + rnorm(1) * x / (x*x+1)
>> x.nls<-nls(y~a*x / (x*x+b*x+c), start=list(a=1,b=0,c=1))
>> summary(x.nls)
>>
>>> Formula: y ~ a * x/(x * x + b * x + c)
>>>
>>> Parameters:
>>>   Estimate Std. Error t value Pr(>|t|)
>>> a -1.83005    0.29605  -6.182 9.24e-10 ***
>>> b -0.04774    0.14024  -0.340    0.734
>>> c  1.40155    0.35845   3.910 9.85e-05 ***
>>> ---
>>> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>>
>> We see, "b" is statistically non-significant and we can remove it
>> from the model (giving us then a * x / (x*x + c); we can rerun the
>> regression using this formula to obtain a better result ).
>>
>> The idea is:
>> I want a mechanism to specify the formula used in the regression.
>> Instead of storing a formula name, it would be wiser to store the
>> formula itself. This way, one can easily build *complex models* and
>> *multivariate models* (more than one variable). This is currently
>> not-possible and ODF lags behind professional packages in every
>> respect (well, Excel fares poor in this respect, too, but then you
>> shouldn't look at Excel when doing regressions).
>>
>> Sincerely,
>>
>> Leonard
>>
>>
>> Laurent BALLAND-POIRIER wrote:
>>> Dear TC Members,
>>>
>>> Please find enclosed a file format modification demand that Ingrid
>>> Halama and me wrote. It is about regression curves in spreadsheet. Some
>>> data are missing in ODF to get compatibility with other spreadsheets
>>> such as MS-Excel or Gnumeric. Numerous issues will not be solved till
>>> these data can not be saved.
>>> I hope I post in the right place. If not, please explain where to send
>>> this demand.
>>>
>>> Best regards,
>>>
>>> Laurent BP
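
The refit step mentioned in the thread (dropping a non-significant `b` and rerunning the regression with `a * x / (x*x + c)`) could be sketched in R roughly as follows. This is only an illustration under stated assumptions, not the run from the message: the fixed seed, the true coefficient `a = 2`, the reduced noise level of 0.2, and the object names `full` and `reduced` are all chosen here for repeatability.

```r
# Sketch only: seed, true coefficients and noise level are assumptions,
# not values taken from the original message.
set.seed(42)
x <- rnorm(1000)                              # predictor, as in the thread
y <- 0.2 * rnorm(1000) + 2 * x / (x*x + 1)    # outcome: true a = 2, c = 1, no b term

# Full model, including the (b * x) term that is suspected to be superfluous
full <- nls(y ~ a*x / (x*x + b*x + c), start = list(a = 1, b = 0, c = 1))

# Reduced model, after dropping b
reduced <- nls(y ~ a*x / (x*x + c), start = list(a = 1, c = 1))

print(summary(full))          # per-coefficient estimates, standard errors, p-values
print(summary(reduced))
print(anova(reduced, full))   # F-test: does adding b improve the fit?
```

The point of the comparison is the one made in the message: the engine returns coefficients for either formula, and only the accompanying statistics indicate whether a term such as `b` belongs in the model.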

**Follow-Ups**:

**Re: [office-comment] Demand for modification of ODF file format about regression curve in spreadsheet** *From:* Bryce L Nordgren <bnordgren@fs.fed.us>

**References**:

**Demand for modification of ODF file format about regression curve in spreadsheet** *From:* Laurent BALLAND-POIRIER <Laurent.Balland-Poirier@laposte.net>

**Re: [office-comment] Demand for modification of ODF file format about regression curve in spreadsheet** *From:* Leonard Mada <discoleo@gmx.net>

**Re: [office-comment] Demand for modification of ODF file format about regression curve in spreadsheet** *From:* Patrick Durusau <patrick@durusau.net>
