[Date Prev] | [Thread Prev] | [Thread Next] | [Date Next] -- [Date Index] | [Thread Index] | [List Home]

*Subject*: **Re: [office-comment] Demand for modification of ODF file format aboutregression curve in spreadsheet**

*From*:**Bryce L Nordgren <bnordgren@fs.fed.us>***To*: Leonard Mada <discoleo@gmx.net>*Date*: Mon, 8 Dec 2008 17:13:11 -0700

Would it be fair to observe that this change request highlights a shortcoming of the spreadsheet model itself? Not the ODF spreadsheet model alone, mind you, but spreadsheets in general? In particular, spreadsheets really have no mechanism to handle analytical equations at all. They deal primarily with actual data values and not equations. And so they enumerate a handful of the most common models which can be applied to the data. The fact that cells may be assigned formulas is somewhat misleading, as there really is no such thing as a "variable". A cell reference is just not the same thing. Nearly all math apps approach this problem with either a strong data-oriented focus (Matlab, IDL, ...) or a strong analytical focus (Mathematica, ...). There is always a need for both and so there is always some form of cross-pollenation. For instance, when data-driven programs fit data to a model, they typically have no vocabulary to express the model they are fitting; and so return the fitted parameters as a vector or array. The spreadsheet document has a strong data-oriented focus. It has no vocabulary for expressing analytical equations, nor facilities for manipulating them. If you are thinking of XML-izing something like S+ to add a strong analytical focus to the ODF suite of data content models, I would humbly suggest defining another document type. Please note that there is already a "space" for it. Annex C of v1.1 already mentions the extension *.odf and the MIME type application/vnd.oasis.opendocument.formula. Yet the content model for such a file is unspecified. So define the content model. It is obviously intended to support some form of mathematically-oriented document. But take a minute to step back and really think about what you need. If you do this with S+ (or IDL, or Matlab or Mathematica or any of them) you are in effect defining an orthodox scripting language for all applications which edit ODF files. These languages are fully fleged procedural languages with their own standard libraries and their own mechanisms for including code not defined locally. Would an augmentation of the <office:script/> element suffice? Specify how a spreadsheet cell (or arbitrary other thing) would trigger execution of code within a specific script element, and how a script object declares its parameters and return values. Specify how data types are handled (e.g., provide a common core that is currently comprehensible to the spreadsheet cell element, provide a means for the script to declare new data types that it understands...). Keep it language neutral and just define how you shove data in and how you accept data from the Big Black Box. You'd also need to define how externally specified code is included. The contents of the ODF file would need to contain the source code, and also a CORBA/Interface Definition Language/other description of the <script/> element's contents. Scripts aren't really all that useful if you cannot express how they interact with the data in the rest of the document. They're also dangerous to rely on: you can't ensure that the person you're sending the document to will have the right language plugin, or that it will even be available for their machine/app. I think both approaches have merits and pitfalls. If you choose the route of selecting an orthodox scripting language, all conforming implementations should be interoperable. Augmenting the script element encourages people to use "their favorite package", which may mean that you may not be able to recompute the regression line if you received the spreadsheet from someone who has a different pet package (either free or commercial), even if YOUR package has a routine which does the same thing (with a slightly different name/parameter list). Bryce Leonard Mada <discoleo@gmx.net> wrote on 12/08/2008 03:08:02 PM: > Dear Patrick, > > I indeed strongly suggest to use a dedicated program for every > regression. This means 2 things: > 1.) use the dedicated language > [maybe xml-ized for ODF-compatibility] > 2.) build a bridge/interface to the program itself > > I will explain shortly why I find both issues relevant. > > 1.) Use the language > The S+ language is a mature language. There might exist alternatives > like in Octave, Mathematica, (...), but I am most accustomed to R (S+), > and this language seems also most suitable for spreadsheet users. The > other languages are more mathematically oriented. [Though ODF 3+ might > implement multiple language support. ;-) ] > > Tex already allows implementing foreign code (to my knowledge), and > there is even an R plugin to process R-code embedded in an ODF-stream (I > think implemented at Novartis), but this is naturally a hack in the > sense that an ODF-aware application won't process that code, but R > itself will "process" the ODF-stream and replace the code with the > output (skipping actually the true ODF). > > However, I believe that true open source means reusing existing > facilities. In this respect, ODF should reuse the S+ language when > useful within ODF. And I certainly think that S+ is more powerful than > ODF in every domain related to statistics (including regressions). > > [IMPLEMENTATION DETAILS - there might exist better alternatives] > What is needed is either to xml-ize the R-code and store it in the ODF, > or indeed store it as a specific code-block within the ODF. I actually > think that the R-code (S+ commands) can be easily xml-ized and stored as > true XML within the ODF stream (especially some parts like formulas for > regressions, which is actually a mathematical function). > > One of my previous messages might also be interesting (although in a > different context): > **Interface META-Functions** > http://lists.oasis-open.org/archives/office-comment/200706/msg00019.html > > 2.) Use R instead of internal regression engine > This has 2 major advantages: > a.) R is more accurate and versatile, offering many more possibilities > and finer control > b.) R offers much more informations > > a.) Versatile: Does not need to be discussed really. I hope my previous > example showed already the effects. Regarding accuracy, well, peeking > through the code you will easily see that numerical analysts have been > at work (please read this short message: > http://sc.openoffice.org/servlets/ReadMsg?listName=dev&msgNo=3161). > > b.) A nice consequence of my previous example was the fact that I used > initially a slightly wrong formula: > a * x / (x*x + b * x + c) > > But then, beyond the coefficients a, b and c, R calculated also some > statistics, which allowed me to drop the term (b* x) from the model, and > recompute the coefficients. > > Actually, a mathematical engine will (almost) always compute something. > The real problem is, IF that something makes sense. > > You do NOT get this information using a spreadsheet. [This is one of the > main reasons I do not use regressions in spreadsheets.] > [On a side note, IF the coefficient 'a' was very close to 0, R would > have marked 'a' as statistically insignificant, but 'b' as significant. > Since 'a cannot be insignificant, that would mean that the model is > flawed and is NOT well represented by that equation - independent of the > actually computed value for 'a', e.g.: > y <- rnorm(1000) + 3*10^-16 * x / (x*x+1) > x.nls<-nls(y~a*x / (x*x+b*x+c), start=list(a=1,b=0,c=1)) > summary(x.nls) > > Formula: y ~ a * x/(x * x + b * x + c) > > > > Parameters: > > Estimate Std. Error t value Pr(>|t|) > > a -0.0006002 0.0007155 -0.839 0.402 > > b 2.0704560 0.0305020 67.879 <2e-16 *** > > c 1.0724880 0.0314753 34.074 <2e-16 *** > > --- > > Signif. codes: 0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1 > The 'actual' results depend much on the random numbers, but such a case > is strongly suggestive that our model is completely bogus, although the > mathematical engine has computed a value for a, b and c. The variability > in y is only insignificantly influenced by a mathematical term of the > form a*x / (x*x + b * x + c), being largely given by the variability in > the random numbers - rnorm(1000).] > > I hope this helps to explain some of my requests and previous comments. > > Sincerely, > > Leonard > > > Patrick Durusau wrote: > > Leonard, > > > > When you say: > > > >> The idea is: > >> I want a mechanism to specify the formula used in the regression. > >> Instead of storing a formula name, it would be wiser to store the > >> formula itself. This way, one can easily build *complex models* and > >> *multivariate models* (more than one variable). This is currently > >> not-possible and ODF lags behind professional packages in every > >> respect (well, Excel fares poor in this respect, too, but then you > >> shouldn't look at Excel when doing regressions). > > Would you suggest that we use R or something similar as the language > > for such models? (I have utterly no position one way or the other but > > would like to see us avoid having to define a language for such > > purposes and then seek implementers for it.) > > > > What would that mean in your experience for interchange? > > > > I know of R by the name but don't know its history or the level of > > support for various versions. > > > > Would this be a situation where the results of a model would be stored > > in case the document was processed by an application that lacked R > > support (assuming we chose that as the language)? > > > > Hope you are having a great day! > > > > Patrick > > > > Leonard Mada wrote: > >> Dear Laurent, > >> > >> I miss some frequently encountered regression types. > >> > >> The most frequent regression type on binary outcome variables is a > >> logistic regression. I therefore miss this one. > >> > >> However, what wonders me most, is the number of regression types > >> used. Well, to state it differently, there is a specific name for > >> every new regression type. > >> > >> There is a better alternative, and this alternative is already > >> implemented in the S+ language and in the open source R program. It > >> basically allows the user to specify the formula for the regression. > >> > >> There are basically 3 regression models: > >> > >> A.) Linear regression > >> - formulas of type: y = intercept + a1 * X1 + a2 * X2 + a3 * X3 + ... > >> - as seen, ODF doesn't permit a multivariate formula either, > >> i.e. X1, X2, X3, ... are different variables > >> > >> B.) Generalized linear models > >> - formulas differ slightly, but in the case of a logistic regression: > >> p(y) = 1 / (1 + 1/exp(intercept + a1 * X1 + a2 * X2 + a3 * X3 + ...) ) > >> where y is a binary variable and p(y) the probability of y > >> > >> C.) Non-Linear models > >> - this is the most interesting > >> - it allows specifying the formula for the regression > >> - e.g. lets say we want to determine the coefficients a & b for: > >> a * x / (x*x + b) > >> in R, this looks like: > >> model.nls <- nls( y ~ a*x / (x*x + b), start=list(a=1, b=1)) > >> where y is the outcome and x is the variable > >> > >> As a practical example: > >> [You can copy / paste this in R] > >> x <- rnorm(1000) # generate 1,000 random numbers > >> y <- rnorm(1000) + rnorm(1) * x / (x*x+1) > >> x.nls<-nls(y~a*x / (x*x+b*x+c), start=list(a=1,b=0,c=1)) > >> summary(x.nls) > >> > >>> Formula: y ~ a * x/(x * x + b * x + c) > >>> > >>> Parameters: > >>> Estimate Std. Error t value Pr(>|t|) > >>> a -1.83005 0.29605 -6.182 9.24e-10 *** > >>> b -0.04774 0.14024 -0.340 0.734 > >>> c 1.40155 0.35845 3.910 9.85e-05 *** > >>> --- > >>> Signif. codes: 0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1 > >> > >> We see, "b" is statistically non-significant and we can remove it > >> from the model (giving us then a * x / (x*x + c); we can rerun the > >> regression using this formula to obtain a better result ). > >> > >> The idea is: > >> I want a mechanism to specify the formula used in the regression. > >> Instead of storing a formula name, it would be wiser to store the > >> formula itself. This way, one can easily build *complex models* and > >> *multivariate models* (more than one variable). This is currently > >> not-possible and ODF lags behind professional packages in every > >> respect (well, Excel fares poor in this respect, too, but then you > >> shouldn't look at Excel when doing regressions). > >> > >> Sincerely, > >> > >> Leonard > >> > >> > >> Laurent BALLAND-POIRIER wrote: > >>> Dear TC Members, > >>> > >>> Please find enclosed a file format modification demand that Ingrid > >>> Halama and me wrote. It is about regression curves in spreadsheet. Some > >>> data are missing in ODF to get compatibility with other spreadsheets > >>> such as MS-Excel or Gnumeric. Numerous issues will not be solved till > >>> these data can not be saved. > >>> I hope I post in the right place. If not, please explain where to send > >>> this demand. > >>> > >>> Best regards, > >>> > >>> Laurent BP > > > -- > This publicly archived list offers a means to provide input to the > OASIS Open Document Format for Office Applications (OpenDocument) TC. > > In order to verify user consent to the Feedback License terms and > to minimize spam in the list archive, subscription is required > before posting. > > Subscribe: office-comment-subscribe@lists.oasis-open.org > Unsubscribe: office-comment-unsubscribe@lists.oasis-open.org > List help: office-comment-help@lists.oasis-open.org > List archive: http://lists.oasis-open.org/archives/office-comment/ > Feedback License: http://www.oasis-open.org/who/ipr/feedback_license.pdf > List Guidelines: http://www.oasis-open.org/maillists/guidelines.php > Committee: http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=office >

**Follow-Ups**:**Re: [office-comment] Demand for modification of ODF file format aboutregression curve in spreadsheet***From:*Leonard Mada <discoleo@gmx.net>

**References**:**Re: [office-comment] Demand for modification of ODF file format aboutregression curve in spreadsheet***From:*Leonard Mada <discoleo@gmx.net>

[Date Prev] | [Thread Prev] | [Thread Next] | [Date Next] -- [Date Index] | [Thread Index] | [List Home]