office-comment message

Subject: Re: [office-comment] Demand for modification of ODF file format aboutregression curve in spreadsheet
From: Bryce L Nordgren <bnordgren@fs.fed.us>
To: Leonard Mada <discoleo@gmx.net>
Date: Mon, 8 Dec 2008 17:13:11 -0700
Would it be fair to observe that this change request highlights a 
shortcoming of the spreadsheet model itself?  Not the ODF spreadsheet 
model alone, mind you, but spreadsheets in general?  In particular, 
spreadsheets really have no mechanism to handle analytical equations at 
all.  They deal primarily with actual data values and not equations.  And 
so they enumerate a handful of the most common models which can be applied 
to the data.

The fact that cells may be assigned formulas is somewhat misleading, as 
there really is no such thing as a "variable".  A cell reference is just 
not the same thing.

Nearly all math apps approach this problem with either a strong 
data-oriented focus (Matlab, IDL, ...) or a strong analytical focus 
(Mathematica, ...).  There is always a need for both and so there is 
always some form of cross-pollenation.  For instance, when data-driven 
programs fit data to a model, they typically have no vocabulary to express 
the model they are fitting; and so return the fitted parameters as a 
vector or array.

The spreadsheet document has a strong data-oriented focus.  It has no 
vocabulary for expressing analytical equations, nor facilities for 
manipulating them.

If you are thinking of XML-izing something like S+ to add a strong 
analytical focus to the ODF suite of data content models, I would humbly 
suggest defining another document type.  Please note that there is already 
a "space" for it.  Annex C of v1.1 already mentions the extension *.odf 
and the MIME type application/vnd.oasis.opendocument.formula.  Yet the 
content model for such a file is unspecified.  So define the content 
model.  It is obviously intended to support some form of 
mathematically-oriented document. 

But take a minute to step back and really think about what you need.  If 
you do this with S+ (or IDL, or Matlab or Mathematica or any of them) you 
are in effect defining an orthodox scripting language for all applications 
which edit ODF files.  These languages are fully fleged procedural 
languages with their own standard libraries and their own mechanisms for 
including code not defined locally.  Would an augmentation of the 
<office:script/> element suffice?  Specify how a spreadsheet cell (or 
arbitrary other thing) would trigger execution of code within a specific 
script element, and how a script object declares its parameters and return 
values.  Specify how data types are handled (e.g., provide a common core 
that is currently comprehensible to the spreadsheet cell element, provide 
a means for the script to declare new data types that it understands...). 
Keep it language neutral and just define how you shove data in and how you 
accept data from the Big Black Box.  You'd also need to define how 
externally specified code is included.  The contents of the ODF file would 
need to contain the source code, and also a  CORBA/Interface Definition 
Language/other description of the <script/> element's contents.  Scripts 
aren't really all that useful if you cannot express how they interact with 
the data in the rest of the document.  They're also dangerous to rely on: 
you can't ensure that the person you're sending the document to will have 
the right language plugin, or that it will even be available for their 
machine/app.

I think both approaches have merits and pitfalls.  If you choose the route 
of selecting an orthodox scripting language, all conforming 
implementations should be interoperable.  Augmenting the script element 
encourages people to use "their favorite package", which may mean that you 
may not be able to recompute the regression line if you received the 
spreadsheet from someone who has a different pet package (either free or 
commercial), even if YOUR package has a routine which does the same thing 
(with a slightly different name/parameter list).

Bryce



Leonard Mada <discoleo@gmx.net> wrote on 12/08/2008 03:08:02 PM:

> Dear Patrick,
> 
> I indeed strongly suggest to use a dedicated program for every 
> regression. This means 2 things:
> 1.) use the dedicated language
>     [maybe xml-ized for ODF-compatibility]
> 2.) build a bridge/interface to the program itself
> 
> I will explain shortly why I find both issues relevant.
> 
> 1.) Use the language
> The S+ language is a mature language. There might exist alternatives 
> like in Octave, Mathematica, (...), but I am most accustomed to R (S+), 
> and this language seems also most suitable for spreadsheet users. The 
> other languages are more mathematically oriented. [Though ODF 3+ might 
> implement multiple language support. ;-) ]
> 
> Tex already allows implementing foreign code (to my knowledge), and 
> there is even an R plugin to process R-code embedded in an ODF-stream (I 

> think implemented at Novartis), but this is naturally a hack in the 
> sense that an ODF-aware application won't process that code, but R 
> itself will "process" the ODF-stream and replace the code with the 
> output (skipping actually the true ODF).
> 
> However, I believe that true open source means reusing existing 
> facilities. In this respect, ODF should reuse the S+ language when 
> useful within ODF. And I certainly think that S+ is more powerful than 
> ODF in every domain related to statistics (including regressions).
> 
> [IMPLEMENTATION DETAILS - there might exist better alternatives]
> What is needed is either to xml-ize the R-code and store it in the ODF, 
> or indeed store it as a specific code-block within the ODF. I actually 
> think that the R-code (S+ commands) can be easily xml-ized and stored as 

> true XML within the ODF stream (especially some parts like formulas for 
> regressions, which is actually a mathematical function).
> 
> One of my previous messages might also be interesting (although in a 
> different context):
> **Interface META-Functions**
> http://lists.oasis-open.org/archives/office-comment/200706/msg00019.html
> 
> 2.) Use R instead of internal regression engine
> This has 2 major advantages:
>  a.) R is more accurate and versatile, offering many more possibilities 
> and finer control
>  b.) R offers much more informations
> 
> a.) Versatile: Does not need to be discussed really. I hope my previous 
> example showed already the effects. Regarding accuracy, well, peeking 
> through the code you will easily see that numerical analysts have been 
> at work (please read this short message: 
> http://sc.openoffice.org/servlets/ReadMsg?listName=dev&msgNo=3161).
> 
> b.) A nice consequence of my previous example was the fact that I used 
> initially a slightly wrong formula:
>  a * x / (x*x + b * x + c)
> 
> But then, beyond the coefficients a, b and c, R calculated also some 
> statistics, which allowed me to drop the term (b* x) from the model, and 

> recompute the coefficients.
> 
> Actually, a mathematical engine will (almost) always compute something. 
> The real problem is, IF that something makes sense.
> 
> You do NOT get this information using a spreadsheet. [This is one of the 

> main reasons I do not use regressions in spreadsheets.]
> [On a side note, IF the coefficient 'a' was very close to 0, R would 
> have marked 'a' as statistically insignificant, but 'b' as significant. 
> Since 'a cannot be insignificant, that would mean that the model is 
> flawed and is NOT well represented by that equation - independent of the 

> actually computed value for 'a', e.g.:
> y <- rnorm(1000) + 3*10^-16 * x / (x*x+1)
> x.nls<-nls(y~a*x / (x*x+b*x+c), start=list(a=1,b=0,c=1))
> summary(x.nls)
> > Formula: y ~ a * x/(x * x + b * x + c)
> >
> > Parameters:
> >     Estimate Std. Error t value Pr(>|t|) 
> > a -0.0006002  0.0007155  -0.839    0.402 
> > b  2.0704560  0.0305020  67.879   <2e-16 ***
> > c  1.0724880  0.0314753  34.074   <2e-16 ***
> > ---
> > Signif. codes:  0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1 
> The 'actual' results depend much on the random numbers, but such a case 
> is strongly suggestive that our model is completely bogus, although the 
> mathematical engine has computed a value for a, b and c. The variability 

> in y is only insignificantly influenced by a mathematical term of the 
> form a*x / (x*x + b * x + c), being largely given by the variability in 
> the random numbers - rnorm(1000).]
> 
> I hope this helps to explain some of my requests and previous comments.
> 
> Sincerely,
> 
> Leonard
> 
> 
> Patrick Durusau wrote:
> > Leonard,
> >
> > When you say:
> >
> >> The idea is:
> >> I want a mechanism to specify the formula used in the regression. 
> >> Instead of storing a formula name, it would be wiser to store the 
> >> formula itself. This way, one can easily build *complex models* and 
> >> *multivariate models* (more than one variable). This is currently 
> >> not-possible and ODF lags behind professional packages in every 
> >> respect (well, Excel fares poor in this respect, too, but then you 
> >> shouldn't look at Excel when doing regressions). 
> > Would you suggest that we use R or something similar as the language 
> > for such models? (I have utterly no position one way or the other but 
> > would like to see us avoid having to define a language for such 
> > purposes and then seek implementers for it.)
> >
> > What would that mean in your experience for interchange?
> >
> > I know of R by the name but don't know its history or the level of 
> > support for various versions.
> >
> > Would this be a situation where the results of a model would be stored 

> > in case the document was processed by an application that lacked R 
> > support (assuming we chose that as the language)?
> >
> > Hope you are having a great day!
> >
> > Patrick
> >
> > Leonard Mada wrote:
> >> Dear Laurent,
> >>
> >> I miss some frequently encountered regression types.
> >>
> >> The most frequent regression type on binary outcome variables is a 
> >> logistic regression. I therefore miss this one.
> >>
> >> However, what wonders me most, is the number of regression types 
> >> used. Well, to state it differently, there is a specific name for 
> >> every new regression type.
> >>
> >> There is a better alternative, and this alternative is already 
> >> implemented in the S+ language and in the open source R program. It 
> >> basically allows the user to specify the formula for the regression.
> >>
> >> There are basically 3 regression models:
> >>
> >> A.) Linear regression
> >> - formulas of type: y = intercept + a1 * X1 + a2 * X2 + a3 * X3 + ...
> >> - as seen, ODF doesn't permit a multivariate formula either,
> >> i.e. X1, X2, X3, ... are different variables
> >>
> >> B.) Generalized linear models
> >> - formulas differ slightly, but in the case of a logistic regression:
> >> p(y) = 1 / (1 + 1/exp(intercept + a1 * X1 + a2 * X2 + a3 * X3 + ...) 
)
> >> where y is a binary variable and p(y) the probability of y
> >>
> >> C.) Non-Linear models
> >> - this is the most interesting
> >> - it allows specifying the formula for the regression
> >> - e.g. lets say we want to determine the coefficients a & b for:
> >> a * x / (x*x + b)
> >> in R, this looks like:
> >> model.nls <- nls( y ~ a*x / (x*x + b), start=list(a=1, b=1))
> >> where y is the outcome and x is the variable
> >>
> >> As a practical example:
> >> [You can copy / paste this in R]
> >> x <- rnorm(1000) # generate 1,000 random numbers
> >> y <- rnorm(1000) + rnorm(1) * x / (x*x+1)
> >> x.nls<-nls(y~a*x / (x*x+b*x+c), start=list(a=1,b=0,c=1))
> >> summary(x.nls)
> >>
> >>> Formula: y ~ a * x/(x * x + b * x + c)
> >>>
> >>> Parameters:
> >>> Estimate Std. Error t value Pr(>|t|)
> >>> a -1.83005 0.29605 -6.182 9.24e-10 ***
> >>> b -0.04774 0.14024 -0.340 0.734
> >>> c 1.40155 0.35845 3.910 9.85e-05 ***
> >>> ---
> >>> Signif. codes: 0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1
> >>
> >> We see, "b" is statistically non-significant and we can remove it 
> >> from the model (giving us then a * x / (x*x + c); we can rerun the 
> >> regression using this formula to obtain a better result ).
> >>
> >> The idea is:
> >> I want a mechanism to specify the formula used in the regression. 
> >> Instead of storing a formula name, it would be wiser to store the 
> >> formula itself. This way, one can easily build *complex models* and 
> >> *multivariate models* (more than one variable). This is currently 
> >> not-possible and ODF lags behind professional packages in every 
> >> respect (well, Excel fares poor in this respect, too, but then you 
> >> shouldn't look at Excel when doing regressions).
> >>
> >> Sincerely,
> >>
> >> Leonard
> >>
> >>
> >> Laurent BALLAND-POIRIER wrote:
> >>> Dear TC Members,
> >>>
> >>> Please find enclosed a file format modification demand that Ingrid
> >>> Halama and me wrote. It is about regression curves in spreadsheet. 
Some
> >>> data are missing in ODF to get compatibility with other spreadsheets
> >>> such as MS-Excel or Gnumeric. Numerous issues will not be solved 
till
> >>> these data can not be saved.
> >>> I hope I post in the right place. If not, please explain where to 
send
> >>> this demand.
> >>>
> >>> Best regards,
> >>>
> >>> Laurent BP
> 
> 
> -- 
> This publicly archived list offers a means to provide input to the
> OASIS Open Document Format for Office Applications (OpenDocument) TC.
> 
> In order to verify user consent to the Feedback License terms and
> to minimize spam in the list archive, subscription is required
> before posting.
> 
> Subscribe: office-comment-subscribe@lists.oasis-open.org
> Unsubscribe: office-comment-unsubscribe@lists.oasis-open.org
> List help: office-comment-help@lists.oasis-open.org
> List archive: http://lists.oasis-open.org/archives/office-comment/
> Feedback License: http://www.oasis-open.org/who/ipr/feedback_license.pdf
> List Guidelines: http://www.oasis-open.org/maillists/guidelines.php
> Committee: 
http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=office
>
Follow-Ups:
- Re: [office-comment] Demand for modification of ODF file format aboutregression curve in spreadsheet
  - From: Leonard Mada <discoleo@gmx.net>
References:
- Re: [office-comment] Demand for modification of ODF file format aboutregression curve in spreadsheet
  - From: Leonard Mada <discoleo@gmx.net>