office-comment message

Subject: Re: [office-comment] Demand for modification of ODF file format aboutregression curve in spreadsheet
From: Leonard Mada <discoleo@gmx.net>
To: office-comment@lists.oasis-open.org
Date: Fri, 12 Dec 2008 00:19:49 +0200
Hello list members,


Bryce L Nordgren wrote:
> Would it be fair to observe that this change request highlights a 
> shortcoming of the spreadsheet model itself?  Not the ODF spreadsheet 
> model alone, mind you, but spreadsheets in general?
> [...]
> The spreadsheet document has a strong data-oriented focus.  It has no 
> vocabulary for expressing analytical equations, nor facilities for 
> manipulating them.
>   

While I agree to some point with the previous comments, I have to 
digress slightly. I will discuss some of these points below.

I will start however with S+: this language is different from the other 
mentioned languages, because it is both data-driven, but is also 
strongly object-oriented with a strong emphasis on statistical objects. 
It can represent well a lot of mathematical objects, therefore it has 
really elements from both camps.

S+ works with vectors and more complex data-objects (e.g. data-frames), 
which are even more powerful than what spreadsheets offer. Therefore, S+ 
(or a subset of it) would be quite applicable to spreadsheets.

Although spreadsheets seem not to define variables, named ranges may 
well be seen as a variable, in the sense that it can be used within 
another formula. This is even in line with S+: data and variables are 
the same, and could be the equivalent of a named range, e.g.:

my.dataframe<-data.frame(A=c(1:10),B=rnorm(10),C=0)
my.dataframe
>     A           B C
> 1   1  1.37139420 0
> 2   2 -0.12350248 0
> 3   3 -0.07561494 0
> 4   4  0.40629086 0
> 5   5 -0.75204552 0
> 6   6 -1.92589182 0
> 7   7  0.45197366 0
> 8   8 -0.62305795 0
> 9   9  0.29901475 0
> 10 10 -0.75870452 0

So, the data frame resembles well a range with column headers and should 
be the building block of every spreadsheet. Of course, S+ accepts much 
more complex objects, but these basic ones are the essential ones. This 
shouldn't be too unexpected, because statistics involves data, and this 
data is sometimes acquired using spreadsheets (and sometimes databases), 
but there is always strong emphasis on data.

A second thought involves some other ODF-elements:
a.) the MATH-ML formula
b.) OpenFormula

So, basically ODF tries to represent mathematical equations in 
text-docuemnts, and it tries to define something formula-oriented in 
spreadsheets. But it fails to capture the most important formula-related 
features, namely expressing a formula.

That is why, I still consider that *it is critical* to ODF to actually 
correctly represent formulas.

And regressions are just the perfect place to start with formulas. 
Regressions *are* formulas, or more precisely modeling data on a formula.

I strongly suggest both expressing the regression formula as a true 
formula, and also offering mechanisms to store a script that allows 
computing externally the regression. An ODF-compliant application might 
implement algorithms to fit a model internally, or would be able to fit 
a model through accessing some external program or plugin.

A formula would clearly make ODF more flexible. Now, if a spreadsheet 
implementation wants to implement a new regression method, it needs to 
get ODF-approval. With the ability to specify a formula, a particular 
implementation might well perform complex regressions. A niche product 
used e.g. in some astrophysical research might well implement a very 
special regression model and still be ODF-compliant. And all 
ODF-compliant applications would be still able to read the formula, 
though not able to compute it (to fit the model). [Some regression types 
might be named mandatory for every ODF-compliant application, while any 
non-linear models should probably be left as optional.]

Sincerely,

Leonard

> If you are thinking of XML-izing something like S+ to add a strong 
> analytical focus to the ODF suite of data content models, I would humbly 
> suggest defining another document type.  Please note that there is already 
> a "space" for it.  Annex C of v1.1 already mentions the extension *.odf 
> and the MIME type application/vnd.oasis.opendocument.formula.  Yet the 
> content model for such a file is unspecified.  So define the content 
> model.  It is obviously intended to support some form of 
> mathematically-oriented document. 
>
> But take a minute to step back and really think about what you need.  If 
> you do this with S+ (or IDL, or Matlab or Mathematica or any of them) you 
> are in effect defining an orthodox scripting language for all applications 
> which edit ODF files.  These languages are fully fleged procedural 
> languages with their own standard libraries and their own mechanisms for 
> including code not defined locally.  Would an augmentation of the 
> <office:script/> element suffice?  Specify how a spreadsheet cell (or 
> arbitrary other thing) would trigger execution of code within a specific 
> script element, and how a script object declares its parameters and return 
> values.  Specify how data types are handled (e.g., provide a common core 
> that is currently comprehensible to the spreadsheet cell element, provide 
> a means for the script to declare new data types that it understands...). 
> Keep it language neutral and just define how you shove data in and how you 
> accept data from the Big Black Box.  You'd also need to define how 
> externally specified code is included.  The contents of the ODF file would 
> need to contain the source code, and also a  CORBA/Interface Definition 
> Language/other description of the <script/> element's contents.  Scripts 
> aren't really all that useful if you cannot express how they interact with 
> the data in the rest of the document.  They're also dangerous to rely on: 
> you can't ensure that the person you're sending the document to will have 
> the right language plugin, or that it will even be available for their 
> machine/app.
>
> I think both approaches have merits and pitfalls.  If you choose the route 
> of selecting an orthodox scripting language, all conforming 
> implementations should be interoperable.  Augmenting the script element 
> encourages people to use "their favorite package", which may mean that you 
> may not be able to recompute the regression line if you received the 
> spreadsheet from someone who has a different pet package (either free or 
> commercial), even if YOUR package has a routine which does the same thing 
> (with a slightly different name/parameter list).
>
> Bryce
>
>
>
> Leonard Mada <discoleo@gmx.net> wrote on 12/08/2008 03:08:02 PM:
>
>   
>> Dear Patrick,
>>
>> I indeed strongly suggest to use a dedicated program for every 
>> regression. This means 2 things:
>> 1.) use the dedicated language
>>     [maybe xml-ized for ODF-compatibility]
>> 2.) build a bridge/interface to the program itself
>>
>> I will explain shortly why I find both issues relevant.
>>
>> 1.) Use the language
>> The S+ language is a mature language. There might exist alternatives 
>> like in Octave, Mathematica, (...), but I am most accustomed to R (S+), 
>> and this language seems also most suitable for spreadsheet users. The 
>> other languages are more mathematically oriented. [Though ODF 3+ might 
>> implement multiple language support. ;-) ]
>>
>> Tex already allows implementing foreign code (to my knowledge), and 
>> there is even an R plugin to process R-code embedded in an ODF-stream (I 
>>     
>
>   
>> think implemented at Novartis), but this is naturally a hack in the 
>> sense that an ODF-aware application won't process that code, but R 
>> itself will "process" the ODF-stream and replace the code with the 
>> output (skipping actually the true ODF).
>>
>> However, I believe that true open source means reusing existing 
>> facilities. In this respect, ODF should reuse the S+ language when 
>> useful within ODF. And I certainly think that S+ is more powerful than 
>> ODF in every domain related to statistics (including regressions).
>>
>> [IMPLEMENTATION DETAILS - there might exist better alternatives]
>> What is needed is either to xml-ize the R-code and store it in the ODF, 
>> or indeed store it as a specific code-block within the ODF. I actually 
>> think that the R-code (S+ commands) can be easily xml-ized and stored as 
>>     
>
>   
>> true XML within the ODF stream (especially some parts like formulas for 
>> regressions, which is actually a mathematical function).
>>
>> One of my previous messages might also be interesting (although in a 
>> different context):
>> **Interface META-Functions**
>> http://lists.oasis-open.org/archives/office-comment/200706/msg00019.html
>>
>> 2.) Use R instead of internal regression engine
>> This has 2 major advantages:
>>  a.) R is more accurate and versatile, offering many more possibilities 
>> and finer control
>>  b.) R offers much more informations
>>
>> a.) Versatile: Does not need to be discussed really. I hope my previous 
>> example showed already the effects. Regarding accuracy, well, peeking 
>> through the code you will easily see that numerical analysts have been 
>> at work (please read this short message: 
>> http://sc.openoffice.org/servlets/ReadMsg?listName=dev&msgNo=3161).
>>
>> b.) A nice consequence of my previous example was the fact that I used 
>> initially a slightly wrong formula:
>>  a * x / (x*x + b * x + c)
>>
>> But then, beyond the coefficients a, b and c, R calculated also some 
>> statistics, which allowed me to drop the term (b* x) from the model, and 
>>     
>
>   
>> recompute the coefficients.
>>
>> Actually, a mathematical engine will (almost) always compute something. 
>> The real problem is, IF that something makes sense.
>>
>> You do NOT get this information using a spreadsheet. [This is one of the 
>>     
>
>   
>> main reasons I do not use regressions in spreadsheets.]
>> [On a side note, IF the coefficient 'a' was very close to 0, R would 
>> have marked 'a' as statistically insignificant, but 'b' as significant. 
>> Since 'a cannot be insignificant, that would mean that the model is 
>> flawed and is NOT well represented by that equation - independent of the 
>>     
>
>   
>> actually computed value for 'a', e.g.:
>> y <- rnorm(1000) + 3*10^-16 * x / (x*x+1)
>> x.nls<-nls(y~a*x / (x*x+b*x+c), start=list(a=1,b=0,c=1))
>> summary(x.nls)
>>     
>>> Formula: y ~ a * x/(x * x + b * x + c)
>>>
>>> Parameters:
>>>     Estimate Std. Error t value Pr(>|t|) 
>>> a -0.0006002  0.0007155  -0.839    0.402 
>>> b  2.0704560  0.0305020  67.879   <2e-16 ***
>>> c  1.0724880  0.0314753  34.074   <2e-16 ***
>>> ---
>>> Signif. codes:  0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1 
>>>       
>> The 'actual' results depend much on the random numbers, but such a case 
>> is strongly suggestive that our model is completely bogus, although the 
>> mathematical engine has computed a value for a, b and c. The variability 
>>     
>
>   
>> in y is only insignificantly influenced by a mathematical term of the 
>> form a*x / (x*x + b * x + c), being largely given by the variability in 
>> the random numbers - rnorm(1000).]
>>
>> I hope this helps to explain some of my requests and previous comments.
>>
>> Sincerely,
>>
>> Leonard
>>
>>
>> Patrick Durusau wrote:
>>     
>>> Leonard,
>>>
>>> When you say:
>>>
>>>       
>>>> The idea is:
>>>> I want a mechanism to specify the formula used in the regression. 
>>>> Instead of storing a formula name, it would be wiser to store the 
>>>> formula itself. This way, one can easily build *complex models* and 
>>>> *multivariate models* (more than one variable). This is currently 
>>>> not-possible and ODF lags behind professional packages in every 
>>>> respect (well, Excel fares poor in this respect, too, but then you 
>>>> shouldn't look at Excel when doing regressions). 
>>>>         
>>> Would you suggest that we use R or something similar as the language 
>>> for such models? (I have utterly no position one way or the other but 
>>> would like to see us avoid having to define a language for such 
>>> purposes and then seek implementers for it.)
>>>
>>> What would that mean in your experience for interchange?
>>>
>>> I know of R by the name but don't know its history or the level of 
>>> support for various versions.
>>>
>>> Would this be a situation where the results of a model would be stored 
>>>       
>
>   
>>> in case the document was processed by an application that lacked R 
>>> support (assuming we chose that as the language)?
>>>
>>> Hope you are having a great day!
>>>
>>> Patrick
>>>
>>> Leonard Mada wrote:
>>>       
>>>> Dear Laurent,
>>>>
>>>> I miss some frequently encountered regression types.
>>>>
>>>> The most frequent regression type on binary outcome variables is a 
>>>> logistic regression. I therefore miss this one.
>>>>
>>>> However, what wonders me most, is the number of regression types 
>>>> used. Well, to state it differently, there is a specific name for 
>>>> every new regression type.
>>>>
>>>> There is a better alternative, and this alternative is already 
>>>> implemented in the S+ language and in the open source R program. It 
>>>> basically allows the user to specify the formula for the regression.
>>>>
>>>> There are basically 3 regression models:
>>>>
>>>> A.) Linear regression
>>>> - formulas of type: y = intercept + a1 * X1 + a2 * X2 + a3 * X3 + ...
>>>> - as seen, ODF doesn't permit a multivariate formula either,
>>>> i.e. X1, X2, X3, ... are different variables
>>>>
>>>> B.) Generalized linear models
>>>> - formulas differ slightly, but in the case of a logistic regression:
>>>> p(y) = 1 / (1 + 1/exp(intercept + a1 * X1 + a2 * X2 + a3 * X3 + ...) 
>>>>         
> )
>   
>>>> where y is a binary variable and p(y) the probability of y
>>>>
>>>> C.) Non-Linear models
>>>> - this is the most interesting
>>>> - it allows specifying the formula for the regression
>>>> - e.g. lets say we want to determine the coefficients a & b for:
>>>> a * x / (x*x + b)
>>>> in R, this looks like:
>>>> model.nls <- nls( y ~ a*x / (x*x + b), start=list(a=1, b=1))
>>>> where y is the outcome and x is the variable
>>>>
>>>> As a practical example:
>>>> [You can copy / paste this in R]
>>>> x <- rnorm(1000) # generate 1,000 random numbers
>>>> y <- rnorm(1000) + rnorm(1) * x / (x*x+1)
>>>> x.nls<-nls(y~a*x / (x*x+b*x+c), start=list(a=1,b=0,c=1))
>>>> summary(x.nls)
>>>>
>>>>         
>>>>> Formula: y ~ a * x/(x * x + b * x + c)
>>>>>
>>>>> Parameters:
>>>>> Estimate Std. Error t value Pr(>|t|)
>>>>> a -1.83005 0.29605 -6.182 9.24e-10 ***
>>>>> b -0.04774 0.14024 -0.340 0.734
>>>>> c 1.40155 0.35845 3.910 9.85e-05 ***
>>>>> ---
>>>>> Signif. codes: 0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1
>>>>>           
>>>> We see, "b" is statistically non-significant and we can remove it 
>>>> from the model (giving us then a * x / (x*x + c); we can rerun the 
>>>> regression using this formula to obtain a better result ).
>>>>
>>>> The idea is:
>>>> I want a mechanism to specify the formula used in the regression. 
>>>> Instead of storing a formula name, it would be wiser to store the 
>>>> formula itself. This way, one can easily build *complex models* and 
>>>> *multivariate models* (more than one variable). This is currently 
>>>> not-possible and ODF lags behind professional packages in every 
>>>> respect (well, Excel fares poor in this respect, too, but then you 
>>>> shouldn't look at Excel when doing regressions).
>>>>
>>>> Sincerely,
>>>>
>>>> Leonard
>>>>
>>>>
>>>> Laurent BALLAND-POIRIER wrote:
>>>>         
>>>>> Dear TC Members,
>>>>>
>>>>> Please find enclosed a file format modification demand that Ingrid
>>>>> Halama and me wrote. It is about regression curves in spreadsheet. 
>>>>>           
> Some
>   
>>>>> data are missing in ODF to get compatibility with other spreadsheets
>>>>> such as MS-Excel or Gnumeric. Numerous issues will not be solved 
>>>>>           
> till
>   
>>>>> these data can not be saved.
>>>>> I hope I post in the right place. If not, please explain where to 
>>>>>           
> send
>   
>>>>> this demand.
>>>>>
>>>>> Best regards,
>>>>>
>>>>> Laurent BP
>>>>>           
>> -- 
>> This publicly archived list offers a means to provide input to the
>> OASIS Open Document Format for Office Applications (OpenDocument) TC.
>>
>> In order to verify user consent to the Feedback License terms and
>> to minimize spam in the list archive, subscription is required
>> before posting.
>>
>> Subscribe: office-comment-subscribe@lists.oasis-open.org
>> Unsubscribe: office-comment-unsubscribe@lists.oasis-open.org
>> List help: office-comment-help@lists.oasis-open.org
>> List archive: http://lists.oasis-open.org/archives/office-comment/
>> Feedback License: http://www.oasis-open.org/who/ipr/feedback_license.pdf
>> List Guidelines: http://www.oasis-open.org/maillists/guidelines.php
>> Committee: 
>>     
> http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=office
>   
>
>
References:
- Re: [office-comment] Demand for modification of ODF file format aboutregression curve in spreadsheet
  - From: Bryce L Nordgren <bnordgren@fs.fed.us>