office-comment message

Subject: Re: [office-comment] Demand for modification of ODF file format about regression curve in spreadsheet
From: Leonard Mada <discoleo@gmx.net>
To: office-comment@lists.oasis-open.org
Date: Tue, 09 Dec 2008 00:08:02 +0200
Dear Patrick,

I do strongly suggest using a dedicated program for every 
regression. This means 2 things:
1.) use the dedicated language
    [maybe xml-ized for ODF-compatibility]
2.) build a bridge/interface to the program itself

I will briefly explain why I find both points relevant.

1.) Use the language
The S+ language is a mature language. Alternatives exist, such as 
Octave, Mathematica, (...), but I am most accustomed to R (S+), 
and this language also seems the most suitable for spreadsheet users. 
The other languages are more mathematically oriented. [Though ODF 3+ 
might implement multiple language support. ;-) ]

TeX already allows embedding foreign code (to my knowledge), and 
there is even an R plugin to process R code embedded in an ODF stream (I 
think implemented at Novartis). This is naturally a hack, in the 
sense that an ODF-aware application won't process that code; instead, R 
itself "processes" the ODF stream and replaces the code with its 
output (actually bypassing true ODF).

However, I believe that true open source means reusing existing 
facilities. In this respect, ODF should reuse the S+ language when 
useful within ODF. And I certainly think that S+ is more powerful than 
ODF in every domain related to statistics (including regressions).

[IMPLEMENTATION DETAILS - there might exist better alternatives]
What is needed is either to xml-ize the R code and store it in the ODF, 
or to store it as a specific code block within the ODF. I actually 
think that the R code (S+ commands) can easily be xml-ized and stored as 
true XML within the ODF stream (especially parts like the formulas for 
regressions, which are really just mathematical functions).
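
To make the xml-izing idea concrete, here is a minimal sketch in R. It only relies on the fact that R stores a formula as a tree of calls and symbols, which maps naturally onto nested XML elements. The element names (call, symbol, constant) and the function names are invented for illustration; they are not part of ODF or of any existing R package.

```r
# Hypothetical sketch: serialise an R expression or formula as nested XML.
# The element names <call>, <symbol>, <constant> are assumptions made up for
# this example; no ODF schema defines them.

xml_escape <- function(s) {
  s <- gsub("&", "&amp;", s, fixed = TRUE)
  s <- gsub("<", "&lt;", s, fixed = TRUE)
  gsub(">", "&gt;", s, fixed = TRUE)
}

expr_to_xml <- function(e) {
  if (is.call(e)) {
    # The first element of a call is the function or operator being applied
    head <- e[[1]]
    fn <- if (is.name(head)) as.character(head)
          else paste(deparse(head), collapse = "")
    # The remaining elements are the arguments; convert each recursively
    args <- vapply(seq_along(e)[-1],
                   function(i) expr_to_xml(e[[i]]),
                   character(1))
    paste0('<call fn="', xml_escape(fn), '">',
           paste(args, collapse = ""), "</call>")
  } else if (is.name(e)) {
    paste0("<symbol>", xml_escape(as.character(e)), "</symbol>")
  } else {
    # Numeric or other literal constants
    paste0("<constant>", xml_escape(format(e)), "</constant>")
  }
}

model <- y ~ a * x / (x * x + b)
cat(expr_to_xml(model), "\n")
```

Note that R treats parentheses as a call to the function "(", so they show up as their own element. A real mapping would probably drop them, since the nesting already encodes the grouping.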

One of my previous messages might also be interesting (although in a 
different context):
**Interface META-Functions**
http://lists.oasis-open.org/archives/office-comment/200706/msg00019.html

2.) Use R instead of internal regression engine
This has 2 major advantages:
 a.) R is more accurate and versatile, offering many more possibilities 
and finer control
 b.) R offers much more information

a.) Versatile: this hardly needs discussion. I hope my previous 
example already showed the effect. Regarding accuracy: peeking 
through the code, you will easily see that numerical analysts have been 
at work (please read this short message: 
http://sc.openoffice.org/servlets/ReadMsg?listName=dev&msgNo=3161).

b.) A nice consequence of my previous example was that I initially 
used a slightly wrong formula:
 a * x / (x*x + b * x + c)

But then, beyond the coefficients a, b and c, R also calculated some 
statistics, which allowed me to drop the term (b * x) from the model and 
recompute the coefficients.
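
The drop-and-refit step can be sketched as follows. This is an illustration under my own assumptions, not the exact run from my previous message: the data are simulated with a stronger signal (a = 2, c = 1, noise sd = 0.2) so that nls() converges reliably, and the seed and object names are arbitrary.

```r
# Sketch: fit the full model, inspect the statistics, drop b*x, refit.
# Simulated data and all object names are assumptions for illustration.
set.seed(42)                          # arbitrary seed, only for reproducibility
x <- rnorm(1000)
y <- 2 * x / (x * x + 1) + rnorm(1000, sd = 0.2)   # true model has no b*x term

# Full model, including the superfluous b*x term
fit_full <- nls(y ~ a * x / (x * x + b * x + c),
                start = list(a = 1, b = 0, c = 1))
print(coef(summary(fit_full)))        # estimates, std. errors, t and p values

# Reduced model without the b*x term
fit_reduced <- nls(y ~ a * x / (x * x + c),
                   start = list(a = 1, c = 1))
print(coef(summary(fit_reduced)))

# F-test on the two nested models: does the extra term improve the fit?
print(anova(fit_reduced, fit_full))
```

The p-value for b in the first table and the F-test in the last one are the kind of information a spreadsheet trend line does not report.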

Actually, a mathematical engine will (almost) always compute something. 
The real question is whether that something makes sense.

You do NOT get this information using a spreadsheet. [This is one of the 
main reasons I do not use regressions in spreadsheets.]
[On a side note, if the coefficient 'a' were very close to 0, R would 
mark 'a' as statistically insignificant, but 'b' as significant. 
Since 'a' cannot be insignificant (with a = 0 the whole term vanishes, 
and b would have nothing to act on), that would mean that the model is 
flawed and that the data are NOT well represented by that equation, 
independent of the actually computed value for 'a', e.g.:
x <- rnorm(1000)  # x as in my previous example
y <- rnorm(1000) + 3*10^-16 * x / (x*x+1)  # the signal term is negligible
x.nls <- nls(y ~ a*x / (x*x + b*x + c), start=list(a=1, b=0, c=1))
summary(x.nls)
> Formula: y ~ a * x/(x * x + b * x + c)
>
> Parameters:
>     Estimate Std. Error t value Pr(>|t|)   
> a -0.0006002  0.0007155  -0.839    0.402   
> b  2.0704560  0.0305020  67.879   <2e-16 ***
> c  1.0724880  0.0314753  34.074   <2e-16 ***
> ---
> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
The 'actual' results depend heavily on the random numbers, but such a 
case strongly suggests that our model is completely bogus, although the 
mathematical engine has computed values for a, b and c. The variability 
in y is only insignificantly influenced by a mathematical term of the 
form a*x / (x*x + b * x + c), being largely given by the variability in 
the random numbers - rnorm(1000).]
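
The same point, that the engine always returns numbers and only the statistics say whether they mean anything, can be sketched with a plain linear model. I use lm() here only because it always returns a fit, whereas nls() may fail to converge on pure noise. The data frame layout at the end is a hypothetical example of what a host application could store next to the coefficients.

```r
# Sketch: regress pure noise on an unrelated variable.
# Seed, sample size and column names are assumptions for illustration.
set.seed(1)
x <- rnorm(200)
y <- rnorm(200)               # y is pure noise, unrelated to x

fit <- lm(y ~ x)
tab <- coef(summary(fit))     # matrix: Estimate, Std. Error, t value, Pr(>|t|)

# The engine always returns coefficients ...
print(tab[, "Estimate"])
# ... and the p-values indicate whether they carry any meaning.
print(tab[, "Pr(>|t|)"])

# Flatten into a table a host application could keep with the chart
result <- data.frame(term = rownames(tab),
                     estimate = tab[, "Estimate"],
                     p_value = tab[, "Pr(>|t|)"],
                     row.names = NULL)
print(result)
```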

I hope this helps to explain some of my requests and previous comments.

Sincerely,

Leonard


Patrick Durusau wrote:
> Leonard,
>
> When you say:
>
>> The idea is:
>> I want a mechanism to specify the formula used in the regression. 
>> Instead of storing a formula name, it would be wiser to store the 
>> formula itself. This way, one can easily build *complex models* and 
>> *multivariate models* (more than one variable). This is currently 
>> not possible, and ODF lags behind professional packages in every 
>> respect (well, Excel fares poorly in this respect, too, but then you 
>> shouldn't look at Excel when doing regressions). 
> Would you suggest that we use R or something similar as the language 
> for such models? (I have utterly no position one way or the other but 
> would like to see us avoid having to define a language for such 
> purposes and then seek implementers for it.)
>
> What would that mean in your experience for interchange?
>
> I know of R by the name but don't know its history or the level of 
> support for various versions.
>
> Would this be a situation where the results of a model would be stored 
> in case the document was processed by an application that lacked R 
> support (assuming we chose that as the language)?
>
> Hope you are having a great day!
>
> Patrick
>
> Leonard Mada wrote:
>> Dear Laurent,
>>
>> I miss some frequently encountered regression types.
>>
>> The most frequent regression type on binary outcome variables is a 
>> logistic regression. I therefore miss this one.
>>
>> However, what puzzles me most is the number of regression types 
>> used. To put it differently, there is a specific name for 
>> every new regression type.
>>
>> There is a better alternative, and this alternative is already 
>> implemented in the S+ language and in the open source R program. It 
>> basically allows the user to specify the formula for the regression.
>>
>> There are basically 3 regression models:
>>
>> A.) Linear regression
>> - formulas of type: y = intercept + a1 * X1 + a2 * X2 + a3 * X3 + ...
>> - as seen, ODF doesn't permit a multivariate formula either,
>> i.e. X1, X2, X3, ... are different variables
>>
>> B.) Generalized linear models
>> - formulas differ slightly, but in the case of a logistic regression:
>> p(y) = 1 / (1 + 1/exp(intercept + a1 * X1 + a2 * X2 + a3 * X3 + ...) )
>> where y is a binary variable and p(y) the probability of y
>>
>> C.) Non-Linear models
>> - this is the most interesting
>> - it allows specifying the formula for the regression
>> - e.g. let's say we want to determine the coefficients a & b for:
>> a * x / (x*x + b)
>> in R, this looks like:
>> model.nls <- nls( y ~ a*x / (x*x + b), start=list(a=1, b=1))
>> where y is the outcome and x is the variable
>>
>> As a practical example:
>> [You can copy / paste this in R]
>> x <- rnorm(1000) # generate 1,000 random numbers
>> y <- rnorm(1000) + rnorm(1) * x / (x*x+1)
>> x.nls<-nls(y~a*x / (x*x+b*x+c), start=list(a=1,b=0,c=1))
>> summary(x.nls)
>>
>>> Formula: y ~ a * x/(x * x + b * x + c)
>>>
>>> Parameters:
>>> Estimate Std. Error t value Pr(>|t|)
>>> a -1.83005 0.29605 -6.182 9.24e-10 ***
>>> b -0.04774 0.14024 -0.340 0.734
>>> c 1.40155 0.35845 3.910 9.85e-05 ***
>>> ---
>>> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>>
>> We see that "b" is not statistically significant, so we can remove it 
>> from the model (giving us a * x / (x*x + c)); we can then rerun the 
>> regression using this formula to obtain a better result.
>>
>> The idea is:
>> I want a mechanism to specify the formula used in the regression. 
>> Instead of storing a formula name, it would be wiser to store the 
>> formula itself. This way, one can easily build *complex models* and 
>> *multivariate models* (more than one variable). This is currently 
>> not possible, and ODF lags behind professional packages in every 
>> respect (well, Excel fares poorly in this respect, too, but then you 
>> shouldn't look at Excel when doing regressions).
>>
>> Sincerely,
>>
>> Leonard
>>
>>
>> Laurent BALLAND-POIRIER wrote:
>>> Dear TC Members,
>>>
>>> Please find enclosed a file format modification request that Ingrid
>>> Halama and I wrote. It concerns regression curves in spreadsheets. Some
>>> data are missing in ODF to achieve compatibility with other spreadsheets
>>> such as MS-Excel or Gnumeric. Numerous issues cannot be solved until
>>> these data can be saved.
>>> I hope I am posting in the right place. If not, please explain where to
>>> send this request.
>>>
>>> Best regards,
>>>
>>> Laurent BP
Follow-Ups:
- Re: [office-comment] Demand for modification of ODF file format about regression curve in spreadsheet
  - From: Bryce L Nordgren <bnordgren@fs.fed.us>
References:
- Demand for modification of ODF file format about regression curve in spreadsheet
  - From: Laurent BALLAND-POIRIER <Laurent.Balland-Poirier@laposte.net>
- Re: [office-comment] Demand for modification of ODF file format about regression curve in spreadsheet
  - From: Leonard Mada <discoleo@gmx.net>
- Re: [office-comment] Demand for modification of ODF file format about regression curve in spreadsheet
  - From: Patrick Durusau <patrick@durusau.net>