office-comment message

Subject: Re: [office-comment] Demand for modification of ODF file format about regression curve in spreadsheet

From: Leonard Mada <discoleo@gmx.net>
To: Laurent BALLAND-POIRIER <Laurent.Balland-Poirier@laposte.net>
Date: Mon, 08 Dec 2008 21:59:48 +0200

Dear Laurent,

I miss some frequently encountered regression types.

The most frequent regression type on binary outcome variables is a 
logistic regression. I therefore miss this one.

What puzzles me most, however, is the number of regression types
defined. To put it differently, every new regression type gets its own
specific name.

There is a better alternative, and this alternative is already 
implemented in the S+ language and in the open source R program. It 
basically allows the user to specify the formula for the regression.

There are basically 3 regression models:

A.) Linear regression
- formulas of type: y = intercept + a1 * X1 + a2 * X2 + a3 * X3 + ...
- as seen, ODF doesn't permit a multivariate formula either,
i.e. X1, X2, X3, ... are different variables
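
As a minimal sketch of such a multivariate linear formula in R: the data
below are simulated, and the coefficients (2, 0.5, -1, 3), the sample
size and the noise level are arbitrary assumptions made only for
illustration.

```r
# Simulated data; all numeric choices here are illustrative assumptions.
set.seed(1)
n  <- 200
X1 <- rnorm(n)
X2 <- rnorm(n)
X3 <- rnorm(n)
y  <- 2 + 0.5 * X1 - 1 * X2 + 3 * X3 + rnorm(n, sd = 0.1)

# The formula lists the predictors; lm() adds the intercept by default.
model.lm <- lm(y ~ X1 + X2 + X3)
coef(model.lm)   # estimates for the intercept, a1, a2 and a3
```

The point is that the formula itself, not a named regression type,
tells the software which variables enter the model.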

B.) Generalized linear models
- formulas differ slightly, but in the case of a logistic regression:
p(y) = 1 / (1 + 1/exp(intercept + a1 * X1 + a2 * X2 + a3 * X3 + ...) )
where y is a binary variable and p(y) the probability of y
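
A logistic regression of this form could be sketched in R as follows.
The data are simulated and the true coefficients (-0.5, 1.5, -1) are
assumptions chosen for illustration; plogis(z) is R's built-in
1 / (1 + exp(-z)), which equals the formula above.

```r
# Simulated binary outcome; all numeric choices are illustrative assumptions.
set.seed(2)
n  <- 2000
X1 <- rnorm(n)
X2 <- rnorm(n)

# p(y) = 1 / (1 + 1/exp(intercept + a1 * X1 + a2 * X2))
p <- plogis(-0.5 + 1.5 * X1 - 1 * X2)
y <- rbinom(n, size = 1, prob = p)

# family = binomial selects the logit link, i.e. a logistic regression.
model.glm <- glm(y ~ X1 + X2, family = binomial)
coef(model.glm)   # estimates for the intercept, a1 and a2
```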

C.) Non-Linear models
- this is the most interesting
- it allows specifying the formula for the regression
- e.g. let's say we want to determine the coefficients a & b for:
a * x / (x*x + b)
in R, this looks like:
model.nls <- nls( y ~ a*x / (x*x + b), start=list(a=1, b=1))
where y is the outcome and x is the variable

As a practical example:
[You can copy / paste this in R]
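x <- rnorm(1000)   # generate 1,000 random numbers
# outcome: random noise plus a curve with a random true "a" (true b = 0, c = 1)
y <- rnorm(1000) + rnorm(1) * x / (x*x + 1)
# fit the larger model; start gives the initial guesses for a, b and c
x.nls <- nls(y ~ a*x / (x*x + b*x + c), start=list(a=1, b=0, c=1))
summary(x.nls)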

> Formula: y ~ a * x/(x * x + b * x + c)
>
> Parameters:
>   Estimate Std. Error t value Pr(>|t|)
> a -1.83005    0.29605  -6.182 9.24e-10 ***
> b -0.04774    0.14024  -0.340    0.734
> c  1.40155    0.35845   3.910 9.85e-05 ***
> ---
> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

We see that "b" is not statistically significant (p = 0.734), so we can
remove it from the model. This gives a * x / (x*x + c), and we can rerun
the regression using this formula to obtain a better result.
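
A sketch of that rerun, under assumptions of my own: the data are
simulated again with a fixed seed, a known a = 2 and a smaller noise
level (sd = 0.1) so that the fit converges reliably. The numbers will
therefore differ from the output quoted above.

```r
# Simulated data; seed, true a = 2 and sd = 0.1 are illustrative assumptions.
set.seed(3)
x <- rnorm(1000)
y <- 2 * x / (x*x + 1) + rnorm(1000, sd = 0.1)

# Full model with the b*x term, as in the example above.
fit.full <- nls(y ~ a*x / (x*x + b*x + c), start = list(a = 1, b = 0, c = 1))

# Reduced model with "b" removed.
fit.reduced <- nls(y ~ a*x / (x*x + c), start = list(a = 1, c = 1))

summary(fit.reduced)
# F test comparing the two nested fits.
anova(fit.reduced, fit.full)
```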

The idea is:
I want a mechanism to specify the formula used in the regression. 
Instead of storing a formula name, it would be wiser to store the 
formula itself. This way, one can easily build *complex models* and 
*multivariate models* (more than one variable). This is currently
not possible, and ODF lags behind professional packages in every respect
(well, Excel fares poorly in this respect, too, but then you shouldn't
look at Excel when doing regressions).
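
To illustrate what "storing the formula itself" could look like, here is
a rough sketch in R. The `stored` list is purely hypothetical and is not
an ODF structure; it only assumes that a file keeps the formula as text
together with starting values. The data are simulated.

```r
# Hypothetical stored representation: formula text plus starting values.
stored <- list(
  formula = "y ~ a * x / (x * x + c)",
  start   = list(a = 1, c = 1)
)

# Simulated data; seed, true a = 2 and sd = 0.1 are illustrative assumptions.
set.seed(4)
x   <- rnorm(500)
dat <- data.frame(x = x, y = 2 * x / (x * x + 1) + rnorm(500, sd = 0.1))

# as.formula() parses the stored text back into a formula object,
# so any model can be refitted without a predefined regression name.
fit <- nls(as.formula(stored$formula), data = dat, start = stored$start)
coef(fit)

# deparse() turns the fitted formula back into text for storage.
deparse(formula(fit))
```

With this approach a new model needs no new type name; only the stored
text changes.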

Sincerely,

Leonard

Laurent BALLAND-POIRIER wrote:
> Dear TC Members,
>
> Please find enclosed a file format modification demand that Ingrid
> Halama and I wrote. It is about regression curves in spreadsheets. Some
> data are missing in ODF to get compatibility with other spreadsheets
> such as MS-Excel or Gnumeric. Numerous issues cannot be solved until
> these data can be saved.
> I hope I am posting in the right place. If not, please explain where to
> send this demand.
>
> Best regards,
>
> Laurent BP
>

Follow-Ups:
- Re: [office-comment] Demand for modification of ODF file format about regression curve in spreadsheet
  - From: Patrick Durusau <patrick@durusau.net>

References:
- Demand for modification of ODF file format about regression curve in spreadsheet
  - From: Laurent BALLAND-POIRIER <Laurent.Balland-Poirier@laposte.net>