Subject: Re: [office-comment] Demand for modification of ODF file format about regression curve in spreadsheet

Dear Laurent,

Some frequently used regression types are missing from the proposal.

The most common regression for binary outcome variables is logistic 
regression, and it is not in the list.

What surprises me most, however, is the number of named regression 
types: every new regression type gets its own specific name.

There is a better alternative, and this alternative is already 
implemented in the S+ language and in the open source R program. It 
basically allows the user to specify the formula for the regression.

There are basically three families of regression models:

A.) Linear regression
- formulas of type: y = intercept + a1 * X1 + a2 * X2 + a3 * X3 + ...
- note that ODF doesn't permit a multivariate formula either,
i.e. one where X1, X2, X3, ... are different variables
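
As a rough sketch of case A in R (the sample size, the coefficients 2, 1.5, -0.5 and 0, and the noise level are made-up values for illustration only), a multivariate linear fit is just a formula with several terms:

```r
set.seed(1)                      # fixed seed so the run is repeatable
n  <- 200
X1 <- rnorm(n)
X2 <- rnorm(n)
X3 <- rnorm(n)
# simulated outcome: intercept + a1*X1 + a2*X2 + a3*X3 + noise (assumed values)
y  <- 2 + 1.5 * X1 - 0.5 * X2 + 0 * X3 + rnorm(n, sd = 0.1)

fit.lm <- lm(y ~ X1 + X2 + X3)   # the intercept is included by default
coef(fit.lm)                     # estimated intercept, a1, a2, a3
```

Adding or dropping a variable only changes the right-hand side of the formula; no new regression name is needed.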

B.) Generalized linear models
- formulas differ slightly, but in the case of a logistic regression:
p(y) = 1 / (1 + 1/exp(intercept + a1 * X1 + a2 * X2 + a3 * X3 + ...) )
where y is a binary variable and p(y) the probability that y = 1
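
A minimal sketch of case B in R, assuming simulated data (the coefficients -0.5, 1.0 and -2.0 and the sample size are invented for illustration). The outcome is drawn using the p(y) formula above, and glm() with the binomial family then estimates the coefficients:

```r
set.seed(2)                          # fixed seed so the run is repeatable
n   <- 5000
X1  <- rnorm(n)
X2  <- rnorm(n)
eta <- -0.5 + 1.0 * X1 - 2.0 * X2    # linear predictor (assumed coefficients)
p   <- 1 / (1 + 1 / exp(eta))        # logistic formula, as written above
y   <- rbinom(n, size = 1, prob = p) # binary outcome drawn with probability p

fit.glm <- glm(y ~ X1 + X2, family = binomial)  # logistic regression
coef(fit.glm)                        # estimated intercept, a1, a2
```

The formula passed to glm() has the same shape as in the linear case; only the family argument says that the outcome is binary.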

C.) Non-Linear models
- this is the most interesting
- it allows specifying the formula for the regression
- e.g. let's say we want to determine the coefficients a and b for:
a * x / (x*x + b)
in R, this looks like:
model.nls <- nls( y ~ a*x / (x*x + b), start=list(a=1, b=1))
where y is the outcome and x is the variable

As a practical example:
[You can copy / paste this in R]
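x <- rnorm(1000)  # generate 1,000 random numbers
y <- rnorm(1000) + rnorm(1) * x / (x*x+1)  # noise plus a * x / (x*x + 1), with a random a
x.nls <- nls(y ~ a*x / (x*x + b*x + c), start=list(a=1, b=0, c=1))
summary(x.nls)  # prints the coefficient table below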

> Formula: y ~ a * x/(x * x + b * x + c)
> Parameters:
>   Estimate Std. Error t value Pr(>|t|)
> a -1.83005    0.29605  -6.182 9.24e-10 ***
> b -0.04774    0.14024  -0.340    0.734
> c  1.40155    0.35845   3.910 9.85e-05 ***
> ---
> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

We see that "b" is not statistically significant (p = 0.734), so we can 
remove it from the model, which leaves a * x / (x*x + c). Rerunning the 
regression with this formula gives a simpler model with one parameter 
fewer to estimate.
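
The refit could look roughly like this. It is only a sketch on freshly simulated data: the seed, the true values a = -2 and c = 1, and the noise level are assumptions chosen so the example is repeatable, so the numbers will not match the table above.

```r
set.seed(3)                                    # fixed seed so the run is repeatable
x <- rnorm(1000)
y <- -2 * x / (x * x + 1) + rnorm(1000, sd = 0.2)  # assumed truth: a = -2, b = 0, c = 1

# full model with the extra b * x term in the denominator
fit.full    <- nls(y ~ a * x / (x * x + b * x + c),
                   start = list(a = 1, b = 0, c = 1))
# reduced model with b removed
fit.reduced <- nls(y ~ a * x / (x * x + c),
                   start = list(a = 1, c = 1))

coef(fit.reduced)                 # estimates of a and c
anova(fit.reduced, fit.full)      # F test: does the b term improve the fit?
AIC(fit.reduced, fit.full)        # lower AIC indicates the preferred model
```

Switching between the two models is a one-line change to the formula, which is the point of specifying the formula rather than a model name.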

The idea is:
I want a mechanism to specify the formula used in the regression. 
Instead of storing a formula name, it would be wiser to store the 
formula itself. This way, one can easily build *complex models* and 
*multivariate models* (more than one variable). This is currently 
not possible, and in this respect ODF lags behind professional 
statistics packages (Excel fares poorly here, too, but then you 
shouldn't look at Excel when doing regressions).
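
To illustrate what "storing the formula itself" could mean, here is a small R sketch. The variable names stored.formula and stored.start are hypothetical stand-ins for whatever text a file would hold; nothing here is part of ODF, and the data are simulated with assumed values.

```r
# Text as it might be saved in a document (hypothetical representation)
stored.formula <- "y ~ a * x / (x * x + c)"
stored.start   <- list(a = 1, c = 1)           # starting values for the fit

set.seed(4)                                    # fixed seed so the run is repeatable
x <- rnorm(1000)
y <- -2 * x / (x * x + 1) + rnorm(1000, sd = 0.2)  # assumed truth: a = -2, c = 1

f <- as.formula(stored.formula)                # turn the stored text back into a formula
all.vars(f)                                    # names used in the formula
fit <- nls(f, data = data.frame(x = x, y = y), start = stored.start)
coef(fit)                                      # estimated a and c
```

Any application that can parse the expression can refit or redraw the curve, without needing a dedicated name for each model.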



Laurent BALLAND-POIRIER wrote:
> Dear TC Members,
> Please find enclosed a file format modification demand that Ingrid
> Halama and me wrote. It is about regression curves in spreadsheet. Some
> data are missing in ODF to get compatibility with other spreadsheets
> such as MS-Excel or Gnumeric. Numerous issues will not be solved till
> these data can not be saved.
> I hope I post in the right place. If not, please explain where to send
> this demand.
> Best regards,
> Laurent BP

