office-comment message

Subject: Statistical SUMMARY() function
From: Leonard Mada <discoleo@gmx.net>
To: office-comment@lists.oasis-open.org
Date: Tue, 03 Jun 2008 01:42:36 +0300
Dear list members,

One of the most used features in spreadsheets is to analyze and 
summarize the data. This is used in academia and businesses alike.

In both instances, spreadsheet users often need to perform an 
exploratory data analysis and in one of the first steps they will 
summarize their data.

I discussed this issue more fully at:
http://www.openoffice.org/issues/show_bug.cgi?id=89980

While Calc (and ODF) has (most) functions implemented as separate 
funcitons, it lacks a unified function to compute all parameters at 
once. MS Excel does come with such a 'Menu'-entry in the advanced 
statistical toolpack.


A.) MENU vs FUNCTION
====================
Excel did implement a menu-command. Unfortunately, IF the data-changes, 
the computed statistics won't change and need to be manually updated. 
The user might not notice that the data has changed.

Therefore I opt strongly against the Menu-command and for a proper-function.

Also, using menu commands is not transparent, especially IF the 
spreadsheet will be later used by someone else or if the user revisits 
the spreadsheet a couple of months later.


B.) COMPARATIVE SOFTWARE ANALYSIS
=================================

1.) MS Excel: does have a SUMMARY-Menu command in the advanced
    statistical package (but no separate function)

2.) professional statistical programs:
    e.g. R (http://cran.R-project.org) has a summary() function
    [and some other summary-type functions, like fivenum()]

C.) UTILITY
===========
Both academia (research) and businesses alike would benefit from such a 
feature.

Users will be able to easily get all the relevant summaries. Summaries 
will be formatted as for publishing / presentation (see options later).

D.) REASONING
=============
Despite my desire to teach people correctly summarizing the data, and 
compute the descriptive statistics, I have to admit that this is an 
illusive goal. It is much easier to have a function that computes 
everything automatically without any user-interaction, than hope for ALL 
these people learning descriptive statistics.


D.) IMPLEMENTATION
==================

1.) Function-Name: obviously: SUMMARY()

2.) Parameters: (range; option)
  - range: the data range in the spreadsheet
  - option:
     -- default: "BRIEF" OR 0: compute most important summaries
     -- "DETAILED" OR 1: compute advanced summaries
     -- "PUBLICATION" OR 2: compute summaries and format as for
         publication in a professional journal
     -- "TABLE" OR 4: compute summaries as for publication,
         BUT display them in tabular format (for inclusion in a table)
     -- some other: ... lets leave the feature-list open
  - results:
     -- an array with the important summaries

IMPORTANT: This is an array function!

REASONING for Options:
 - most of the time, users need only some basic descriptive statistics.
 - I am against displaying ALL possible parameters, when most of them
   are of NO use.
  [this is descriptive statistics]
 - the 'DETAILED' option will print more advanced statistics.
 - the 'PUBLICATION' OR 'PRINT'-option will additionally format the results
   as for print, so that the user can simply COPY/PASTE the results
 - the 'TABLE' is as the 'PUBLICATION', but in tabular format

3.) COMPUTED OUTPUT
===================
 a.) the function shall verify IF the data is:
  -- numerical non-binary
  -- binary (numerical or categorical)
  -- categorical, NON-binary
  -- obviously, it can't detect ordinal values

 b.) it should verify IF the first row in the series is a LABEL
  -- it should skip any eventual labels

 c.) computed values should be:

 NUMERICAL DATA
   BRIEF DESCRIPTIVE STATISTICS
     min
     first quartile
     median
     third quartile
     max
     mean
     standard deviation
     median absolute deviation of the median (MAD)
      -- has the best possible breakdown point
    (I deliberately omit the mode; NO strong opinion about variance)

   ADVANCED DESCRIPTIVE STATISTICS
     ALL BRIEF PLUS additionally
     variance
     mode (well, lets include it)
     standard error
     CV (coefficient of variation)
     percentile 2.5% and 97.5%
     [for the IPR95%]
     mean absolute deviation of the mean (ADM - or average deviation)
     [although this is better known as the previously mentioned MAD, it is
      actually a much less robust estimator than the MAD]
     very robust locationless estimators:
       Sn = 1.1926 * MEDi{MEDj|Xi-Xj|)
       Qn = QUANTILE( |Xi-Xj|; 0.25), where i<j
    [see http://www.jstor.org/pss/2291267 for details]

   PRINT
    - add text to the same cells and
    - group relevant parameters together
    - to allow easy COPY/PASTE:
    *median (range)* 'median (min-max)'
         -- the '...' are the actual values for these parameters
         -- the *...* are strings
         -- the range, IQR and IPR95% refer to the 2 separate values,
             NOT the difference
    *median (interquartile range)* 'value (value - value)'
    *median (IPR95%)* 'value (value - value)'
    *mean (SD)* 'value (value)'

    - both the TEXT and the values should be provided in the same cell,
      so as to allow easy COPY/PASTE
    - both 3 versions of the MEDIAN (...) should be provided
      [they serve different purposes and range / IQR / IPR95% are always
       grouped with the median]
    - I prefer 'mean (SD)' over mean +- SD, because the mean +- SD makes
      actually NO sense; it is mean +- 2*SD that makes sense

   TABLE:
    - as PRINT, BUT display the names/text-strings in the first column
    - and the values in the next column
    - to easily COPY/PASTE as a table

I will discuss the NON-numerical details in a later post.

Sincerely,

Leonard

P.S. I hope to get some time later this week to rephrase the BINOMDIST() 
description. I did not forget about it, I was just too busy.
Follow-Ups:
- Re: [office-comment] Statistical SUMMARY() function
  - From: robert_weir@us.ibm.com
- Re: [office-comment] Statistical SUMMARY() function
  - From: "David A. Wheeler" <dwheeler@dwheeler.com>