office-comment message

Subject: Re: [office-comment] Statistical SUMMARY() function
From: Leonard Mada <discoleo@gmx.net>
To: office-comment@lists.oasis-open.org
Date: Wed, 04 Jun 2008 19:44:58 +0300
Hello Robert, dear list members,


robert_weir@us.ibm.com wrote:
> HI Leonard,
>
> Thanks for the proposal.
>
> Another way of accomplish this might be something in the middle.  So, 
> imagine a application MENU choice that instead of calculating the values 
> and pasting them in as static values -- an approach with the liabilities 
> you mention -- have the Menu option paste in the underlying formulas.  So 
> a menu option that pastes in a range/block of formulas.

Unfortunately, I did thought about this feature, BUT there are some 
major flaws to this design and I opt now strongly against it:
1.) user does not have always an oversight what did happen
     e.g. 2-3 month later, or half a year later
     - most often, you need the data some time after the analysis proper,
       when publishing or preparing a presentation
     - having a single =SUMMARY() array function is way easier
        to keep track off

2.) PUBLISHING
Now, the main reason I opt for the function is the fact, that separate 
results are seldom useful:
Why do I have to copy/paste single values when I always need a pair of 
values and the accompanying text?

e.g. when publishing in a scientific journal, you usually write:
   mean (SD) 120 (7.34)
  [I hate the mean +- SD notation, backing other famous statisticians 
with my request]
   or
   median (range) 75 (25 - 100)
  [where range actually points to min and max]

Similarly, during a presentation, single values make NO sense.

So, the cell will contain 3 computed values. This is difficult with 
existing functions.

3.) COMPUTING EFFICIENCY
As I mentioned in the post on extending the PERCENTILE() function, a 
composite function will be significantly much faster than separate 
functions. With increasing datasets, and some of the robust methods (I 
did include in the extended analysis), this becomes a problem even on 
modern computers.

e.g. the most accurate algorithm to compute the SD is a 2-pass algorithm 
(though fast and robust one-pass algorithms do exist). With the 
composite function, you already compute the mean, so computing the SD 
does not add any further penalty. Similarly, for computing the min, max, 
median, IQR and IPR95%, you would sort only once (when using the 
sort-based algorithm), and have all the values at once. It really does 
make sense to have them all computed at once.

Did I mention, that professional statistical packages (like R, EpiInfo, 
SPSS, ...) have this functionality built in (what is lacking is the 
publishing extension - I added this flavour).

4.) TEACHING PEOPLE vs CORRECT RESULTS
As stated, the function will combine a parameter of central tendency 
(median or mean) with the appropriate parameter of spread. No need for 
teaching the user what to use. [Such teaching attempts are most often 
elusive, as I can personally witness.] The user will just copy/paste the 
formatted results, nothing more and nothing less. [Hopefully, people 
will start learning by seeing the correct technique, though,a s long as 
it works even if they do not get it, it is acceptable.]

I hope this makes it clear why I prefer the integrated function solution.

Many thanks,

Leonard

P.S. some small remarks are attached below

> This approach 
> might be applicable in other places as well: a formula block template, 
> where you have a group of related functions that you want to paste in at 
> once, but have a single dialog box for filling out the common parameters. 
> This would make it more of an application UI feature than a format 
> question.
>
> In general, OpenFormula 1.0  is not trying to extend the conventional 
> spreadsheet formula model used by Excel, etc.

Which is bad. Than what is the novelty OpenFormula will bring?

> We're trying to create 
> something that is well-documented and compatible.

Have you seen Apple's Numbers?
It breaks some new grounds by separating the Canvas from the spreadsheet 
model. I sense some moving toward multidimensional spreadsheets. Don't 
mind, but if Apple really implements a multidimensional model in the 
next release of Numbers, than OpenFormula is pretty much obsolete.

By the way, people don't care about formulas. they want to solve a 
problem. This is something I learned during a lot of auditing.

They might solve the problem using some formulas. But this is usually 
*much harder* than it should be. People should solve their problems 
without much effort, without many formulas and thinking and 
interactions. It should function by the click of a button.

Now, existing spreadsheets are far from this goal, very far away. I like 
Apple's approach, and I am sure we will hear more about Aplle in the 
future (and about MS, too).

> However, once we have 
> that core, I am quite open to the idea of extending it in the future to 
> create more specialized functionality. 
>
> My preference would be to treat this holistically.  For example, we take 
> it as a task to improve ODF for use in engineering.  We create a 
> subcommittee to investigate this topic, talk with engineers, find out what 
> is needed, etc., and the results come back that spreadsheet needs some 
> more formulas, more chart types, and presentations need the ability to use 
> mathematical notation.  There could be similar efforts for medical, 
> academic, financial, etc.  But it will work best where we an alignment of 
> interested experts willing to help specify the requirements, as well as 
> vendors or open source projects willing to implement these features.
>
> -Rob
>
> Leonard Mada <discoleo@gmx.net> wrote on 06/02/2008 06:42:36 PM:
>
>   
>> Dear list members,
>>
>> One of the most used features in spreadsheets is to analyze and 
>> summarize the data. This is used in academia and businesses alike.
>>
>> In both instances, spreadsheet users often need to perform an 
>> exploratory data analysis and in one of the first steps they will 
>> summarize their data.
>>
>> I discussed this issue more fully at:
>> http://www.openoffice.org/issues/show_bug.cgi?id=89980
>>
>> While Calc (and ODF) has (most) functions implemented as separate 
>> funcitons, it lacks a unified function to compute all parameters at 
>> once. MS Excel does come with such a 'Menu'-entry in the advanced 
>> statistical toolpack.
>>
>>
>> A.) MENU vs FUNCTION
>> ====================
>> Excel did implement a menu-command. Unfortunately, IF the data-changes, 
>> the computed statistics won't change and need to be manually updated. 
>> The user might not notice that the data has changed.
>>
>> Therefore I opt strongly against the Menu-command and for a 
>>     
> proper-function.
>   
>> Also, using menu commands is not transparent, especially IF the 
>> spreadsheet will be later used by someone else or if the user revisits 
>> the spreadsheet a couple of months later.
>>
>>
>> B.) COMPARATIVE SOFTWARE ANALYSIS
>> =================================
>>
>> 1.) MS Excel: does have a SUMMARY-Menu command in the advanced
>>     statistical package (but no separate function)
>>
>> 2.) professional statistical programs:
>>     e.g. R (http://cran.R-project.org) has a summary() function
>>     [and some other summary-type functions, like fivenum()]
>>
>> C.) UTILITY
>> ===========
>> Both academia (research) and businesses alike would benefit from such a 
>> feature.
>>
>> Users will be able to easily get all the relevant summaries. Summaries 
>> will be formatted as for publishing / presentation (see options later).
>>
>> D.) REASONING
>> =============
>> Despite my desire to teach people correctly summarizing the data, and 
>> compute the descriptive statistics, I have to admit that this is an 
>> illusive goal. It is much easier to have a function that computes 
>> everything automatically without any user-interaction, than hope for ALL 
>>     
>
>   
>> these people learning descriptive statistics.
>>
>>
>> D.) IMPLEMENTATION
>> ==================
>>
>> 1.) Function-Name: obviously: SUMMARY()
>>
>> 2.) Parameters: (range; option)
>>   - range: the data range in the spreadsheet
>>   - option:
>>      -- default: "BRIEF" OR 0: compute most important summaries
>>      -- "DETAILED" OR 1: compute advanced summaries
>>      -- "PUBLICATION" OR 2: compute summaries and format as for
>>          publication in a professional journal
>>      -- "TABLE" OR 4: compute summaries as for publication,
>>          BUT display them in tabular format (for inclusion in a table)
>>      -- some other: ... lets leave the feature-list open
>>   - results:
>>      -- an array with the important summaries
>>
>> IMPORTANT: This is an array function!
>>
>> REASONING for Options:
>>  - most of the time, users need only some basic descriptive statistics.
>>  - I am against displaying ALL possible parameters, when most of them
>>    are of NO use.
>>   [this is descriptive statistics]
>>  - the 'DETAILED' option will print more advanced statistics.
>>  - the 'PUBLICATION' OR 'PRINT'-option will additionally format the 
>>     
> results
>   
>>    as for print, so that the user can simply COPY/PASTE the results
>>  - the 'TABLE' is as the 'PUBLICATION', but in tabular format
>>
>> 3.) COMPUTED OUTPUT
>> ===================
>>  a.) the function shall verify IF the data is:
>>   -- numerical non-binary
>>   -- binary (numerical or categorical)
>>   -- categorical, NON-binary
>>   -- obviously, it can't detect ordinal values
>>
>>  b.) it should verify IF the first row in the series is a LABEL
>>   -- it should skip any eventual labels
>>
>>  c.) computed values should be:
>>
>>  NUMERICAL DATA
>>    BRIEF DESCRIPTIVE STATISTICS
>>      min
>>      first quartile
>>      median
>>      third quartile
>>      max
>>      mean
>>      standard deviation
>>      median absolute deviation of the median (MAD)
>>       -- has the best possible breakdown point
>>     (I deliberately omit the mode; NO strong opinion about variance)
>>
>>    ADVANCED DESCRIPTIVE STATISTICS
>>      ALL BRIEF PLUS additionally
>>      variance
>>      mode (well, lets include it)
>>      standard error
>>      CV (coefficient of variation)
>>      percentile 2.5% and 97.5%
>>      [for the IPR95%]
>>      mean absolute deviation of the mean (ADM - or average deviation)
>>      [although this is better known as the previously mentioned MAD, it 
>>     
> is
>   
>>       actually a much less robust estimator than the MAD]
>>      very robust locationless estimators:
>>        Sn = 1.1926 * MEDi{MEDj|Xi-Xj|)
>>        Qn = QUANTILE( |Xi-Xj|; 0.25), where i<j
>>     [see http://www.jstor.org/pss/2291267 for details]
>>
>>    PRINT
>>     - add text to the same cells and
>>     - group relevant parameters together
>>     - to allow easy COPY/PASTE:
>>     *median (range)* 'median (min-max)'
>>          -- the '...' are the actual values for these parameters
>>          -- the *...* are strings
>>          -- the range, IQR and IPR95% refer to the 2 separate values,
>>              NOT the difference
>>     *median (interquartile range)* 'value (value - value)'
>>     *median (IPR95%)* 'value (value - value)'
>>     *mean (SD)* 'value (value)'
>>
>>     - both the TEXT and the values should be provided in the same cell,
>>       so as to allow easy COPY/PASTE
>>     - both 3 versions of the MEDIAN (...) should be provided
>>       [they serve different purposes and range / IQR / IPR95% are always
>>        grouped with the median]
>>     - I prefer 'mean (SD)' over mean +- SD, because the mean +- SD makes
>>       actually NO sense; it is mean +- 2*SD that makes sense
>>
>>    TABLE:
>>     - as PRINT, BUT display the names/text-strings in the first column
>>     - and the values in the next column
>>     - to easily COPY/PASTE as a table
>>
>> I will discuss the NON-numerical details in a later post.
>>
>> Sincerely,
>>
>> Leonard
>>
>> P.S. I hope to get some time later this week to rephrase the BINOMDIST() 
>>     
>
>   
>> description. I did not forget about it, I was just too busy.
>>
>> -- 
>> This publicly archived list offers a means to provide input to the
>> OASIS Open Document Format for Office Applications (OpenDocument) TC.
>>
>> In order to verify user consent to the Feedback License terms and
>> to minimize spam in the list archive, subscription is required
>> before posting.
>>
>> Subscribe: office-comment-subscribe@lists.oasis-open.org
>> Unsubscribe: office-comment-unsubscribe@lists.oasis-open.org
>> List help: office-comment-help@lists.oasis-open.org
>> List archive: http://lists.oasis-open.org/archives/office-comment/
>> Feedback License: http://www.oasis-open.org/who/ipr/feedback_license.pdf
>> List Guidelines: http://www.oasis-open.org/maillists/guidelines.php
>> Committee: 
>>     
> http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=office
>   
>
>
References:
- Statistical SUMMARY() function
  - From: Leonard Mada <discoleo@gmx.net>
- Re: [office-comment] Statistical SUMMARY() function
  - From: robert_weir@us.ibm.com