office-comment message

Subject: Re: [office-comment] DISTINCT Values
From: Leonard Mada <discoleo@gmx.net>
To: office-comment@lists.oasis-open.org
Date: Sat, 09 Jun 2007 03:15:43 +0300
Hi,

Patrick Durusau wrote:
> Leonard,
>
> Another interesting comment!
>
> Hmmm, but isn't it true that what OpenDocument should define are the 
> operators that work with DISTINCT (and other functions) such that you 
> can write your own functions?
>
> What concerns me is that no matter how large the eventual list, we are 
> going to miss that special function that someone has to have. If we 
> give them the ability to use standard operators with which to write 
> their specialized functions, that lessens the load on us and gives 
> users greater freedom.

Unfortunately, I have a very different and justified opinion. Maybe I 
have some better understanding of spreadsheet use than many people on 
this list as I have both
    extensively used  spreadsheets myself for working
    overlooked 100+ employees working extensively with spreadsheets
    extensively used  spreadsheets in research settings

I experienced frequent ERRORS and COMMON use cases.

One of the most  frequent uses for spreadsheets could be described as 
follows:

We have various items (wich can be repeated) and for each item we have 
some characteristics. e.g.
 a sales items - a location/supermarket - number of sales
 a different sales item - some other location - number of sales
 the same item - ...
 ...

1.) So, we may want to know how many items were sold in every location, 
what were the global revenues per item, ... . Indeed, spreadsheets are 
still frequently used in such settings.

2.) A different example:
We may collect the drug prescriptions over 12 month:
patient ID - drug prescribed - amount - unit price - doctor prescribing
Of course, any patient may take more than one type of drug and even the 
same drug can be bought 2 or more times. So, me may want to analyse:
 - what are the costs per patient: aka  SUM(for every group of 
DISTINCT(patient_ID); unit_price * amount)
 - how many prescriptions were for each drug, aka:
   COUNT( for every group of DISTINCT(drug_prescribed); amount)
 - cost per drug: which are the most expensive drugs?
    SUM( for every group of DISTINCT(drug_prescribed); unit_price * amount)
 - which doctors prescribe most of the drugs; ...

3.) A research example
age - various physiologic parameters - disease - outcome
 - effect of age on various parameters
   stratify on AGE and compute the desired parameters in these age 
groups, e.g.
    <10 yr, 10-19, 20-30, 30-40, ... => quite time consuming IF not done 
automatically
 (see the DISTINCT( 'cell_range', AS.TEXT = FALSE, 'cell_range2') 
example in my first post)

Indeed, many of these analysis could be done using databases, but much 
of this work is still done in spreadsheets. Spreadsheets have the 
advantage that the structure is quite loose, allowing greater 
flexibility. Also, one has greater overview over the data, and can 
better decide what analysis is warranted.

Indeed. much of my recent work involves such auditing. DISTINCT() per se 
was NOT the primary reason I need it. It is this extension:
 1.) to group data based on some parameter
 2.) compute various other values for every such group
(Although I occasionally need to know the number of factors as well.)

One can view this as various *vector operations*, where one of the 
parameters (or even more parameters) act as *categorizing factors*, 
splitting the population into various groups of data (similarly to the R 
as.factor() / http://cran.R-project.org). One wishes to analyse 
individually these groups, BUT as there may be 1000+ groups, sorting and 
doing ALL these operations individually is really NOT possible. Trust 
me, I work with datasets of 100,000+ records (even substantially more). 
And this is quite a common task (and will remain so in the foreseeable 
future). When using more than one factor, spreadsheets break completely.

There are various custom solutions, BUT shouldn't the spreadsheet 
conquer this field back? I haven't seen any major benefit of custom 
solutions. They are way to inflexible, and much too expensive.

Sincerely,

Leonard Mada

> Hope you are having a great day!
>
> Patrick
>
> Leonard Mada wrote:
>
>> I strongly miss a function to return the number of DISTINCT values 
>> existent in a given cell range. To my knowledge, this functionality 
>> is missing in every spreadsheet application, although most of the 
>> research involves such analysis.
>>
>> I would be further interested  to perform some operations using these 
>> distinct values.
>>
>> Functions:
>> DISTINCT( 'cell_range', AS.TEXT = TRUE, IGNORE.CASE = TRUE)
>> returns number of distinct strings in the cell range
>>
>>
>> DISTINCT( 'cell_range', AS.TEXT = FALSE, TOLERANCE = 0)
>> returns number of different values
>> DISTINCT( 'cell_range', AS.TEXT = FALSE, TOLERANCE = 0.5, ORDER = 
>> "ASCENDING")
>> returns number of different values; values within TOLERANCE are 
>> considered EQUAL
>>  - values are ranked first using the specified ORDER
>>  - IF( x[i] is within TOLERANCE of x[i-1]), the 2 values are 
>> considered equal
>> DISTINCT( 'cell_range', AS.TEXT = FALSE, TOLERANCE = '5%', ORDER = 
>> "ASCENDING")
>> returns number of different values; values within TOLERANCE are 
>> considered EQUAL
>>  - values are ranked first using the specified ORDER
>>  - IF( x[i] is within TOLERANCE of x[i-1]), the 2 values are 
>> considered equal
>>  - the absolute value for TOLERANCE is computed as x[i-1] * TOLERANCE
>>
>>
>> DISTINCT( 'cell_range', AS.TEXT = FALSE, 'cell_range2')
>> returns number of different values within 'cell_range'
>>  - 'cell_range2' describes the bounds/intervals used for splitting 
>> the initial data (i.e. for splitting 'cell_range')
>>  - the values within 'cell_range2' are ranked
>>  - IF any value from 'cell_range' < MIN('cell_range2') => this is 
>> first group
>>  - any value from 'cell_range' is within
>>        RANKED(cell_range2)[1]
>>        RANKED(cell_range2)[2] => next group
>>  - ...
>>  - any value from 'cell_range' > MAX('cell_range2') => this is the 
>> last group
>>
>> Of course, DISTINCT() is just one aspect of the analysis. Actually, I 
>> am more interested in doing specific calculations based on these 
>> distinct groups (like SUM(), COUNT(), ...). I will describe such 
>> calculations in a later post.
>>
>> Sincerely,
>>
>> Leonard Mada
Follow-Ups:
- Re: [office-comment] DISTINCT Values
  - From: Patrick Durusau <patrick@durusau.net>
References:
- DISTINCT Values
  - From: Leonard Mada <discoleo@gmx.net>
- Re: [office-comment] DISTINCT Values
  - From: Patrick Durusau <patrick@durusau.net>