[Date Prev] | [Thread Prev] | [Thread Next] | [Date Next] -- [Date Index] | [Thread Index] | [List Home]

*Subject*: **Re: [office-comment] DISTINCT Values**

*From*:**Leonard Mada <discoleo@gmx.net>***To*: office-comment@lists.oasis-open.org*Date*: Sat, 09 Jun 2007 03:15:43 +0300

Hi, Patrick Durusau wrote: > Leonard, > > Another interesting comment! > > Hmmm, but isn't it true that what OpenDocument should define are the > operators that work with DISTINCT (and other functions) such that you > can write your own functions? > > What concerns me is that no matter how large the eventual list, we are > going to miss that special function that someone has to have. If we > give them the ability to use standard operators with which to write > their specialized functions, that lessens the load on us and gives > users greater freedom. Unfortunately, I have a very different and justified opinion. Maybe I have some better understanding of spreadsheet use than many people on this list as I have both extensively used spreadsheets myself for working overlooked 100+ employees working extensively with spreadsheets extensively used spreadsheets in research settings I experienced frequent ERRORS and COMMON use cases. One of the most frequent uses for spreadsheets could be described as follows: We have various items (wich can be repeated) and for each item we have some characteristics. e.g. a sales items - a location/supermarket - number of sales a different sales item - some other location - number of sales the same item - ... ... 1.) So, we may want to know how many items were sold in every location, what were the global revenues per item, ... . Indeed, spreadsheets are still frequently used in such settings. 2.) A different example: We may collect the drug prescriptions over 12 month: patient ID - drug prescribed - amount - unit price - doctor prescribing Of course, any patient may take more than one type of drug and even the same drug can be bought 2 or more times. So, me may want to analyse: - what are the costs per patient: aka SUM(for every group of DISTINCT(patient_ID); unit_price * amount) - how many prescriptions were for each drug, aka: COUNT( for every group of DISTINCT(drug_prescribed); amount) - cost per drug: which are the most expensive drugs? SUM( for every group of DISTINCT(drug_prescribed); unit_price * amount) - which doctors prescribe most of the drugs; ... 3.) A research example age - various physiologic parameters - disease - outcome - effect of age on various parameters stratify on AGE and compute the desired parameters in these age groups, e.g. <10 yr, 10-19, 20-30, 30-40, ... => quite time consuming IF not done automatically (see the DISTINCT( 'cell_range', AS.TEXT = FALSE, 'cell_range2') example in my first post) Indeed, many of these analysis could be done using databases, but much of this work is still done in spreadsheets. Spreadsheets have the advantage that the structure is quite loose, allowing greater flexibility. Also, one has greater overview over the data, and can better decide what analysis is warranted. Indeed. much of my recent work involves such auditing. DISTINCT() per se was NOT the primary reason I need it. It is this extension: 1.) to group data based on some parameter 2.) compute various other values for every such group (Although I occasionally need to know the number of factors as well.) One can view this as various *vector operations*, where one of the parameters (or even more parameters) act as *categorizing factors*, splitting the population into various groups of data (similarly to the R as.factor() / http://cran.R-project.org). One wishes to analyse individually these groups, BUT as there may be 1000+ groups, sorting and doing ALL these operations individually is really NOT possible. Trust me, I work with datasets of 100,000+ records (even substantially more). And this is quite a common task (and will remain so in the foreseeable future). When using more than one factor, spreadsheets break completely. There are various custom solutions, BUT shouldn't the spreadsheet conquer this field back? I haven't seen any major benefit of custom solutions. They are way to inflexible, and much too expensive. Sincerely, Leonard Mada > Hope you are having a great day! > > Patrick > > Leonard Mada wrote: > >> I strongly miss a function to return the number of DISTINCT values >> existent in a given cell range. To my knowledge, this functionality >> is missing in every spreadsheet application, although most of the >> research involves such analysis. >> >> I would be further interested to perform some operations using these >> distinct values. >> >> Functions: >> DISTINCT( 'cell_range', AS.TEXT = TRUE, IGNORE.CASE = TRUE) >> returns number of distinct strings in the cell range >> >> >> DISTINCT( 'cell_range', AS.TEXT = FALSE, TOLERANCE = 0) >> returns number of different values >> DISTINCT( 'cell_range', AS.TEXT = FALSE, TOLERANCE = 0.5, ORDER = >> "ASCENDING") >> returns number of different values; values within TOLERANCE are >> considered EQUAL >> - values are ranked first using the specified ORDER >> - IF( x[i] is within TOLERANCE of x[i-1]), the 2 values are >> considered equal >> DISTINCT( 'cell_range', AS.TEXT = FALSE, TOLERANCE = '5%', ORDER = >> "ASCENDING") >> returns number of different values; values within TOLERANCE are >> considered EQUAL >> - values are ranked first using the specified ORDER >> - IF( x[i] is within TOLERANCE of x[i-1]), the 2 values are >> considered equal >> - the absolute value for TOLERANCE is computed as x[i-1] * TOLERANCE >> >> >> DISTINCT( 'cell_range', AS.TEXT = FALSE, 'cell_range2') >> returns number of different values within 'cell_range' >> - 'cell_range2' describes the bounds/intervals used for splitting >> the initial data (i.e. for splitting 'cell_range') >> - the values within 'cell_range2' are ranked >> - IF any value from 'cell_range' < MIN('cell_range2') => this is >> first group >> - any value from 'cell_range' is within >> RANKED(cell_range2)[1] >> RANKED(cell_range2)[2] => next group >> - ... >> - any value from 'cell_range' > MAX('cell_range2') => this is the >> last group >> >> Of course, DISTINCT() is just one aspect of the analysis. Actually, I >> am more interested in doing specific calculations based on these >> distinct groups (like SUM(), COUNT(), ...). I will describe such >> calculations in a later post. >> >> Sincerely, >> >> Leonard Mada

**Follow-Ups**:**Re: [office-comment] DISTINCT Values***From:*Patrick Durusau <patrick@durusau.net>

**References**:**DISTINCT Values***From:*Leonard Mada <discoleo@gmx.net>

**Re: [office-comment] DISTINCT Values***From:*Patrick Durusau <patrick@durusau.net>

[Date Prev] | [Thread Prev] | [Thread Next] | [Date Next] -- [Date Index] | [Thread Index] | [List Home]