office-comment message

Subject: Re: [office-comment] DISTINCT Values

From: Leonard Mada <discoleo@gmx.net>
To: office-comment@lists.oasis-open.org
Date: Sat, 09 Jun 2007 22:39:15 +0300

Hi Patrick,

Patrick Durusau wrote:
> ...
>
> What we need are details. Use case scenarios are useful but only up to 
> a point.
>
> For example, you mention the R as factor () below.

Perhaps I did not explain clearly what I meant.

In simple terms, I wanted an *enhanced function equivalent* of *pivot
tables*. Perhaps that makes it easier to understand. Current
implementations of pivot tables seem quite weak to me, and they are NOT
functions. I therefore want:
 - something more advanced
 - easily expandable / flexible
 - and defined as spreadsheet functions

The function DISTINCT() was meant as the first step in this process.
It would generate the groups of data, i.e. make the categories. Indeed,
these categories would behave like factors (in R, and in statistics
generally, such a variable is called a factor and its categories are
called levels). Further functions would follow to generate the various
reports (these would involve extensive vector operations). Indeed,
factors are used extensively in vector/matrix operations.
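
As a rough illustration of this two-step idea (a sketch only, written in
Python rather than as a spreadsheet formula; the names distinct() and
summarize_by() and the sample data are hypothetical and not part of any
proposal text):

```python
from collections import OrderedDict

def distinct(values):
    """Return the distinct values in order of first appearance (the categories)."""
    return list(OrderedDict.fromkeys(values))

def summarize_by(categories, values, func=sum):
    """Apply func to the values belonging to each category, pivot-table style."""
    groups = OrderedDict((level, []) for level in distinct(categories))
    for level, value in zip(categories, values):
        groups[level].append(value)
    return {level: func(items) for level, items in groups.items()}

# Hypothetical sample data: one category column and one value column.
region = ["North", "South", "North", "East", "South"]
sales = [10, 20, 30, 40, 50]

print(distinct(region))                       # the categories
print(summarize_by(region, sales))            # per-category totals
print(summarize_by(region, sales, func=len))  # per-category counts
```

The first call plays the role of DISTINCT(); the other two stand in for
the "further functions" that would build reports on top of the
categories.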

> Recalling that OpenDocument is an *interchange* format, how do we deal 
> with the following issue?
>
>> Factors are currently implemented using an integer array to specify 
>> the actual levels and a second array of names that are mapped to the 
>> integers. Rather unfortunately users often make use of the 
>> implementation in order to make some calculations easier. This, 
>> however, is an implementation issue and is not guaranteed to hold in 
>> all implementations of R. (Section 2.3.1 Factors, R Language Definition)
## THIS IS A SIDE NOTE
 - the previous WARNING is irrelevant both to ODF and to R users who
   stick to the S+ standard
 - for someone working with factors, it is irrelevant how factors are
   stored *INTERNALLY* in R
 - 'is.factor()' will ALWAYS return TRUE for a factor object,
   irrespective of its internal storage ('as.factor()' converts its
   argument to a factor)
 - internally (in R), factors are currently stored in a way that uses
   integers
    - THIS data structure should however NEVER be known to nor assumed
      by users, and therefore it should NEVER be relied upon (R being
      open source, you can of course look up the details)
    - these are hidden internals
      (that's why you declare members 'private' and 'protected' in C++
      classes: to hide the implementation)
    - however, there are obviously users who make use of this and, even
      worse, perform mathematical calculations with factors
      (it makes NO sense to compare a level "A" mathematically with a
      level "B" or with an integer, BUT some do exactly that)

## END SIDE NOTE
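
To make the point about hidden storage concrete, here is a minimal
sketch in Python (not R, and not how R itself is implemented; the Factor
class and the helpers is_factor() and as_factor() are hypothetical names
chosen to echo the R functions mentioned above). The integer codes exist
but are private, and arithmetic on the object is refused:

```python
class Factor:
    """A categorical variable whose internal storage is deliberately hidden."""

    def __init__(self, values):
        levels = sorted(set(values))
        self.__levels = tuple(levels)
        index = {level: i for i, level in enumerate(levels)}
        # Private implementation detail: callers must not depend on these codes.
        self.__codes = tuple(index[v] for v in values)

    def levels(self):
        """Return the distinct categories."""
        return list(self.__levels)

    def labels(self):
        """Return the category label of each observation."""
        return [self.__levels[c] for c in self.__codes]

    def __add__(self, other):
        raise TypeError("arithmetic is not meaningful for factors")

def is_factor(obj):
    """True for any Factor, however it happens to be stored."""
    return isinstance(obj, Factor)

def as_factor(values):
    """Interpret a sequence of values as a Factor."""
    return values if is_factor(values) else Factor(values)

f = as_factor(["B", "A", "B", "C"])
print(is_factor(f))
print(f.levels())
print(f.labels())
```

The storage could be swapped for anything else (strings, a bit set)
without changing what is_factor(), levels() or labels() return, which is
all a conforming user should rely on.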

> We do not specify implementation details so it is possible for an "as 
> factor()" function to work differently depending upon implementation 
> details.
>
> Having a function defined by a standard work differently is a bad thing.
## SIDE NOTE
  - 'is.factor()' and 'as.factor()' WILL work as expected in R, now and
    in the future
  - only users who interpret the result as an integer would be affected,
    and I fully accept that: they should never have assumed that
    factors are stored as integers
  - *A factor may be purely nominal or may have ordered categories*!!!
    NO mention of integers.
## END SIDE NOTE
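
The nominal/ordered distinction can be sketched as follows (Python again,
purely illustrative; compare_levels() and its ordered_levels parameter
are hypothetical). An ordering exists only when the levels are explicitly
declared as ordered, and callers compare labels, never stored numbers:

```python
def compare_levels(a, b, ordered_levels=None):
    """Compare two category labels.

    For a nominal factor (ordered_levels is None) no ordering is defined.
    For an ordered factor, the order is taken from the declared level list.
    Returns -1, 0 or 1.
    """
    if ordered_levels is None:
        raise TypeError("nominal categories can only be tested for equality")
    rank = {level: i for i, level in enumerate(ordered_levels)}
    return (rank[a] > rank[b]) - (rank[a] < rank[b])

# Hypothetical ordered factor: the order is part of the definition.
sizes = ["small", "medium", "large"]
print(compare_levels("small", "large", ordered_levels=sizes))
```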

CONCLUSIONS
============
Indeed, spreadsheets should have functions that assign data to
*categories*. DISTINCT() was supposed to do so. These categories would
then behave like the factors described above. Pivot Tables (aka Data
Tables) currently do similar things, but I wanted something more
advanced, and I wanted it as a function.

Hope this explanation clarifies some of the issues.

Sincerely,

Leonard

> I don't know whether that would actually change the result of a 
> function or not but it is an example of the level of detail that is 
> necessary to consider when defining a function in a standard.
>
> I suspect it would be possible to define "as factor()" such that it 
> had a standardized result and if someone allowed used based on 
> implementation details they would be non-conformant. I say that not 
> having looked at the details. And by details I do not mean use or test 
> cases but a formal definition of the function.
>
> I know the formula SC has a number of functions that still need some 
> work so maybe we need a rule that welcomes new function proposals but 
> grants priority to requests accompanied by work on functions already 
> accepted for standardization.
>
> David, what say you?
>
> Hope you are having a great weekend!
>
> Patrick

References:
- DISTINCT Values
  - From: Leonard Mada <discoleo@gmx.net>
- Re: [office-comment] DISTINCT Values
  - From: Patrick Durusau <patrick@durusau.net>
- Re: [office-comment] DISTINCT Values
  - From: Leonard Mada <discoleo@gmx.net>
- Re: [office-comment] DISTINCT Values
  - From: Patrick Durusau <patrick@durusau.net>