office-comment message

Subject: Re: [office-comment] Extending the PERCENTILE() Function


robert_weir@us.ibm.com wrote:
> Leonard Mada <discoleo@gmx.net> wrote on 06/03/2008 04:49:44 PM:
>
>   
>> Dear list members,
>>
>> I propose the following extension to the percentile function:
>>
>> Syntax: PERCENTILE( NumberSequence Data ; Number x )
>> Syntax: PERCENTILE( NumberSequence Data ; NumberSequence x )
>>
>> where x:
>>     is a number between 0 and 1, or
>>     a sequence of numbers, all between 0 and 1.
>>
>>     
>
> Hi Leonard,  just so I'm clear, the extension is the 2nd form of 
> PERCENTILE you list, right?  It would presumably be a vector function.
>   

Yes, it is the 2nd form, and it is a vector (array) function. I forgot to 
say that. [Well, it was quite late in the evening, 03 Jun 2008 23:49:44 
+0300; my apologies for the omission.]
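To make the intended semantics concrete, here is a minimal sketch in Python (the function name, argument handling, and linear-interpolation rule are my assumptions for illustration, not the spec's wording):

```python
# Sketch of the proposed vector form of PERCENTILE (hypothetical;
# the interpolation rule is an assumption, not the ODF spec text).
def percentile(data, x):
    """PERCENTILE(Data; x) where x may be a number or a sequence of
    numbers, each between 0 and 1 (linear interpolation)."""
    s = sorted(data)                       # sort the data once
    def one(p):
        if not 0 <= p <= 1:
            raise ValueError("x must be between 0 and 1")
        k = p * (len(s) - 1)               # fractional rank
        lo = int(k)
        hi = min(lo + 1, len(s) - 1)
        return s[lo] + (k - lo) * (s[hi] - s[lo])
    if isinstance(x, (int, float)):
        return one(x)                      # scalar form
    return [one(p) for p in x]             # vector (array) form
```

The point of the vector form is visible in the sketch: the sort happens once, outside the per-percentile helper, no matter how many values of x are requested.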

> An unrelated question for you, since you've obviously spent some time 
> thinking of spreadsheet performance -- Now that we see increased numbers 
> of multi-core processors on desktop machines, what do you estimate would 
> be the best way of taking advantage of this?
>   

The major problem of existing spreadsheets is that they allow spaghetti 
code: one cell references another, which references another, and so on, 
so you end up with complex dependencies (sometimes even circular 
references; I'll come back to this in a later e-mail) that are not easy 
to parallelize. At the very least, the burden on the programmer would be 
quite high.
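A small sketch of one reason such webs are awkward: before any parallel scheduling, the engine must at least detect circular references. This is hypothetical Python with invented cell names, not any spreadsheet's actual implementation:

```python
# Detect circular references in a cell-dependency graph via DFS.
# `deps` maps each cell to the cells it references (names invented).
def find_cycle(deps):
    WHITE, GREY, BLACK = 0, 1, 2           # unvisited / in progress / done
    color = {c: WHITE for c in deps}
    def visit(cell):
        color[cell] = GREY
        for ref in deps.get(cell, ()):
            if color.get(ref, WHITE) == GREY:
                return True                # back-edge: circular reference
            if color.get(ref, WHITE) == WHITE and visit(ref):
                return True
        color[cell] = BLACK
        return False
    return any(color[c] == WHITE and visit(c) for c in deps)
```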

> It seems that in some cases you'll have a spreadsheet with 1000's of 
> formulas, but none of them especially slower than the others.  In that 
> case, a parallelizable calculation chain will split the load into two 
> threads.  But in some cases the calculation time might be dominated by 
> just one or two cells.  In that case, a parallel implementation of a 
> particular function would give the greatest benefit.
>   

In those cases, this is true. But a generic solution is not easy, 
because users may combine references in the weirdest ways, and the 
penalty may come from numerous unnecessary recalculations rather than 
from the slowest function. In my example, the cost is probably dominated 
by the 6.6 million unnecessary sort operations, or by 101 sort 
operations (depending on the algorithm used), where a single sort would 
have sufficed. Some other factors are probably at work, too.

The advantage of one function instead of dozens doing, combined, the 
same job is probably less obvious. I will explain:
 - IF a cell gets modified, it broadcasts a message to the cells
   that depend on it (that reference it), telling them to recalculate

 - let's move now to such a cell:
   -- a good implementation will first check IF this cell depends
      on other cells that have yet to be recalculated
   -- IF there are such cells, it will wait for those cells to be
      recalculated first, and only then recalculate itself

 - using one compound function instead of multiple sub-functions has
   a clear advantage:
   -- it waits for all the calculations it depends on to finish
   -- and updates only afterwards

 - with multiple sub-functions, some will recalculate early, but
   later on they will need to recalculate again
   [we cannot predict IF the cells the function depends on will be
    modified later by one of the other recalculations]
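The waiting behaviour described above amounts to recalculating cells in topological order of the dependency graph, so that each cell is recalculated exactly once. A sketch using Kahn's algorithm (hypothetical Python, invented cell names, acyclic graph assumed):

```python
from collections import deque

# Kahn's algorithm: recalculate each cell exactly once, only after
# every cell it depends on has already been recalculated.
def recalc_order(deps):
    """`deps` maps each cell to the cells it references;
    every referenced cell must also appear as a key."""
    indeg = {c: 0 for c in deps}           # unmet dependencies per cell
    dependents = {c: [] for c in deps}     # reverse edges
    for cell, refs in deps.items():
        for ref in refs:
            dependents[ref].append(cell)
            indeg[cell] += 1
    ready = deque(c for c, d in indeg.items() if d == 0)
    order = []
    while ready:
        cell = ready.popleft()
        order.append(cell)                 # safe to recalculate now
        for dep in dependents[cell]:
            indeg[dep] -= 1
            if indeg[dep] == 0:            # last dependency just finished
                ready.append(dep)
    return order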

With a parallel design you will most likely end up with cells waiting 
for other cells to be recalculated, and/or various cells being 
recalculated multiple times (instead of once). Some gain can still be 
obtained, but for a consistent gain it is much better to address some 
spreadsheet design issues first.

[Just as a side note: MEDIAN and PERCENTILE belong to the group of 
slower functions. The LOOKUP and OFFSET functions might belong to this 
group, too. However, it is probably not a single function that causes 
the trouble, but an update that triggers recalculating various cells 
dozens, thousands, or millions of times.]

> In other words, I wonder whether there is a significant boost in 
> calculation performance we can get by just using the current generation of 
> hardware better.
>   

One kind of hardware comes to mind that would perform significantly 
better: vector processors have significant potential, but to my 
knowledge the chances of getting one into a desktop system are fairly 
dim. Again, a structured spreadsheet is likely to compute faster than a 
free-style generated one.

Sincerely,

Leonard

> (and I wonder if there are some spreadsheet functions that can be 
> off-loaded to a graphics card...)
>   

Calculations done by the graphics card are, as far as I know, 
inaccurate (definitely less accurate than needed in spreadsheets). If 
you need accurate calculations, then you end up with a hyper-expensive 
graphics card intended for CAD.

> -Rob
>   


