office-comment message

Subject: Re: Data Grid Size element proposal
From: Leonard Mada <discoleo@gmx.net>
To: office-comment@lists.oasis-open.org, "Andreas J. Guelzow" <aguelzow@math.concordia.ab.ca>
Date: Sat, 29 Nov 2008 03:09:25 +0200
Dear Andreas,

I will follow 2 approaches to explain this again and add an insight into 
the future:

1.) Reductio ad absurdum
If the application knows how big the spreadsheets is, then references 
outside this range are either errors or wrap arounds for names (due to 
poor design both for the spreadsheet and current ODF - but still can be 
handled because we know the size of the spreadsheet). Thus I do not need 
the MAX_GRID_SIZE.
qed

2.) But even considering the poor ODF-design, the spreadsheet size is a 
far better parameter where to wrap around: obviously at the boundaries 
of the spreadsheet, not at an abstract MAX_GRID_SIZE. And I add my 
suggestion to raise an error if cells get referenced outside this range. 
A significant number of errors would be caught. This is an option that 
actually structures the spreadsheet.

[Of course, it becomes necessary to define the actual spreadsheet size. 
Simply put: A1: up to the column and the row that contain some data or 
some formula, or an array formula extending up to there. Strictly 
speaking, an image outside this area is NOT extending the spreadsheet up 
to there. Neither does a reference to an empty cell outside this area 
extend the spreadsheet.]

3.) I do agree that more often than not, the application can determine 
the size of the spreadsheet. But I would not bet that this is well 
defined in every case. It is able to compute this size precisely and 
unmistakenly, only IF the spreadsheet was created in the same session 
and formulas were added sequentially. [In the above example, the 
spreadsheet size would go up to column IV, and therefore it would wrap 
around to cover IU1 and IV1, without any need to MAX_GRID_SIZE. If IV1 
was undefined, than most certainly that would be an error, and I want it 
to be reported as such, not silently wrap around at an application 
defined limit. How often did anyone on this list need exactly 256 
columns? I did never use or see a spreadsheet with exactly 256 columns! 
This reminds me of another issue.]

This specified size becomes more relevant with array calculations, and I 
hope that spreadsheets move that way. Also, as of currently ODF does NOT 
support whole column / rows references - but this may well change in the 
future.

Lets say, we write the following array formula inside cell A1:
{ = SOME_ARRAY_FUNCTION(B:B, C:C) }

where SOME_ARRAY_FUNCTION will compute an array of same size as its input.

If we had first the data within columns B and C, then that function 
would cover only as much data is in B and C, and would NOT make any 
sense beyond that range. However, IF a user opens this spreadsheet, it 
becomes almost impossible to determine the true size of the spreadsheet 
(while in this trivial example it may be trivial, a fairly complex 
spreadsheet is unlikely to yield a correct size). This is the reason why 
the true size of the spreadsheet needs to be tracked. [As I said, I hope 
to see more array functionality in the future, and tracking the 
spreadsheet size won't be trivial any more.]

I still could create spreadsheets without having actually any data written.
Lets say A and B are empty, and we write in column C: { = 
SOME_ARRAY_FUNCTION(A:A; B:B) }

Obviously, the spreadsheet size is 1 row and 3 columns, because the 
range is delimited by cell C1. Writing the next formula will also give a 
completely defined state. If we wrote something within columns A or B, 
then the rows would be extended up to that point. If we inputted some 
data outside of columns A, B and C, then the spreadsheet would extend up 
to that cell. Therefore it is perfectly defined. Every new input will 
redefine the size of the spreadsheet, and the program can easily track 
this size.

However, if the completed spreadsheet is re-opened, the size might be 
quite ambiguous and non-determinable (because we do not have the 
sequence the formulas were added).

This is relevant especially in the following high-end scenario: a 
professional spreadsheet designer creates the logic for a new 
spreadsheet, without having actually any data written in the 
spreadsheet. He will use array formulas that will automatically adapt 
for every new data entry. Array formulas are used so that the end-users 
will just enter the data. They will never ever have to update formulas. 
[I hope to see this one day.]

When the end user opens the final product, it might be very ambiguous 
what the size of the spreadsheet is. There are a lot of array formulas 
referencing each other, and every formula has an (as of yet) undefined size.

Also, if opening 1 million rows brings a severe penalty, I would want 
that my application allocates only as much columns/rows as really 
needed. But I want it to be still able to open the relatively rare case 
with 1 million rows without iterating twice through the document.

I hope this sheds more light on this issue.

Sincerely,

Leonard

P.S. some rants are inline


Andreas J. Guelzow wrote:
> On 11/28/2008, "Leonard Mada" <discoleo@gmx.net> wrote:
>
>   
>> Dear TC Members,
>>
>> while reading the discussion, I have to express my dismay to the
>> proposal and more broadly to the various spreadsheet concepts. It seems
>> little has been learned from past mistakes.
>>
>> Warren Turkal wrote:
>> [...]
>>     
>>> This doesn't seem useful to me. If I never address anything in those
>>> cells that are outside my range, I'd say it's pretty likely I don't
>>> care about them, and just knowing the max row/col size of the saving
>>> application doesn't tell you that.
>>>       
>
>
> That's not simply not true: If I know the grid size of teh saving
> application I can determine whether ever any cell outside the grid was
> accessed or the "wrapping" feature was used.
>
> If I don't know the grid size I cannot do that.
>   

Well, of course you can. If you know the size of the spreadsheet, than 
you can't access anything beyond those boundaries. [Even when accepting 
the flawed wrap-around concept. Names referencing something outside this 
range would be wrapped - but I hope this gets deprecated. But even in 
this case, spreadsheet size is a more robust way to handle wrapping.]

>> [see http://lists.oasis-open.org/archives/office/200811/msg00138.html]
>>
>> I must strongly back up Warren on this one.
>>
>> Lets say it this way: suppose I have a spreadsheet program WITHOUT any
>> grid-size limit (because it implements a very clever iterative mechanism
>> - making it virtually infinitely wide). I create a 2 columns by 2 rows
>> spreadsheet. BUT this one is virtually non-openable in any other
>> program! 
>>     
>
> Why would it not be openable?
>   

Well, because GRID_SIZE = Infinity, and I wish to quote one of your posts:
> Of course we could tell the user that "hey there are formulas that
> appear to require 319 columns and 1231124 rows" after we have parsed the
> whole file and all of its formulas. I believe it would be desirable to
> tell the user that the sheet was created in a program with 400 columns
> and 2000000 rows so that there might be references or content outside
> the range before we start reading such a big file in its entirety.
Therefore, I believe that every new Excel file will raise this message. 
Considering the widespread use of Excel, this is a bad omen for the 
alternative spreadsheets, at least from a UX-perspective. Users hate 
such annoying messages quite a lot.

And IF you still open the file, then WHY the message in the first place? 
You could post it after opening the file anyway. Again, actual 
spreadsheet size is a far better approach: it gives the actual size, and 
so the application HAS to support this size in order to open the 
spreadsheet.

>> This is again wrong logic. The application does NOT want to know the
>> MAX_GRID_SIZE, BUT the *actual spreadsheet size*!
>>     
>
> The application can easily figure out the actual spread sheet size. It
> cannot figure out the maximum size which may affect the evaluation of
> formulas.
>   

Do you often use memory references past the allocated memory when 
writing code for gnumeric?
I am just wondering: is it a good practice?

Should we follow this practice when coding spreadsheets? What is your 
recommendation?

I hope the questions can be answered by anyone on this list. As far as I 
know, everyone on this list has at least some coding experience.

For a more constructive discussion on spreadsheet size, see the top of 
this post. I presented evidence that formulas should be evaluated only 
within the defined spreadsheet size. Anything referencing a cell outside 
this size should raise an error. Wrapping (while bad), would be 
excellently covered by spreadsheet size, as it defines a *sensible* 
wrap-region, not an arbitrary one. And I gave evidence that spreadsheet 
size might not be trivial to determine, especially in the future when 
using array formulas.

>> So IF the TC is clever it will follow Apple's approach, and limit the
>> spreadsheet size to its actual size, and not limit the application.
>>     
>
> Nobody is suggesting to limit any application. Where do you get that idea?
>
>  Andreas Guelzow
>