office-comment message

Subject: [office-formula] SUMIF and behavior of empty cells in reference ranges

From: Steven Powell <Steve.Powell@springsource.com>
To: <office-comment@lists.oasis-open.org>
Date: Tue, 6 Jan 2009 16:27:41 +0000

Hi,

I have stumbled across the strange treatment of empty cells in the SUMIF function in Calc; and was referred to this document for the spec of this behavior. Reading the specification for SUMIF I think it is vague to the point of un-helpfulness.

-------------8<--------------------

6.15.61 SUMIF

Summary: Sum the values of cells in a range that meet a criteria.

Syntax: SUMIF( ReferenceList|Reference R ; Any C [ ; Reference S ] )

Returns: Number

Constraints: Does not accept constant values as the range parameter.

Semantics: Sums the values of type Number in the range R or S that meet the criteria C.

If S is not given, R may be a reference list. If S is given, R must not be a reference list with more than 1 references and an error be generated if it was.

The criteria C may be a number, an expression, or a text string. A text string starting with a comparison operator is interpreted as a comparison criteria.

If the optional range S is included, then the values of S starting from the top left cell and matching the geometry of R (same number of rows and columns) are summed if the corresponding value in R meets the criteria. The actual range S is not considered. If the resulting range exceeds the sheet bounds, column numbers larger than the maximum column and row numbers larger than the maximum row are silently ignored, no error is generated for this case.

-------------8<--------------------

SUMIF appears to be a special case compared to, say COUNTIF, since it has a parameter form that allows the criterion to be applied to different cells to those summed. This is the form when the first parameter is not a reference list, and the third parameter S is supplied. This means that commonly used phrases like:

A reference to an empty cell is interpreted as the numeric value 0.

(which occurs in SUMIFS following but, curiously, not in SUMIF) are actually ambiguous, since it doesn't say how the criteria range R is treated (if at all) differently from the Sum range S (the third parameter, if present). Moreover, suitable treatments of empty cells (or text cells) in range S might be to omit or not consider them -- but what happens if the criteria range R contains empty cells, and the criterion is a comparison. The specification is silent on this matter.

It is clear that the range shapes agree (or are 'made to' agree see above) so what needs to be considered are the following possibilities for corresponding cells in R and S and criterion C:

R-cell: empty | text-type | number-type

with

S-cell: empty | text-type | number-type

with

criterion C: text comparison | number-comparison

[Incidentally, the definition of "criteria" [4.9.8] is really poor -- it defines criteria variously and simultaneously as a set (of selectors), a rectangular set (of values) and one of a number, a logical value, a text value, or 'other text value' with no sense of irony or confusion. It badly needs rewriting.]

18 possibilities in all.

We can simplify things by stating (clearly, up-front) that empty cells referenced in a formula are treated as 0 or "" (the empty string) whenever, respectively, a number value or a text value is expected -- this appears to be the case in cell-formulae happily enough --

AND

the criterion is applied (to the R-cell) and evaluated as though the R-cell is referenced in a formula derived from the criterion.

For example: the criterion ">42" is treated as formula 'R1>42' (where, here, R1 is a cell reference from the range R) and if empty R1 is treated as 0 (resulting in FALSE). This is then consistent with a result obtained from an array of cells populated by conditional values built with criteria formed from C (which is what ought to guide the definition of array functions like SUMIF).

As another example, and more interesting, since it is a case that does >not< behave correctly in CALC at the moment: the criterion "<>y" should be treated as the formula 'R1<>"y"' and we would obtain the value TRUE if R1 referred to an empty cell (because it is a text comparison and so the value "" would be assumed which is, indeed, NOT equal to "y"). [At the moment, in Calc, such a cell is ignored and the corresponding cell in S is >not< added into the result.]

The only strange cases are the cases where the S range contains an empty cell (but the criterion applied to the corresponding R-cell evaluates to TRUE), in which case we can define it as follows: an empty S cell is treated as 0 (which is to say they (should) have no effect on the final value of the sum) and a text-value S cell is ignored (or possibly converted to a number if we can perform such a conversion -- I assume that this follows the normal rules -- see SUM()) and a number-value S cell is added as usual.

Notice that, with this explanation, the omission of parameter S can be explained by simply saying that R = S in the description above; the reference list case is defined by separately summing over each range in the list, and the results added.

I welcome comments. The basic principle I wish to raise here is one of consistency between functions and cell-formulae evaluations. I think this sort of consistency benefits the user and the implementer alike.

Can I help to improve this definition or otherwise to contribute to this document?

Regards,

Steve Powell

Follow-Ups:
- Re: [office-comment] [office-formula] SUMIF and behavior of empty cells in reference ranges
  - From: "David King" <openoffice@drking.plus.com>