office-formula message

Subject: NUMBERVALUE and VALUE (was: [office-formula] Groups - OpenFormula Specification 2007-03-22-wheeler (ODT) (openformula-20070322-wheeler.odt) uploaded)

From: Eike Rathke <erack@sun.com>
To: office-formula@lists.oasis-open.org
Date: Thu, 29 Mar 2007 00:07:25 +0200

Hi David,

Sorry for the delay, but Monday I had a day off and yesterday I was busy
with other things. I also wanted to send this mail out earlier today but
needed some time to make up my mind; see below.

On Friday, 2007-03-23 22:38:11 +0000, David A. Wheeler wrote:

> Okay folks - to make ideas more concrete, I've posted another modified
> version of the document.  Consider this a "non-official" version, feel free
> to continue working off the March 22 release.

I'll take your modified version as a basis to work on. Browsing through
the changes, I think we should accept most, but there are some details
I'm not content with.

> VALUE is now clearly locale-dependent, and there are specific requirements
> for locale en_US (so that we can test it!).

It now says that "commas are ignored", which isn't quite correct. The
group separator should be "ignored" only if it is used according to the
locale's rules, which for most locales makes it equal to a thousands
separator. A string "1,2,3" usually does not convert to the numeric
value 123, but generates an error instead. And of course 1,.2 is invalid
as well. Therefore the regexp isn't correct either:

[+|-]? \$?((\.[0-9]+)|([0-9,]+(\.[0-9]+)?([eE][+-]?[0-9]+)?))%?
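As a quick illustration, here is a minimal Python sketch that runs the
draft pattern against those strings. The literal blank after the sign is
dropped on the assumption that it was meant as insignificant whitespace,
and the helper name draft_accepts is made up for this example.

```python
import re

# Draft pattern as quoted above. Assumption: the blank after the sign is
# insignificant whitespace, so it is removed here.
DRAFT = re.compile(
    r"[+|-]?\$?((\.[0-9]+)|([0-9,]+(\.[0-9]+)?([eE][+-]?[0-9]+)?))%?"
)

def draft_accepts(text):
    """Return True if the whole string matches the draft pattern."""
    return DRAFT.fullmatch(text) is not None

# "[0-9,]+" treats commas like digits, so all of these are accepted.
for sample in ["1,234.5", "1,2,3", "1,.2", ",,,"]:
    print(repr(sample), draft_accepts(sample))
```

With Python's re module all four samples are accepted, including the
ones that should produce an error.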

Note also that blanks between sign and currency symbol are optional, as
they are between the currency symbol and the digits. Some applications
do not accept the sign before '$', some do not accept blanks in between.
Since both the integer part and the fractional part with a leading
decimal separator can appear on their own, some simplification is
possible there. I also doubt that applications should parse a percent
sign after an exponential value. I think that for the minimal
requirement the regexp should be

[+|-]?\$?([0-9]+(,[0-9]{3})*)?(\.[0-9]+)?(([eE][+-]?[0-9]+)|%)?

if I didn't make a mistake, could someone verify? However, that would
still not catch the cases where a separator was inserted every 6th
digit..
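One way to check the proposed pattern is a small Python sketch like the
following. The helper name proposed_accepts and the sample strings are
made up for illustration; the pattern itself is copied unchanged from
above.

```python
import re

# Proposed minimal pattern for VALUE in locale en_US, copied verbatim.
PROPOSED = re.compile(
    r"[+|-]?\$?([0-9]+(,[0-9]{3})*)?(\.[0-9]+)?(([eE][+-]?[0-9]+)|%)?"
)

def proposed_accepts(text):
    """Return True if the whole string matches the proposed pattern."""
    return PROPOSED.fullmatch(text) is not None

samples = [
    "1,234.5", "-$1,234,567.89", ".5", "12%", "1.5e3",  # intended inputs
    "1,2,3", "1,.2", "1.5e3%",                          # should be errors
    "1234567,890", "", "|5",                            # edge cases
]
for sample in samples:
    print(repr(sample), proposed_accepts(sample))
```

Under Python's re module this rejects 1,2,3 and 1,.2 as well as a
percent sign after an exponent, but it still accepts an over-long
leading group such as 1234567,890. Because every part is optional it
also accepts the empty string, and the character class [+|-] accepts
a literal '|' as a sign.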

> I've also added NUMBERVALUE,
> which takes two parameters (the text to convert, and the character to use
> as the decimal point)... that won't handle ALL locales, but it'll handle a
> whole lot without needing to deal with "all possible locales".

Here I don't see why
| Regardless of the current locale, the implementation shall accept text
| representations that match this regular expression when DecimalPoint is
| “.” (a period):

[+|-]? \$?((\.[0-9]+)|([0-9,]+(\.[0-9]+)?([eE][+-]?[0-9]+)?))%?

this should parse a '$' currency symbol if the decimal separator is
a period. I also don't see why it should parse a comma group separator
but not others like a blank or apostrophe, or why it would not parse
group separators at all if the decimal separator is not a period.

Instead, I propose to leave out currency symbols and introduce a third
parameter that specifies the group separator to be used, defaulting to
a comma if the decimal separator is a period, and to a period if the
decimal separator is a comma. Here the group separator could indeed
be ignored instead of requiring a specific grouping, so the regexp could
read

[+|-]?([0-9]+(,[0-9])*)?(\.[0-9]+)?(([eE][+-]?[0-9]+)|%)?
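The three-parameter idea can be sketched in Python roughly as follows.
This is only an illustration under stated assumptions: the function and
parameter names are hypothetical, the fallback group separator for
decimal separators other than a period or a comma is a guess, and the
sketch allows one or more digits after each group separator, whereas the
pattern as written above allows exactly one.

```python
import re

def numbervalue(text, decimal_sep=".", group_sep=None):
    """Hypothetical sketch of a three-parameter NUMBERVALUE."""
    if group_sep is None:
        # Defaults as proposed: period -> comma, comma -> period.
        # Assumption: any other decimal separator also falls back to a comma.
        group_sep = "." if decimal_sep == "," else ","
    if group_sep == decimal_sep:
        raise ValueError("group and decimal separator must differ")
    g, d = re.escape(group_sep), re.escape(decimal_sep)
    # Sign, integer part with ignorable group separators, fraction,
    # then either an exponent or a percent sign. No currency symbols.
    pattern = (rf"([+-]?)((?:[0-9]+(?:{g}[0-9]+)*)?)(?:{d}([0-9]+))?"
               rf"(?:([eE][+-]?[0-9]+)|(%))?")
    m = re.fullmatch(pattern, text)
    if m is None or not (m.group(2) or m.group(3)):
        raise ValueError(f"not a number: {text!r}")
    sign, integer, fraction, exponent, percent = m.groups()
    # Group separators are dropped without checking the group sizes.
    digits = integer.replace(group_sep, "") or "0"
    value = float(f"{sign}{digits}.{fraction or '0'}{exponent or ''}")
    return value / 100 if percent else value

print(numbervalue("1.234,5", ","))       # decimal comma, default group "."
print(numbervalue("1'234.5", ".", "'"))  # explicit apostrophe grouping
print(numbervalue("12,5%", ","))         # percent sign divides by 100
```

In this sketch a string such as "1,2,3" converts to 123 because the
grouping is not validated, while "$5" and "1,.2" raise an error.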

I also don't think that having
| If the provided text does not match the pattern, an implementation must
| at least accept the same formats as VALUE does, and should accept the
| given DecimalPoint where appropriate (e.g., HH:MM:SS.sss or HH:MM:SS,sss
| depending on the DecimalPoint value).

is actually a good idea, for two reasons.

First, if we define NUMBERVALUE it should serve one specific purpose:
parse numbers. Not dates, not times, or anything else.

Second, if NUMBERVALUE was used with a specific separator, the user did
that on purpose; otherwise he could have used the locale-dependent VALUE.
If NUMBERVALUE couldn't parse a string, falling back to VALUE with the
decimal separator exchanged most certainly would not deliver the result
intended, even if it did parse _something_.

  Eike

-- 
Automatic string conversions considered dangerous. They are the GOTO statements
of spreadsheets.  --Robert Weir on the OpenDocument formula subcommittee's list.

References:
- Groups - OpenFormula Specification 2007-03-22-wheeler (ODT) (openformula-20070322-wheeler.odt) uploaded
  - From: david.wheeler@OpenDocument.us