office-formula message

Subject: Openformula JIRA processing meeting today, 2009-09-01
From: "David A. Wheeler" <dwheeler@dwheeler.com>
To: office-formula@lists.oasis-open.org
Date: Tue, 01 Sep 2009 11:26:19 -0400 (EDT)
We just finished today's (2009-09-01) OpenFormula JIRA processing meeting.  These are my rough notes of the meeting; please "reply" with any important corrections.

Attendees:
David A. Wheeler
Patrick D.
Eike Rathke
Eric Patterson
Patrick Strader
Rob Weir
Andreas Guelzow
Dennis Hamilton

Agenda:
* JIRA status (Rob Weir)
  - The JIRA views had been inaccurate.  A software update stopped the indexes from being updated (they stayed constant), although the underlying database was being updated correctly, so some views were showing old data.  This is now fixed.
  - We still don't have editing rights.  Rob will try once again to get this resolved.  Wheeler said that if we can't get this resolved by the end of the week, we will just declare special comments that will (for us) have the same effect.  That's not as efficient as a database query, but it will work.
* Discussion on run-time character processing: This took almost the entire meeting; see below.
* (No time to discuss other issues)
* Unassigned items: Wheeler asked everyone to assign to themselves today (on JIRA) the comments they'd volunteered to process.  After today, Rob Weir will go back to his notes and fix the database by assigning people the comments they'd agreed to take, in case they'd forgotten one.
* ISO numeric representation.  ISO requires that pure numbers use "," as the decimal point and a space to separate groups of three digits, and that central dots not be used for multiplication.  We'll need to make that change.
* Meeting ended; Wheeler reminded everyone that the next meeting will be 2009-09-15.
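
As a rough illustration of the ISO number style noted in the agenda above (comma as decimal point, space between groups of three digits), here is a minimal Python sketch.  The function name format_iso_number is made up for this example, and it only groups the integer part; it is not taken from any ISO or OpenFormula text.

```python
def format_iso_number(value: float, decimals: int) -> str:
    """Hypothetical helper: "," as decimal point, space between digit trios."""
    # Start from the familiar "1,234,567.891" form ...
    grouped = f"{value:,.{decimals}f}"
    int_part, _, frac_part = grouped.partition(".")
    # ... then swap the separators. Only the integer part is grouped here.
    int_part = int_part.replace(",", " ")
    return int_part + ("," + frac_part if frac_part else "")


print(format_iso_number(1234567.891, 3))  # 1 234 567,891
```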


RUN-TIME CHARACTER PROCESSING

Most of the meeting was spent discussing run-time character processing.  This impacts the text functions (both the *B and non-*B functions), especially CODE and CHAR.  I (Wheeler) tried to write down the key points and who made them, but what follows is only a summary.

Eric P. had taken on the task of finding out what Excel does, and reported his findings.  Excel's RIGHTB processes text at the byte level, then converts the result back to characters.  It processes the IN-MEMORY representation, not the file representation.  Eric believes Excel is using UTF-8, but needs to check.  Eric modified OFFICE-1895 to try to clarify this.
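
To make "processes at the byte level, then converts back to characters" concrete, here is a minimal Python sketch.  The function rightb and the choice of encodings are assumptions for illustration only; which in-memory encoding Excel really uses was still unconfirmed at the meeting.

```python
def rightb(text: str, num_bytes: int, encoding: str) -> str:
    """Hypothetical RIGHTB: keep the last num_bytes bytes of the encoded text."""
    raw = text.encode(encoding)                    # stand-in for the in-memory bytes
    tail = raw[-num_bytes:] if num_bytes > 0 else b""
    # Convert back to characters; a broken sequence becomes U+FFFD here.
    return tail.decode(encoding, errors="replace")


# The same byte count yields different text under different representations.
print(rightb("naïve", 4, "utf-8"))      # ïve  (the "ï" takes two bytes)
print(rightb("naïve", 4, "utf-16-le"))  # ve   (every character takes two bytes)
```

The point of the sketch is only that the result depends on the run-time representation, not on how the file was stored.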

Weir: There are many different kinds of characters, e.g.:
 Unicode character
 Unicode encoded string (e.g., UTF-8, UTF-16)
 XML string (encoding of this)
 Numeric entities
 "Run-time string"
 Need to clarify what we are talking about.
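
The different notions of "character" that Weir lists can be seen side by side for a single letter.  This is just an illustrative Python sketch using the standard library; the variable names are made up here.

```python
ch = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE

code_point = ord(ch)                     # the Unicode character as a number
utf8_bytes = ch.encode("utf-8")          # one Unicode encoded string
utf16_bytes = ch.encode("utf-16-le")     # another Unicode encoded string
# A numeric entity, as it could appear in an ASCII-only XML file.
numeric_entity = ch.encode("ascii", errors="xmlcharrefreplace").decode("ascii")

print(hex(code_point))   # 0xe9
print(utf8_bytes)        # b'\xc3\xa9'
print(utf16_bytes)       # b'\xe9\x00'
print(numeric_entity)    # &#233;
```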

Eike: CHAR and CODE are very dependent on the system that generated the data, and are legacy functions that go back to the codepage era. Don't mix RIGHTB and RIGHT with them.

There was a discussion about what to do with the *B functions if the byte position is invalid.
Dennis: RIGHTB etc. need to be VERY clear that these are BYTE positions of the current implementation representation (which is implementation-defined).
Wheeler: If they're invalid characters, the result is implementation-defined.
Eric: Where the conversion is invalid, convert to either Error or some valid Text.
Weir: Just declare as implementation-defined, there are many mechanisms.
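
The two outcomes Eric mentions (an Error, or some valid Text) can be sketched in Python as follows.  The names rightb_or_error, rightb_or_text and the ERROR marker are hypothetical, and UTF-8 is only an assumed run-time representation; an implementation is free to do something else entirely.

```python
ERROR = "#ERROR"  # hypothetical stand-in for a spreadsheet Error value


def _tail(text: str, num_bytes: int, encoding: str) -> bytes:
    raw = text.encode(encoding)
    return raw[-num_bytes:] if num_bytes > 0 else b""


def rightb_or_error(text: str, num_bytes: int, encoding: str = "utf-8") -> str:
    """Return ERROR when the byte position splits a character."""
    try:
        return _tail(text, num_bytes, encoding).decode(encoding)
    except UnicodeDecodeError:
        return ERROR


def rightb_or_text(text: str, num_bytes: int, encoding: str = "utf-8") -> str:
    """Return some valid Text: broken bytes become U+FFFD."""
    return _tail(text, num_bytes, encoding).decode(encoding, errors="replace")


# Three bytes from the right of "naïve" cut the two-byte "ï" in half.
print(rightb_or_error("naïve", 3))        # #ERROR
print(repr(rightb_or_text("naïve", 3)))   # '\ufffdve'
```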

Dennis and Eike: "Do everything in terms of Unicode codepoints, and NOT characters".  Wheeler: Text is a sequence of Unicode codepoints. Then combining characters, etc., are dealt with reasonably.
Wheeler: For "B" functions, we expose the byte positions... we need to specify a few axioms with LEFTB and FINDB.
?: The "B" functions leak implementation details.  We just need to admit that.
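
A small Python sketch of "Text is a sequence of Unicode codepoints" (Python strings happen to be indexed by code point, which makes this easy to show).  The helper name left is an assumption for illustration, not a definition from the spec.

```python
def left(text: str, count: int) -> str:
    """Hypothetical LEFT that counts Unicode code points."""
    return text[:count]


precomposed = "\u00e9"   # one code point: e with acute
decomposed = "e\u0301"   # two code points: e + COMBINING ACUTE ACCENT

print(len(precomposed), len(decomposed))   # 1 2
print([hex(ord(c)) for c in decomposed])   # ['0x65', '0x301']

# Counting code points means LEFT can separate a base letter from its accent.
print(left(decomposed + "t", 1) == "e")    # True
```

Working in code points gives a well-defined count even though what a user sees as one "character" may span several of them.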

Eike: When there were codepages, etc., the codepage depended on the system.  When imported, it basically loses its meaning... you'd have to know where it originated from to repeat the same results.
Weir: CHAR is *intended* to find the implementation representation.
Eric: In Excel, we document that the CHAR function returns the character based on Latin-1 (ISO 8859-1) on Windows, and on Mac it uses the Mac character set.  Thus, it is platform-dependent.
Eike: He has a Russian file in which CODE and CHAR produce completely different results.  He's fine if it returns Latin-1 everywhere.
Weir: We could say, implementation-defined, but 0-127 must be ASCII.
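
The platform dependence of CHAR and CODE, and why "0-127 must be ASCII" is a safe common floor, can be illustrated with Python's codecs.  The functions char_fn and code_fn and the three code pages chosen are assumptions for this sketch; they do not describe what any particular spreadsheet does.

```python
CODE_PAGES = ["latin-1", "mac_roman", "cp1251"]  # example legacy code pages


def char_fn(number: int, code_page: str) -> str:
    """Hypothetical CHAR: interpret one byte value in the given code page."""
    return bytes([number]).decode(code_page)


def code_fn(text: str, code_page: str) -> int:
    """Hypothetical CODE: byte value of the first character in the code page."""
    return text[0].encode(code_page)[0]


# The same number maps to a different character on each "system".
for page in CODE_PAGES:
    print(page, repr(char_fn(0xE0, page)))

# Values 0-127 agree with ASCII in all three code pages.
ascii_agrees = all(
    char_fn(n, page) == chr(n) for page in CODE_PAGES for n in range(128)
)
print(ascii_agrees)  # True
```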

Wheeler: Should we use the term "Code point" everywhere?
Weir: What if one system uses combining characters, and others don't?   What does "=" do?  Is there normalization?
Weir: Might make a blanket statement that equivalent strings produce equivalent results, either by normalization or by "=".
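
Weir's question about "=" can be shown with Python's unicodedata module.  This sketch assumes NFC as the normalization form and invents the helper name equal_normalized; the meeting did not decide whether or how "=" should normalize.

```python
import unicodedata


def equal_normalized(a: str, b: str, form: str = "NFC") -> bool:
    """Hypothetical "=": compare two strings after Unicode normalization."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


precomposed = "caf\u00e9"    # é as a single code point
decomposed = "cafe\u0301"    # e followed by COMBINING ACUTE ACCENT

print(precomposed == decomposed)                  # False: code points differ
print(equal_normalized(precomposed, decomposed))  # True: equivalent strings
```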
?: We should work towards having highly predictable results.
Weir: We should look at XSLT.

Spec decisions:
The "B" functions continue, as byte positions, but their details are implementation-defined.
CODE/CHAR: 0-127 must be ASCII, else implementation-defined.
UNICODE/UNICHAR use Unicode, always, on code points.
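
For the UNICODE/UNICHAR decision, a minimal Python sketch of functions that work purely on code points.  The names unichar and unicode_fn and their exact behavior (for example, looking only at the first code point) are assumptions for illustration, not spec text.

```python
def unichar(code_point: int) -> str:
    """Hypothetical UNICHAR: the text for one Unicode code point."""
    return chr(code_point)


def unicode_fn(text: str) -> int:
    """Hypothetical UNICODE: the code point of the first character."""
    return ord(text[0])


print(unichar(0x20AC))             # €
print(hex(unicode_fn("\u20acuro")))  # 0x20ac

# Unlike CODE/CHAR, the round trip does not depend on any code page.
print(unicode_fn(unichar(0x1F600)) == 0x1F600)  # True
```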

Patrick D. will examine some existing specifications and post an email discussing how they handle some of the character issues.  Wheeler asked him to specifically examine how they handle character normalization (e.g., if a character can be represented as either a letter followed by a combining accent, or as a character with the accent embedded, how is this handled, especially by "="?).

Next meeting will be 2009-09-15.

--- David A. Wheeler