office-formula message



Subject: RE: Re: [office-formula] Summary of 2009-10-13 teleconference


I follow the arguments here and am aligned with David's appraisal.

 - Dennis

MORE THOUGHTS ON CHAR AND CODE

However, I find it odd that there is a presumption that CODE and CHAR
might not have a wider implementation-dependent range than merely 128 to 255.
The problem we have is that if CODE produces values greater than 127 and
CHAR accepts values greater than 127, we have no idea what character set
those are understood to be code points for, and in an interchange situation
the only agreement that can be made is by some sort of out-of-band
arrangement.

In addition, CHAR may fail for (some) parameter values beyond 127, for
negative values, and, I suppose, for non-integral values, if we don't
already cover that case in terms of what is accepted where an (unsigned)
integer value is required.  Certainly if the understood character set has no
code point for a given value, some sort of error should probably result.  (I
note, in passing, that in double-byte codings, some octet values from 128 to
255 are reserved for the pair-halves and are not code points as that term is
generally meant.  Even for Unicode, there are numbers within the code point
range that are banned as code points; CODE should never produce one and
CHAR should probably fail to accept one.)
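Both failure modes above can be illustrated with a short Python sketch
(Python is used purely for illustration here, not as any of the
implementations under discussion): a Unicode surrogate is a number inside
the code point range that is banned in interchange, and a lone
double-byte-coding lead octet in 128..255 is only half of a character.

```python
# Unicode surrogates U+D800..U+DFFF sit inside the code point range but
# are banned as interchange characters: they cannot be encoded in UTF-8.
try:
    "\ud800".encode("utf-8")
except UnicodeEncodeError:
    print("U+D800 is not encodable; a CHAR-like function should reject it")

# In Shift_JIS (a double-byte coding), octet 0x81 is reserved as the lead
# half of a two-byte pair; on its own it is not a code point at all.
try:
    b"\x81".decode("shift_jis")
except UnicodeDecodeError:
    print("octet 0x81 alone is not a Shift_JIS character")
```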

I suggested previously that interoperability *failure* would be easier to
detect, and to gracefully allow for in whatever the purpose of the formula
is, if there were a way to either (1) inquire of the implementation or (2)
specify to the implementation what character-set code points are to be
understood as parameters of CHAR and results of CODE (outside the ASCII
range).  Although this is exactly the kind of situation where a standard can
reasonably invent something to encourage movement toward interoperable
practice, I recall being dealt the "when someone implements this, we can
consider standardizing it" card.

  "Standards are arbitrary solutions to recurring problems." 
     -- Robert W. Bemer 
        (sometimes known as the father of ASCII).

 
Dennis E. Hamilton
------------------
NuovoDoc: Design for Document System Interoperability 
mailto:Dennis.Hamilton@acm.org | gsm:+1-206.779.9430 
http://NuovoDoc.com http://ODMA.info/dev/ http://nfoWorks.org 






-----Original Message-----
From: David A. Wheeler [mailto:dwheeler@dwheeler.com] 
Sent: Wednesday, October 14, 2009 10:05
To: office-formula@lists.oasis-open.org
Subject: Fwd: Re: [office-formula] Summary of 2009-10-13 teleconference

I said:
> > * The "Text" (String) type will simply be defined as something that
> > can contain 0 or more "characters".  We will separately discuss what
> > "character" means.  I personally think we should recommend, but not
> > require, that implementations support a character set and encoding
> > that permit any legal Unicode code point to be a character, but we
> > didn't discuss that today.
 
Eike Rathke:
> I think we should say that implementations
> - shall support Unicode BMP and
> - should support the entire Unicode character range.
> 
> This comforts those that internally use UCS-2 encoding only, though
> I don't know whether there are implementations so limited. I don't think
> there would be implementations of OpenFormula that do not support
> Unicode BMP.

I get the impression that some implementations (at least Excel) *vary* the
internal coding they use for characters based on a platform setting, and
that this fact leaks into the results of text processing.  E.g., if a
platform's setting says to use 8-bit ISO 8859-1 (Latin-1), then the
implementation CANNOT represent the entire Unicode BMP internally, even
though it CAN do so under a different platform setting, and even though it
IS capable of reading XML files containing the entire Unicode range.
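The Latin-1 limitation can be made concrete with a small Python sketch
(illustrative only; the "internal encoding" parameter stands in for a
hypothetical platform codepage setting, not any real implementation's API):

```python
def fits_internal(text: str, internal_encoding: str = "latin-1") -> bool:
    """Return True if text survives conversion to the internal encoding.

    Models a platform whose internal character representation is a fixed
    single-byte codepage rather than a full-Unicode encoding.
    """
    try:
        text.encode(internal_encoding)
        return True
    except UnicodeEncodeError:
        return False

assert fits_internal("café")            # U+00E9 is in Latin-1
assert not fits_internal("Ω")           # U+03A9 (Greek) is not
assert fits_internal("Ω", "utf-8")      # a full-Unicode internal coding is fine
```

The point is that the same document can be read successfully from XML and
still lose characters the moment it is converted to such an internal form.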

Now, we could say that implementations simply have to have a
standards-compliant setting, but if people won't make that the NORMAL case,
then there's a problem.  We don't want a "standards ghetto" that people
won't want to use in real life.

> > * We will need to add a discussion noting that implementations may
> > have a specific character set and character encoding as a setting,
> > and that this may limit which characters may be included in strings.
> 
> I don't see how that would affect the specification other than we could
> note that some implementations are limited and thus results may differ
> if an ODF/OpenFormula document is read by such. I don't think that is
> the responsibility of the specification though.

It does if it affects interoperability.

> > Do we need to have a way to STORE this information in an OpenDocument
> > file? If so, how?
> 
> I don't think so. Strings are stored in the encoding given by
> <?xml encoding="...">.
> I don't see much benefit in storing the internal encoding of the
> generating implementation, other than readers would be required to
> convert from their internal Unicode encoding to that other encoding for
> functions such as CODE() and CHAR(). Doing so would impose a bunch of
> otherwise unnecessary conversion routines on implementations, maybe even
> including encodings not registered with IANA. However, we define those
> functions to be ASCII for values 1<=N<=127 and to be
> implementation-dependent for values 128<=N<=255 anyway.

The issue is that if a platform uses an internal representation that CANNOT
represent arbitrary Unicode strings, then we need to give the implementation
information about what the character requirements ARE, so that data isn't
lost.  And we must give that information before character data is converted
to the internal format; otherwise it's a pain.  E.g., it's okay for an
implementation to use a Latin-1 representation internally AS LONG AS all
characters to be exchanged in that spreadsheet are in the Latin-1 set.

All of this is irrelevant if the implementation always uses an internal
representation that can represent all Unicode code points (e.g., UTF-8 or
UTF-16).  But this does not appear to be the case.

> > * CHAR() and CODE() are *not* deprecated... they stay in.  Instead,
> > they "normalize" to ASCII values, regardless of the internal
> > representation.  I believe this only affects those who use a
> > particular Arabic encoding that uses 0...127 for non-ASCII
> > characters, and only when using those functions.
> 
> We still define 128..255 to be implementation-dependent, yes?

Yes, unless someone objects.
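The agreed semantics can be sketched in a few lines of Python (hypothetical
helper names, not any implementation's API; the `codepage` parameter models
the implementation-dependent choice for 128..255): values 1..127 are fixed
to ASCII regardless of internal representation, and the rest depend on the
implementation's chosen character set.

```python
def char(n: int, codepage: str = "latin-1") -> str:
    """CHAR-like sketch: ASCII for 1..127, codepage-dependent for 128..255."""
    if not (isinstance(n, int) and 1 <= n <= 255):
        raise ValueError("CHAR argument out of range")
    if n <= 127:
        return chr(n)                      # fixed: ASCII portion
    return bytes([n]).decode(codepage)     # implementation-dependent portion

def code(ch: str, codepage: str = "latin-1") -> int:
    """CODE-like sketch: inverse of char() under the same codepage."""
    if ord(ch) <= 127:
        return ord(ch)                     # fixed: ASCII portion
    return ch.encode(codepage)[0]          # implementation-dependent portion

assert char(65) == "A" and code("A") == 65   # ASCII round-trips everywhere
assert code(char(200)) == 200                # >127 round-trips only per codepage
```

Two implementations agree on this pair of functions only up to 127; above
that, agreement requires agreeing on the codepage, which is exactly the
out-of-band problem discussed earlier in the thread.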

--- David A. Wheeler



