Subject: [OASIS Issue Tracker] Commented: (OFFICE-1935) Review 1.2 specification with respect to Unicode usage
From: OASIS Issues Tracker <workgroup_mailer@lists.oasis-open.org>
To: office@lists.oasis-open.org
Date: Fri, 24 Sep 2010 14:04:15 -0400 (EDT)

    [ http://tools.oasis-open.org/issues/browse/OFFICE-1935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=21583#action_21583 ] 

Dennis Hamilton commented on OFFICE-1935:
-----------------------------------------

Rob, this seems to be heading in the right direction.  I have a few observations, some of which are broader than just Unicode usage, so perhaps we should split out some subtasks:

Regarding 6.1.1
----------------------

It looks like it should say something like "contain the character data that comprises the text of the document" and then go on to talk about character data.  I suppose we mean to limit ourselves to the character data of the OpenDocument text (the character data of <text:p> text, of <text:h> text, etc.) and not all character data in the XML markup of the ODF document generally.  In particular, character data in the XML sense may include white space, unnormalized white space, etc.  We also need to be clear how the character data of field elements in the markup fits into this story.

If I follow your remark, the XML definition of character data doesn't work for us.  It holds for the XML document but not for the ODF document, in that there is character data in the XML sense that is not part of the text of the document (and that may be in <text:p> and <text:h> elements).  For example, there is character data in the elements of <text:tracked-changes> that is not to be considered part of the text of the document (especially in the annotations that are permitted in the <text:tracked-changes> child elements).

The RNG Schema [IS 19757-2:2003] speaks of element children, which consist of any child elements and also, when allowed by the <text /> schema pattern, non-empty strings of XML-allowed characters.  Technically, an element has an ordered sequence of zero or more children, where each child is either an element or a non-empty string, and there are no consecutive string children.  (If there is no <text /> pattern, then apparently any non-empty strings consisting entirely of white space are ignored, and there are evidently no string children as far as the data model is concerned.)  (There are other cases, such as text-only elements where the string must satisfy a datatype.  I think we can safely regard those as equivalent to having a specialized <text /> pattern, and contributing to the text content of the document accordingly.)
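
To make the children model concrete, here is a minimal Python sketch of it.  The names Element, normalize_children and allows_text are hypothetical and appear in neither the RNG Schema nor ODF; the sketch only assumes the reading given above: children are elements or non-empty strings, adjacent strings merge into one, and white-space-only strings are dropped where no <text /> pattern applies.

```python
from dataclasses import dataclass, field
from typing import List, Union

XML_WHITE_SPACE = " \t\r\n"  # the four white-space characters XML recognizes


@dataclass
class Element:
    """Hypothetical stand-in for an element in the data model."""
    name: str
    children: List[Union["Element", str]] = field(default_factory=list)


def normalize_children(raw, allows_text=True):
    """Return children with no empty and no consecutive string children.

    allows_text=False models (as an assumption) content with no <text />
    pattern, where white-space-only strings are ignored.
    """
    merged = []
    for child in raw:
        if isinstance(child, str):
            if child == "":
                continue  # string children are non-empty by definition
            if merged and isinstance(merged[-1], str):
                merged[-1] += child  # merge consecutive strings into one
            else:
                merged.append(child)
        else:
            merged.append(child)
    if not allows_text:
        merged = [
            c for c in merged
            if not (isinstance(c, str) and c.strip(XML_WHITE_SPACE) == "")
        ]
    return merged
```

For example, normalize_children(["a", "", "b", Element("text:span"), "c"]) yields the string "ab", the span element, and the string "c", while indentation-only strings around a child element disappear when allows_text is False.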

Regarding 6.1.2(1)
--------------------------
It looks like the bullet listing SPACE (U+0020) should simply be removed.  The collapsing of the existing and resulting SPACE occurrences seems to be handled as intended by steps (2-4).

Interaction with 6.1.3
---------------------------

I note that 6.1.3 makes no sense in this context.  Especially the SHALL.  It seems to me that <text:s> is used to represent a single space in the text's character data that is not subject to white-space collapsing.  Period.  Then drop the note in 6.1.3 also.  It might be important to observe that the rules in 6.1.2 do not consider the <text:s> space to be a space subject to consideration in 6.1.2.  So
   <text:p>&#x20;&#x20;<text:s />&#x20;<text:s />&#x20;&#x20;</text:p> contributes exactly three consecutive SPACE characters to the text of the document if it contributes any.
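
The count of three can be checked with a rough Python sketch.  It assumes the 6.1.2 steps amount to: replace TAB, CR and LF with SPACE, concatenate the character data, strip leading and trailing SPACE, and collapse runs of SPACE to one.  It also assumes each <text:s> is opaque to those steps.  The function name paragraph_text and the private-use placeholder are hypothetical, and only direct character data and <text:s> children of the paragraph are handled.

```python
import re
import xml.etree.ElementTree as ET

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
PLACEHOLDER = "\uE000"  # private-use stand-in so <text:s> spaces are not collapsed


def paragraph_text(xml: str) -> str:
    """Sketch of white-space handling for one <text:p> (assumptions above)."""
    p = ET.fromstring(xml)
    parts = [p.text or ""]
    for child in p:
        if child.tag == f"{{{TEXT_NS}}}s":
            # text:c is the optional repeat count on <text:s>; default is 1
            count = int(child.get(f"{{{TEXT_NS}}}c", "1"))
            parts.append(PLACEHOLDER * count)
        # content of other child elements is ignored in this sketch
        parts.append(child.tail or "")
    data = "".join(parts)
    data = re.sub(r"[\t\r\n]", " ", data)  # white space becomes SPACE
    data = data.strip(" ")                 # drop leading and trailing SPACE
    data = re.sub(r" +", " ", data)        # collapse runs of SPACE
    return data.replace(PLACEHOLDER, " ")  # now materialize the <text:s> spaces


example = (
    '<text:p xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">'
    "&#x20;&#x20;<text:s />&#x20;<text:s />&#x20;&#x20;</text:p>"
)
result = paragraph_text(example)
```

Under these assumptions the two leading and two trailing spaces are stripped, the single space between the two <text:s> elements survives, and result is a string of exactly three SPACE characters.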

Regarding 19.135.1.  
---------------------------
I suggest simply noting that the string datatype is used, by saying "the name may be an arbitrary value of string datatype".

Regarding 19.364
------------------------
I am not sure the "with numeric value 1" is pertinent, unless DIGIT ONE actually occurs where that is not the case.  In any case, we definitely need a way to find those text files.  Also, it should be made clear that the default value is the single character "1" (DIGIT ONE, U+0031), not an integer having that value.
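
As an illustration of why "numeric value 1" and DIGIT ONE are not interchangeable, the following Python sketch uses the standard unicodedata module to show two distinct characters that share the numeric value 1.  The helper describe is hypothetical, and ARABIC-INDIC DIGIT ONE is just one convenient example of such a character.

```python
import unicodedata


def describe(ch: str):
    """Return (code point, Unicode character name, numeric value) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.numeric(ch))


default_value = "1"       # the single character DIGIT ONE, not the integer 1
other_one = "\u0661"      # a different character that also has numeric value 1

info_default = describe(default_value)
info_other = describe(other_one)
same_character = default_value == other_one   # False: different code points
is_integer = default_value == 1               # False: a string is not an integer
```

Both characters report a numeric value of 1.0, yet only U+0031 is DIGIT ONE, which is why naming the code point is less ambiguous than describing the numeric value.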

Regarding Section 19.598
-----------------------------------
I agree about the use of explicit quotes.  I believe it should be single quotes (APOSTROPHE, U+0027), which are out of harmony with OpenFormula but more convenient for attribute values enclosed in double-quote characters.  We may need a rule that allows a pair of them to stand for a single quote in the string itself.
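
One possible shape for such an escaping rule, sketched in Python.  This is only an assumption about how a doubled APOSTROPHE could work; the functions quote and unquote are hypothetical and nothing here is taken from the specification text.

```python
APOSTROPHE = "\u0027"


def quote(value: str) -> str:
    """Wrap value in single quotes, doubling any APOSTROPHE inside it."""
    return APOSTROPHE + value.replace(APOSTROPHE, APOSTROPHE * 2) + APOSTROPHE


def unquote(literal: str) -> str:
    """Reverse quote(); reject literals with a lone, unescaped APOSTROPHE."""
    if len(literal) < 2 or literal[0] != APOSTROPHE or literal[-1] != APOSTROPHE:
        raise ValueError("not a single-quoted literal")
    body = literal[1:-1]
    # After removing every doubled pair, no APOSTROPHE may remain.
    if APOSTROPHE in body.replace(APOSTROPHE * 2, ""):
        raise ValueError("unescaped APOSTROPHE inside literal")
    return body.replace(APOSTROPHE * 2, APOSTROPHE)
```

With this rule the value it's is written as 'it''s', which can sit inside a double-quoted XML attribute value without any further escaping.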

> Review 1.2 specification with respect to Unicode usage
> ------------------------------------------------------
>
>                 Key: OFFICE-1935
>                 URL: http://tools.oasis-open.org/issues/browse/OFFICE-1935
>             Project: OASIS Open Document Format for Office Applications (OpenDocument) TC
>          Issue Type: Bug
>          Components: Locale, Text
>    Affects Versions: ODF 1.2 CD 05
>            Reporter: Robert Weir 
>            Assignee: Robert Weir 
>             Fix For: ODF 1.2 CD 06
>
>
> We should review the ODF 1.2 specification, in particular for the following:
> 1) Are all character literals specifying their code points, e.g., '1' (U+0031).  Remember, not every reader of the standard will be a native English speaker or even a native user of Latin-1 characters.  Since Unicode defines several characters that may look like a plus sign, or a dash, we need to be explicit.
> 2) Are we crystal clear on whitespace treatment?
> 3) Bidi?
> 4) Whenever we talk about sorting, are we clear on whether this is lexical or a locale-dependent collation order?
> 5) What Unicode version? 
> 6) For most of ODF we can deal with Unicode characters and strings of Unicode characters without discussing encodings.  For serialization we permit whatever XML permits and we don't need to deal with encoded characters.  However there are some exceptions that we need to be more explicit with.  One is passwords entered during encryption.  Since the encryption algorithms work at the bit level, both encoding and byte ordering need to be specified.
> 7) Any functions that deal with upper case/lower case conversions, such as in OpenFormula, need to make sure they are specified correctly with respect to Unicode.  
> 8) Anything else?
> Suggested search phrases are: character*, sort, search, collation, unicode, encod*, encrypt*, string (unless it is xsd:string), *space, dash, hyphen, 

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://tools.oasis-open.org/issues/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira