Subject: RE: [dita] [dita-translation] TC/DITA/Translation Subcommittee Proposals
From: "Paul Prescod" <paul.prescod@blastradius.com>
To: "Esrig, Bruce (Bruce)" <esrig@lucent.com>, <dita@lists.oasis-open.org>
Date: Wed, 15 Mar 2006 10:19:59 -0800
My understanding is that these best practices were worked out with a
much larger team of experts at the W3C. Is it really effective for us to
re-engineer them? It would be better to wait for the translation
sub-committee to document the industry best practices in detail and then
explain the standard reasoning rather than start from scratch as if we
are inventing something new. As an XML vendor, I really do not want
XHTML, DocBook and DITA to adopt different rules for this stuff for no
reason.

With respect to your discussion below, I have some issues of
terminology: 

Bytes: bytes are not really relevant. The lowest level of abstraction we
should discuss is characters. 

Presentation: As far as I know, nothing we are talking about here can
really be termed "presentation". If characters are not shown in the
correct order they cannot be deterpretni yb eht redaer. So we're talking
about something between content and structure but not something
context-mutable like presentation.

> -----Original Message-----
> From: Esrig, Bruce (Bruce) [mailto:esrig@lucent.com] 
> Sent: Wednesday, March 15, 2006 8:10 AM
> To: dita@lists.oasis-open.org
> Cc: 'Eliot Kimber'
> Subject: RE: [dita] [dita-translation] TC/DITA/Translation 
> Subcommittee Proposals
> 
> Perhaps we are already doing this, but we should carefully 
> distinguish the two levels: byte stream and markup.
> 
> Here's a possible way to approach it.
> 
> The innermost element in the DITA markup should be a nestable 
> phrase-like element that has an attribute that signifies the 
> direction in which the text should be presented. Surrounding 
> elements can take the same attribute. The applicable value of 
> the direction (according to the markup) for a particular 
> stream of characters is the value of the attribute for the 
> innermost containing element that has a value specified for 
> that attribute.
> 
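The "innermost containing element wins" rule described above can be sketched in a few lines. This is only an illustration, not a proposal for the spec: the attribute name `dir`, the values `ltr`/`rtl`, the default, and the helper `effective_dir` are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

def effective_dir(root, target, default="ltr"):
    """Return the direction set on the innermost ancestor-or-self of target.

    The attribute name "dir" and the default value are hypothetical.
    """
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent = {child: p for p in root.iter() for child in p}
    node = target
    while node is not None:
        value = node.get("dir")
        if value is not None:
            return value
        node = parent.get(node)  # None once we walk past the root
    return default

doc = ET.fromstring(
    '<p dir="rtl">outer <ph>inherits <ph dir="ltr">inner</ph></ph></p>'
)
outer_ph = doc.find("ph")
inner_ph = outer_ph.find("ph")

print(effective_dir(doc, outer_ph))  # inherited from <p>
print(effective_dir(doc, inner_ph))  # local override by nesting
```

Because a nested element can always restate the attribute, no separate "override" value is needed at the markup level, which is the point made in the next quoted paragraph.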
> Unicode characters within the markup could override the 
> attribute setting according to the markup, using the full set 
> of Unicode conventions. However, the markup should not need 
> an override value for an attribute because the markup can use 
> inner nesting whenever a local override is required. The 
> value of the attribute in the DITA markup should be specified 
> to have this behavior. The Unicode values may not be good 
> names for the attribute values, since the override behavior 
> is different.
> 
> Because the DITA markup may be more concerned with 
> presentation than with byte order in the source, we should 
> keep in mind the possibility of covering the alternate 
> directions that are sometimes used in Asian languages. 
> (XSL-FO allows multiple directions, but the main example is: 
> text direction top to bottom, line propagation direction 
> right to left.)
> 
> Best wishes,
> 
> Bruce
> 
> -----Original Message-----
> From: Eliot Kimber [mailto:ekimber@innodata-isogen.com]
> Sent: Wednesday, March 15, 2006 10:21 AM
> To: dita@lists.oasis-open.org
> Subject: Re: [dita] [dita-translation] TC/DITA/Translation 
> Subcommittee Proposals
> 
> 
> Gershon L Joseph wrote:
> 
> > 1. The Unicode standard defines a default direction for 
> each language. 
> > For example, for English this default direction is LTR and 
> for Hebrew it's RTL.
> 
> To expand on Gershon's explanation a little bit:
> 
> The directionality is actually defined for each character, 
> not for the language. Unicode doesn't deal directly with 
> languages but rather with scripts (that is, when Unicode 
> talks about "Arabic" they mean the script named Arabic, not 
> the language Arabic, which happens to use the Arabic script). 
> For most languages there is a direct mapping to a script (but 
> not always, although those cases usually fall outside the set 
> of languages used for technical publications). Thus, while we 
> usually informally talk about "languages" when dealing with 
> things at the character level we usually really mean "scripts".
> 
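To make the per-character point concrete, here is a small sketch using Python's standard `unicodedata` module, which exposes the bidirectional class assigned to each character. The helper name `bidi_classes` and the sample string are mine, chosen only for illustration.

```python
import unicodedata

def bidi_classes(text):
    """Pair each character with its Unicode bidirectional class."""
    return [(ch, unicodedata.bidirectional(ch)) for ch in text]

# Latin letter, Hebrew alef, Arabic alef, a digit, and a paren.
sample = "A\u05D0\u06271("

for ch, cls in bidi_classes(sample):
    print(f"U+{ord(ch):04X} {cls}")
```

The classes are properties of the characters themselves (L for left-to-right, R and AL for right-to-left, EN for European digits, ON for neutrals such as parens), with no reference to the language of the surrounding text.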
> With respect to controlling (or not controlling) 
> directionality: in addition to the cases where you have to 
> adjust the default directionality, there is the case of 
> characters that are naturally paired and that, by default, 
> are rendered to reflect the current directionality, e.g., 
> parens "(" and square brackets "[".
> 
> For example, given this source data:
> 
> <p>(arabic characters)</p>
> 
> The rendered result would be:
> 
> (sretcarahc cibara)
> 
> not:
> 
> )sretcarahc cibara(
> 
> 
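The paired characters in question are the ones Unicode flags as "mirrored". As a rough sketch, the standard `unicodedata.mirrored` function can list them in a string; the helper name `mirrored_chars` is an assumption for the example.

```python
import unicodedata

def mirrored_chars(text):
    """Return the characters in text that carry the Unicode mirrored property."""
    return [ch for ch in text if unicodedata.mirrored(ch)]

print(mirrored_chars("(XSL-FO) [1]"))
```

A renderer is expected to swap the glyphs of these characters when they end up in a right-to-left run, which is why the parens in the first example above still look correct.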
> However, when the parens contain or are adjacent to 
> left-to-right characters then things can get confused (and 
> confusing). I've seen this most when you have numbers or 
> Latin-script words embedded in right-to-left text within parens, e.g.:
> 
> <p>arabic characters (XSL-FO) arabic characters</p>
> 
> In this case the directionality of the enclosing Arabic 
> characters causes the parens to be "flipped", giving results 
> like this:
> 
> sretcarahc cibara )XSL-FO( sretcarahc cibara
> 
> or
> 
> sretcarahc cibara (XSL-FO( sretcarahc cibara
> 
> This can be addressed by using LRO or RLO characters or by 
> using LRE or RLE (left-to-right embedding and right-to-left 
> embedding) characters in the input data stream.
> 
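As a minimal sketch of the embedding approach: wrap the left-to-right run, parens included, between LRE (U+202A) and PDF (U+202C, "pop directional formatting"). The helper name `embed_ltr` and the sample Arabic word are assumptions, and whether a given tool then renders the parens correctly depends on its bidi implementation.

```python
import unicodedata

LRE = "\u202A"  # LEFT-TO-RIGHT EMBEDDING
RLE = "\u202B"  # RIGHT-TO-LEFT EMBEDDING
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING, closes an embedding

def embed_ltr(text):
    """Wrap text in an explicit left-to-right embedding."""
    return f"{LRE}{text}{PDF}"

arabic = "\u0646\u0635"  # a short Arabic word, used only as filler
paragraph = f"{arabic} {embed_ltr('(XSL-FO)')} {arabic}"

# Show the invisible control characters by name.
for ch in paragraph:
    if unicodedata.category(ch) == "Cf":
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

The control characters live in the character data itself, so this works even when the markup cannot be changed.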
> However, this is complicated by the fact that many tools do 
> not implement the Unicode bi-di rules correctly. In 
> particular we found that tools that rely on the Windows 
> libraries for doing right-to-left rendering give different 
> results from tools (mostly FO implementations) that implement 
> the algorithm themselves (and presumably better). 
> However, the algorithm is difficult enough to understand that 
> it's hard to prove that a given implementation is or isn't 
> correct. So you are often faced with just hacking the source 
> data until you get the result you want.
> 
> In my work with rendering Hebrew and Arabic technical 
> documents where the translators are creating the non-English 
> text *and* they can't modify the markup, their only option is 
> to add the actual Unicode characters to the data stream 
> (which they can always do).
> 
> Cheers,
> 
> Eliot
> --
> W. Eliot Kimber
> Professional Services
> Innodata Isogen
> 9390 Research Blvd, #410
> Austin, TX 78759
> (512) 372-8841
> 
> ekimber@innodata-isogen.com
> www.innodata-isogen.com
>