dita message

Subject: RE: [dita] [dita-translation] TC/DITA/Translation Subcommittee Proposals
From: "Esrig, Bruce (Bruce)" <esrig@lucent.com>
To: dita@lists.oasis-open.org
Date: Wed, 15 Mar 2006 11:09:43 -0500

Perhaps we are already doing this, but we should carefully distinguish the two levels: byte stream and markup.

Here's a possible way to approach it.

The innermost element in the DITA markup should be a nestable phrase-like element that has an attribute that signifies the direction in which the text should be presented. Surrounding elements can take the same attribute. The applicable value of the direction (according to the markup) for a particular stream of characters is the value of the attribute for the innermost containing element that has a value specified for that attribute.
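As a rough illustration of the "innermost containing element" rule, here is a minimal Python sketch. The attribute name "dir", its values, the sample markup, and the helper effective_dir are all assumptions made for the example; they are not taken from any DITA specification.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: a "dir" attribute on nestable elements.
SAMPLE = '<p dir="rtl">outer <ph>inherits <ph dir="ltr">inner</ph></ph></p>'

def effective_dir(root, target, attr="dir", default="ltr"):
    """Return the attribute value from the innermost element (target
    itself or one of its ancestors) that specifies it."""
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    node = target
    while node is not None:
        value = node.get(attr)
        if value is not None:
            return value
        node = parent_of.get(node)  # None once we pass the root
    return default  # assumed fallback when nothing sets the attribute

root = ET.fromstring(SAMPLE)
outer_ph = root[0]      # <ph> with no attribute of its own
inner_ph = outer_ph[0]  # <ph dir="ltr">

print(effective_dir(root, outer_ph))  # value inherited from <p>
print(effective_dir(root, inner_ph))  # value set locally by nesting
```

A local override is expressed purely by nesting another element, so the attribute itself needs no separate "override" value.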

Unicode characters within the markup could override the attribute setting according to the markup, using the full set of Unicode conventions. However, the markup should not need an override value for an attribute because the markup can use inner nesting whenever a local override is required. The value of the attribute in the DITA markup should be specified to have this behavior. The Unicode values may not be good names for the attribute values, since the override behavior is different.

Because the DITA markup may be more concerned with presentation than with byte order in the source, we should keep in mind the possibility of covering the alternate directions that are sometimes used in Asian languages. (XSL-FO allows multiple directions, but the main example is: text direction top to bottom, line propagation direction right to left.)

Best wishes,

Bruce

-----Original Message-----
From: Eliot Kimber [mailto:ekimber@innodata-isogen.com]
Sent: Wednesday, March 15, 2006 10:21 AM
To: dita@lists.oasis-open.org
Subject: Re: [dita] [dita-translation] TC/DITA/Translation Subcommittee
Proposals


Gershon L Joseph wrote:

> 1. The Unicode standard defines a default direction for each language. For
> example, for English this default direction is LTR and for Hebrew it's RTL.

To expand on Gershon's explanation a little bit:

The directionality is actually defined for each character, not for the
language: Unicode doesn't deal directly with languages but rather with
scripts. (That is, when Unicode talks about "Arabic" it means the script
named Arabic, not the Arabic language, which happens to use the Arabic
script.) For most languages there is a direct mapping to a script (but
not always, although those cases usually fall outside the set of
languages used for technical publications). Thus, while we usually talk
informally about "languages", when dealing with things at the character
level we usually really mean "scripts".
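To make the per-character point concrete, here is a small Python sketch that looks up the bidirectional class of a few characters using the standard unicodedata module. The helper name bidi_classes and the choice of sample characters are just for illustration.

```python
import unicodedata

def bidi_classes(text):
    """Return (character name, bidi class) for each character in text."""
    return [(unicodedata.name(ch), unicodedata.bidirectional(ch)) for ch in text]

# Latin letter, Hebrew letter, Arabic letter, European digit, paren.
sample = "a\u05d0\u06271("

for name, bidi in bidi_classes(sample):
    print(f"{name}: {bidi}")
# Classes: L (left-to-right), R (right-to-left), AL (Arabic letter),
# EN (European number), ON (other neutral)
```

Note that the lookup is keyed on the character alone; no language information is involved.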

With respect to controlling (or not controlling) directionality: in
addition to the cases where you have to adjust the default
directionality, there is the case of characters that are naturally
paired and that, by default, are rendered to reflect the current
directionality, e.g., parens "(" and square brackets "[".

For example, given this source data:

<p>(arabic characters)</p>

The rendered result would be:

(sretcarahc cibara)

not:

)sretcarahc cibara(
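The "naturally paired" characters above are, as far as I can tell, the ones Unicode marks with the mirrored property. A minimal Python sketch to check that property (the helper name is_mirrored is only for the example):

```python
import unicodedata

def is_mirrored(ch):
    """True if Unicode marks ch as mirrored in bidirectional text."""
    return unicodedata.mirrored(ch) == 1

for ch in "()[]a1":
    print(repr(ch), is_mirrored(ch))
```

Parens and square brackets report as mirrored; ordinary letters and digits do not.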


However, when the parens contain or are adjacent to left-to-right
characters then things can get confused (and confusing). I've seen this
most when you have numbers or Latin-script words embedded in
right-to-left text within parens, e.g.:

<p>arabic characters (XSL-FO) arabic characters</p>

In this case the directionality of the enclosing Arabic characters 
causes the parens to be "flipped", giving results like this:

sretcarahc cibara )XSL-FO( sretcarahc cibara

or

sretcarahc cibara (XSL-FO( sretcarahc cibara

This can be addressed by using LRO or RLO (left-to-right override and
right-to-left override) characters or by using LRE or RLE (left-to-right
embedding and right-to-left embedding) characters in the input data
stream.
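For reference, a minimal Python sketch of what adding these characters to the data stream looks like. The code points are the standard Unicode ones; the helper names embed_ltr and override_ltr, and the choice of which span to wrap, are assumptions for the example. Whether a given wrapping produces the desired rendering depends on the renderer.

```python
import unicodedata

LRE = "\u202a"  # LEFT-TO-RIGHT EMBEDDING
RLE = "\u202b"  # RIGHT-TO-LEFT EMBEDDING
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING (closes an embedding/override)
LRO = "\u202d"  # LEFT-TO-RIGHT OVERRIDE
RLO = "\u202e"  # RIGHT-TO-LEFT OVERRIDE

def embed_ltr(text):
    """Wrap text in a left-to-right embedding, closed with PDF."""
    return f"{LRE}{text}{PDF}"

def override_ltr(text):
    """Wrap text in a left-to-right override, closed with PDF."""
    return f"{LRO}{text}{PDF}"

# Show the code points that end up in the data stream.
for ch in embed_ltr("XSL-FO"):
    if unicodedata.category(ch) == "Cf":  # invisible format characters
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
    else:
        print(ch)
```

The characters live in the text itself, so they can be added without touching the surrounding markup.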

However, this is complicated by the fact that many tools do not 
implement the Unicode bi-di rules correctly. In particular we found that 
tools that rely on the Windows libraries for doing right-to-left 
rendering give different results from tools (mostly FO implementations) 
that implement the algorithm themselves (and presumably better). 
However, the algorithm is difficult enough to understand that it's hard 
to prove that a given implementation is or isn't correct. So you are 
often faced with just hacking the source data until you get the result 
you want.

In my work with rendering Hebrew and Arabic technical documents, where
the translators are creating the non-English text *and* can't modify
the markup, their only option is to add the actual Unicode characters
to the data stream (which they can always do).

Cheers,

Eliot
-- 
W. Eliot Kimber
Professional Services
Innodata Isogen
9390 Research Blvd, #410
Austin, TX 78759
(512) 372-8841

ekimber@innodata-isogen.com
www.innodata-isogen.com
Follow-Ups:
- RE: [dita] [dita-translation] TC/DITA/Translation Subcommittee Proposals
  - From: "Gershon L Joseph" <gershon@tech-tav.com>