docbook-apps message

Subject: Re: [docbook-apps] Writing mode, xsl-fo output

From: "Tony Graham" <tgraham@mentea.net>
To: docbook-apps@lists.oasis-open.org
Date: Fri, 1 Apr 2011 21:31:48 +0100 (IST)

On Fri, April 1, 2011 7:08 pm, maxwell wrote:
> On Fri, 1 Apr 2011 10:40:16 -0700, "Bob Stayton" <bobs@sagehill.net>
> wrote:
>> But when you say "some rl-tb" text, do you mean a mixed language
>> document?  In that case, the writing mode value should be for the
>> dominant language, since the document's writing mode determines the
>> page layout.  Any inline translated text should get the correct text
>> direction based on its Unicode character range.
>
> That last sentence--that the writing direction can be determined by
> inspecting the characters--is a common intuition (it was once my own
> intuition).  But it isn't quite that simple, since some symmetrical
> punctuation marks belong sometimes to L2R text, and sometimes to R2L text.

The conventional approach is to implement the Unicode Bidirectional
Algorithm [1] (or use a library that already implements it).  It may not
be perfect -- every so often you'll meet people who say it isn't good
enough -- but since it's up to revision 23 so far, you'll see they're
still trying to make it as perfect as possible.

> For example, an ASCII period at the end of a run of R2L text might belong
> at the left end of the R2L text, or--if the R2L text is at the end of an
> L2R text--it might belong at the right end of the L2R text (and therefore
> at the right end of the R2L text).

The BIDI algorithm has rules about resolving direction among characters
with strong, weak, and neutral directionality.
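
To make the strong/weak/neutral distinction concrete, here is a minimal
Python sketch that looks up each character's bidi class with the standard
unicodedata module and sorts it into those groups.  The function name
bidi_category and the grouping sets are my own illustration, not part of
any library, and this only classifies characters; it does not resolve or
reorder anything the way a full implementation would.

```python
import unicodedata

# Bidi classes grouped into the three categories the algorithm talks about.
STRONG = {"L", "R", "AL"}
WEAK = {"EN", "ES", "ET", "AN", "CS", "NSM", "BN"}
NEUTRAL = {"B", "S", "WS", "ON"}


def bidi_category(ch):
    """Return 'strong', 'weak', 'neutral', 'explicit', or 'unknown'.

    Hypothetical helper for illustration only.
    """
    cls = unicodedata.bidirectional(ch)
    if cls == "":
        return "unknown"  # unassigned code point: no class available
    if cls in STRONG:
        return "strong"
    if cls in WEAK:
        return "weak"
    if cls in NEUTRAL:
        return "neutral"
    return "explicit"  # embedding, override, and isolate controls


# Latin letter, Hebrew alef, Arabic alef, digit, period, paren, space.
for ch in ["a", "\u05d0", "\u0627", "1", ".", "(", " "]:
    cls = unicodedata.bidirectional(ch)
    print(f"U+{ord(ch):04X} {cls:>3} {bidi_category(ch)}")
```

Letters come out strong, while the period, parenthesis, and space do
not, which is why their direction has to be resolved from the
characters around them.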

> Unsymmetrical punctuation marks sometimes exist as distinct L2R and R2L
> code points in Unicode, like the ASCII comma vs. the Arabic comma U+060C.
> But parentheses (which of course are asymmetrical) are also sometimes used
> inside runs of R2L text--I've seen them in Urdu, for example.  Here I
> believe the ASCII open parenthesis is used as an Urdu close paren, and
> vice versa.

If you're using the BIDI algorithm, you'd always enter the open
parenthesis as the '(' character, even when it will be shown with its
mirrored glyph ')'.  See http://www.unicode.org/reports/tr9/#Mirroring
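
As a rough illustration of mirroring, the Python sketch below keeps the
logical character as typed and only swaps the glyph when the character
sits in a right-to-left run.  unicodedata.mirrored() is a real standard
library call; the MIRROR_PAIRS table and the glyph_for_display function
are assumptions of this sketch (the complete pair list lives in the
Unicode data file BidiMirroring.txt), and a real renderer would also
reorder the run.

```python
import unicodedata

# Small illustrative subset of mirrored pairs (assumption of this sketch).
MIRROR_PAIRS = {"(": ")", ")": "(", "[": "]", "]": "[", "<": ">", ">": "<"}


def glyph_for_display(ch, in_rtl_run):
    """Pick the glyph to draw for a logical character.

    Hypothetical helper: the stored text never changes, only the glyph.
    """
    if in_rtl_run and unicodedata.mirrored(ch):
        return MIRROR_PAIRS.get(ch, ch)
    return ch


logical = "("  # what the author types for an opening parenthesis
print(glyph_for_display(logical, in_rtl_run=False))
print(glyph_for_display(logical, in_rtl_run=True))
```

The point is that the stored text always says "open parenthesis"; which
way the glyph faces is decided at display time.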

> Space characters of course also fall into this category of ambiguous
> direction, although that's generally handled correctly by algorithmic
> methods.
>
> There's been considerable discussion of this general issue (whether it's
> possible to algorithmically determine the ends of an R2L run inside an L2R
> run, or vice versa) over on the XeTeX mailing list.  The opinion of Those
> Who Know seems to be that it is not 100% decidable.

Which is why there are also characters for explicit overrides.

XML and other markup languages count as "higher level protocols" for the
purposes of the BIDI algorithm, and a 'dir' attribute or similar should be
used instead of the override characters.  See
http://www.unicode.org/reports/tr20/#Bidi
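
For what it's worth, here is a hedged Python sketch of that
substitution: it takes a run wrapped in the override characters
(U+202E RLO or U+202D LRO, closed by U+202C PDF) and turns it into an
element carrying a dir attribute.  The element name "phrase", the
"rlo"/"lro" attribute values, and the function override_to_markup are
assumptions for illustration, loosely modeled on DocBook 5; check your
own schema for the attribute it actually provides.

```python
import xml.etree.ElementTree as ET

RLO, LRO, PDF = "\u202e", "\u202d", "\u202c"

# Assumed mapping from override character to a 'dir' attribute value.
OVERRIDE_TO_DIR = {RLO: "rlo", LRO: "lro"}


def override_to_markup(text, tag="phrase"):
    """Convert 'RLO ... PDF' or 'LRO ... PDF' into an element with 'dir'.

    Hypothetical helper: returns None when the text is not a single
    override-wrapped run.
    """
    if len(text) >= 2 and text[0] in OVERRIDE_TO_DIR and text[-1] == PDF:
        elem = ET.Element(tag, {"dir": OVERRIDE_TO_DIR[text[0]]})
        elem.text = text[1:-1]  # drop the control characters themselves
        return elem
    return None


elem = override_to_markup(RLO + "abc" + PDF)
print(ET.tostring(elem, encoding="unicode"))
```

The control characters disappear from the text content and the
direction is carried by the markup instead.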

Regards,

Tony Graham
Mentea.

[1] http://www.unicode.org/reports/tr9/
