OASIS Mailing List Archives

 




Subject: [xliff] FW: BiDi draft for discussion


Hi All,

Here is a forwarded copy of a draft proposal for BiDi support in XLIFF 2.0, together with the discussion that followed. Since it affects both the inline codes and the general structural parts of XLIFF, we agreed on the last call that it should be brought to the TC.

Thread in the archive is here: https://lists.oasis-open.org/archives/xliff-inline/201204/msg00009.html

---------------------------------------------------
Original Mail:

As promised on the last conference call, I have created a first draft of schema extensions for BiDi support in XLIFF 2.0. I mostly follow the design for HTML5 and extend it to bilingual files. I'm not sure we want to adopt the 'auto' direction, so I have left it out for now. See http://www.whatwg.org/specs/web-apps/current-work/#the-dir-attribute for a discussion of it.

We need to be able to tell processing applications what base text direction a specific piece of text in an XLIFF document has. In most cases all source texts, or all target texts, within a document have the same base direction. In other cases the direction is set for a specific unit or segment. Since XLIFF files are bilingual by nature, it is natural to provide this separately for the source and target languages on the structural elements and then restrict it to a single value at the <source> and <target> level. Since this also includes changes to the structural parts, we should bring it to the TC once we have agreed on the semantics at the inline level.

<source> inherits its direction from the 'source-dir' of its parent container, and <target> inherits from the 'target-dir'. All inline codes except <mrk> default to Left-To-Right for the native display direction, disp-dir, as most markup is designed as LTR. The native code direction is distinct from the general text direction. This will allow sensible rendering by default of, for example, XML elements embedded in Right-To-Left text. <mrk> is often used for comments / annotations, and as such it makes more sense for it to inherit the direction from the container, but I'm not sure this is the right place to define it. The <pc> and <sc> inline elements inherit the direction of their container element, and that direction is used as the embedding direction for the span. The <ec> ends the embedding started by its corresponding <sc> tag.

Direction = {ltr, rtl}

Attributes 'source-dir' and 'target-dir' added to the following tags with a default value of 'LTR':
<file>

Attributes 'source-dir' and 'target-dir' added to the following tags with a default value inherited from the closest parent container that can hold the attributes:
<unit>
<segment>
<ignorable>

Attribute 'dir' added to the following tag with a default value inherited from the closest parent container that can have the 'source-dir' attribute:
<source>

Attribute 'dir' added to the following tag with a default value inherited from the closest parent container that can have the 'target-dir' attribute:
<target>

Attribute 'dir' added to the following tags with a default value inherited from the closest parent container that can hold the attribute:
<pc>
<sc>

Attribute disp-dir added to the following tags with default 'LTR':
<cp> (not really useful here; it might be better to always have LTR dir, but perhaps for consistency)
<ph>
<pc>
<sc>
<ec>
<mrk> (not sure this is the place; perhaps we should move this to the content defining the text of the marker)

This should allow the direction to be specified as few times as possible, and if nothing is specified both source and target are LTR. If all content in source or target is RTL, the direction only needs to be specified once in the file. Note that <mrk> cannot be used to specify a different direction for a span. In my opinion <mrk> should only be used for annotations and not to influence the back conversion process.
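
To illustrate the inheritance rules above, here is a minimal Python sketch of how a processor might resolve the effective direction of each <source> and <target>. The sample markup is made up for illustration, the function name resolve_dirs is hypothetical, and the sketch ignores namespaces and inline codes. It only assumes the attribute names and defaults proposed in this draft.

```python
import xml.etree.ElementTree as ET

# Made-up fragment using the draft attributes (not a complete XLIFF file).
SAMPLE = """
<file target-dir="rtl">
  <unit id="u1">
    <segment>
      <source>S1</source>
      <target>T1</target>
    </segment>
    <segment source-dir="rtl">
      <source>S2</source>
      <target dir="ltr">T2</target>
    </segment>
  </unit>
  <unit id="u2" target-dir="ltr">
    <segment>
      <source>S3</source>
      <target>T3</target>
    </segment>
  </unit>
</file>
"""

# Structural elements that may carry 'source-dir' / 'target-dir' in the draft.
CONTAINERS = {"file", "unit", "segment", "ignorable"}


def resolve_dirs(elem, src="ltr", trg="ltr", out=None):
    """Walk the tree, carrying the inherited directions downward.

    Returns a list of (element name, text, effective direction) tuples.
    """
    if out is None:
        out = []
    if elem.tag in CONTAINERS:
        # A container overrides the inherited value only if it sets one.
        src = elem.get("source-dir", src)
        trg = elem.get("target-dir", trg)
    elif elem.tag == "source":
        out.append(("source", elem.text, elem.get("dir", src)))
    elif elem.tag == "target":
        out.append(("target", elem.text, elem.get("dir", trg)))
    for child in elem:
        resolve_dirs(child, src, trg, out)
    return out


resolved = resolve_dirs(ET.fromstring(SAMPLE))
for name, text, direction in resolved:
    print(name, text, direction)
```

Here the target direction is stated once on <file>, and only the second segment and the second unit deviate from the inherited values.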

In addition to these attribute based directional markers we should allow the use of Unicode directional characters in the text flow. This is more convenient when a translator is entering text. It will be up to the back-conversion from XLIFF to native format to keep, remove or replace them.

Further work is needed on defining how these attributes map onto the Unicode Bidirectional Algorithm ( http://unicode.org/reports/tr9/ ). I would propose that we treat the <unit> as terminating a paragraph, resetting the embedding state at the <unit> boundary. From a quick study, I think it makes most sense to use the direction set on the <unit> as the default text direction and, if a segment has a direction specified (even if it is the default direction), to start an embedding run. Native codes should be displayed within a push / pop override if their direction differs from the default direction. Spans inside the segment (<pc> or <sc>+<ec>) would use embedding if a direction is specified. I'm not an expert on the Unicode bidirectional algorithm, so my mapping might not be the best or even a proper one. What I am trying to achieve is that nothing special needs to happen in the most common case of LTR-only text, and that a minimum of overrides / embeddings is needed for all-RTL text in either source or target.
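
A rough Python sketch of this mapping, for discussion only. It uses the explicit formatting characters from UAX #9 (LRE/RLE for embedding, LRO/RLO for override, PDF to pop). The tuple representation of a segment's content and the function names are hypothetical, and comparing a native code against the direction in effect (the segment's if set, otherwise the unit's) is just one reading of "default direction" above. Resetting state at the <unit> boundary is not modelled.

```python
LRE, RLE = "\u202a", "\u202b"   # left-to-right / right-to-left embedding
LRO, RLO = "\u202d", "\u202e"   # left-to-right / right-to-left override
PDF = "\u202c"                  # pop directional formatting


def embed(text, direction):
    """Wrap text in an embedding run for the given direction."""
    return (LRE if direction == "ltr" else RLE) + text + PDF


def show_code(code, disp_dir, base_dir):
    """Wrap a native code in an override only when its display direction
    differs from the direction currently in effect."""
    if disp_dir == base_dir:
        return code
    return (LRO if disp_dir == "ltr" else RLO) + code + PDF


def render_segment(parts, unit_dir, segment_dir=None):
    """Build a display string for one segment.

    parts is a list of tuples (hypothetical representation):
      ("text", s)            plain text
      ("code", s, disp_dir)  native code with its display direction
      ("span", s, dir)       <pc> or <sc>/<ec> content; dir may be None
    """
    base = segment_dir or unit_dir
    out = []
    for part in parts:
        kind = part[0]
        if kind == "text":
            out.append(part[1])
        elif kind == "code":
            out.append(show_code(part[1], part[2], base))
        elif kind == "span":
            # Spans only start an embedding if a direction is specified.
            out.append(embed(part[1], part[2]) if part[2] else part[1])
    body = "".join(out)
    # A segment with an explicit direction starts an embedding run,
    # even when that direction equals the unit default.
    return embed(body, segment_dir) if segment_dir else body


plain = render_segment(
    [("text", "Hello "), ("code", "<b>", "ltr"), ("text", "world")], "ltr")
print(repr(plain))
```

In the all-LTR case no control characters are emitted at all, which is the goal stated above.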

There are two issues left to resolve in my opinion: how the <mrk> should work, and how to add a directional span to the target that does not exist in the source. For example, when a product name / trademark is copied from an LTR source into an RTL target, it may need to be protected by a span if it starts or ends with directionally neutral characters. One option would be to adopt the <bdo> element from HTML5. Another would be to simply rely on Unicode characters for this.
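
As an illustration of the second option (relying on Unicode characters), the following Python sketch wraps a term in an embedding when it starts or ends with a character that has no strong direction. The function names are hypothetical, and treating only the bidi classes L, R and AL as "strong" is my assumption based on UAX #9. A real tool might prefer a different set of controls.

```python
import unicodedata

LRE, RLE, PDF = "\u202a", "\u202b", "\u202c"
STRONG = {"L", "R", "AL"}  # bidi classes with a strong direction


def needs_protection(term):
    """True if the first or last character has no strong direction."""
    return (unicodedata.bidirectional(term[0]) not in STRONG
            or unicodedata.bidirectional(term[-1]) not in STRONG)


def protect(term, term_dir="ltr"):
    """Wrap the term in an embedding of its own direction when needed."""
    if not term or not needs_protection(term):
        return term
    return (LRE if term_dir == "ltr" else RLE) + term + PDF


print(repr(protect("Acme")))  # starts and ends with strong characters
print(repr(protect("C++")))   # ends with '+', which is not strong
```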

---------------------------------------------------
Following discussion:

-----Original Message-----
From: xliff-inline@lists.oasis-open.org [mailto:xliff-inline@lists.oasis-open.org] On Behalf Of Estreen, Fredrik
Sent: den 25 april 2012 10:53
To: Yves Savourel; xliff-inline@lists.oasis-open.org
Subject: RE: [xliff-inline] BiDi draft for discussion

Hi Yves,

Thanks for the comments and feedback. Some answers and discussion below.

> -----Original Message-----
> From: xliff-inline@lists.oasis-open.org
> [mailto:xliff-inline@lists.oasis-open.org] On Behalf Of Yves Savourel
> Sent: den 24 april 2012 17:29
> To: xliff-inline@lists.oasis-open.org
> Subject: RE: [xliff-inline] BiDi draft for discussion
> 
> Hi Fredrik,
> 
> Many thanks for the thorough work you've done with this.
> 
> A few questions/notes:
> 
> 
> -- The source/target-dir attributes on <file>, <unit>, <segment>,
> <ignorable> sound ok. This forces the processor to keep track of
> inheritance. But I suppose this is fine.
> 
> A thought: What if we had no attributes until <unit> and the default
> there would be based on the languages? So you wouldn't even have to
> set anything except if the base direction was different from the default one?
> 
The direction is not really a property of the language but rather of the script used. In most cases a language only uses one script, but the script has often changed over time, and some languages are even written in several different scripts. Another consideration was that I'm not aware of any canonical, standard list mapping languages to scripts and directions that we could reference, and it did not seem right that we should maintain one as part of XLIFF.

> 
> -- We'll have to define some processing expectation for the result of 
> a join of
> segments/ignorables: For example, if one segment is LTR and the next 
> RTL how do we carry that in the joined content?
> 
> 
> -- No 'auto' value? This seems to have been added recently to dir. You 
> think we don't need it?
>
To me, 'auto' seems to have been added to cover cases where the HTML markup does not know what text it will eventually contain, such as an input box or dynamically added content. For XLIFF extraction / tagging we do not have dynamic content, and it felt more appropriate that the extractor applies the algorithm used by 'auto' and sets the dir to either ltr or rtl as appropriate. When editing, the editor needs to understand BiDi anyway and can thus manage the direction itself too. If we are translating from one direction to another, the editors / processors need to understand the basics of directionality anyway. So adding 'auto' just seemed to put an unnecessary complication in the standard.
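
As a rough sketch of what the extractor would do, here is a simplified first-strong-character heuristic in Python. It is my approximation of the 'auto' behaviour, not the full HTML5 algorithm (which has extra rules, for example about which descendant elements are skipped). The function name and the fallback to 'ltr' are assumptions.

```python
import unicodedata


def first_strong_dir(text, default="ltr"):
    """Return 'ltr' or 'rtl' based on the first strongly directional
    character in text, or the default if there is none."""
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi == "L":
            return "ltr"
        if bidi in ("R", "AL"):
            return "rtl"
    return default


print(first_strong_dir("Teacher"))
print(first_strong_dir("\u0645\u0627 \u0627\u0633\u0645\u0643\u061f"))
```

The extractor would then write the result into the dir attribute instead of carrying 'auto' into the XLIFF file.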
 
> Related to this: <bdo> or Unicode controls are equivalent, but what 
> about the new <bdi> in HTML5? I don't think it has a Unicode control 
> equivalent. I'm not sure but it looks like <bdi> would be equivalent 
> to <bdo dir='auto'> (which you can't have because <bdo> requires dir 
> to be set to either rtl or ltr).
>
The <bdi> element has quite unusual rendering implications. It starts a fresh paragraph and applies the Unicode BiDi algorithm from scratch; the result is then embedded like an image/object in the source run. As intended in the HTML spec, it is mainly for dynamic text with unknown directionality. Adding this would mean that it is no longer possible to use just a simple, BiDi-aware Unicode text control to render a sequence of Unicode characters. To implement <bdi>-style semantics, the application would need to do higher-level rendering manipulation itself.
 
> So if you have an original code like this:
> 
> <p dir=auto class="u2"><b><bdi>Teacher</bdi>:</b> ما اسمك؟</p>
> 
> I assume we could represent it like this:
> 
> <unit id='1' source-dir='auto'>
>  <segment>
>   <source><pc id='1'><pc id='2' dir='auto'>Teacher</pc>:</pc> ما اسمك؟</source>
>  </segment>
> </unit>
> 
> No?
I think the HTML-to-XLIFF extractor should apply the 'auto' algorithm and choose 'ltr' or 'rtl' for the spans. On back conversion the example would keep 'auto' in the HTML. When translating, the extractor and back converter would need to handle the HTML 'dir' attribute and change it according to the requirements of the translation anyway.

> 
> --- I'm still fuzzy on disp-dir. Is that just a way to specify the
> directionality for the original data (regardless of where they are stored)?
> 
Yes, I added it so that we would be able to support displaying RTL markup regardless of the text flow direction. Although I have not seen any yet, there will be files whose markup vocabulary is in an RTL language, like:

<?xml version="1.1" encoding="UTF-8"?>
<جذر>Sample <كبير>test</كبير> doc</جذر>

So adding support to define the base direction for display of the native content seemed like a useful feature.
> 
> Thanks
> -yves
> 

Regards,
Fredrik Estreen

