Subject: RE: [xliff] Simplified XLIFF element tree
From: Yves Savourel <ysavourel@translate.com>
To: "'xliff'" <xliff@lists.oasis-open.org>
Date: Wed, 1 Sep 2010 22:47:30 -0600
I've tried to take a step back and look at this from a more general view.
Here are my two cents:


=== First, the issue of storing the un-segmented content:

There seem to be two models:

A) storing the segmented content as a copy of the original source.

B) marking up the source with specific elements for segmentation.

Solution A requires duplicating the source, which brings the danger of discrepancies between the original and the segmented source (what do we do in those cases?). It also seems a waste of space: no matter what tool is used, large files always end up being a problem at some point. We should try to avoid making it worse.

Note that having a separate original-source would be basically the reverse of <seg-source>: If <seg-source> was a problem, it's likely that just reversing the roles won't fix much...

I think Asgeir's notion of having a unique source content with clear markup associated with clear processing expectations is a good starting point. Note that segmentation is only one of the possible annotations we can have on the source; others could be comments, term identification, etc.

To me all this indicates that a single source with different layers is a promising possibility. There is no reason it could not work well: after all, other systems like xml:tm use similar mechanisms (layers on top of a single content) and it seems to be fine :)


=== Languages and parts

With that single source in mind, if we step back and look at what we have we get this:
- We have extracted units.
- Each one has a source and a target,
- and each source and target can be made of one or more parts.

I think we have two possible main models to represent this (with different variations): We can group by part or by language.
I'm using meaningless element names to try to abstract the representations. In both models:

<aaa> contains the unit/entry (basically what a trans-unit is in 1.2). The content of that entry may be broken down into one or more parts.

<bbb> contains the text for each language.

<ccc> contains the text for each part.


=== A) Grouped by languages

--- A.1) Grouped by language: un-translated, only one part (for example un-segmented):

<aaa id='id1'>
 <bbb xml:lang='en'>
  <ccc id='1'>Sentence one. Sentence two.</ccc>
 </bbb>
</aaa>


--- A.2) Grouped by language: un-translated, but with several parts:

<aaa id='id1'>
 <bbb xml:lang='en'>
  <ccc id='1'>Sentence one. </ccc>
  <ccc id='2'>Sentence two.</ccc>
 </bbb>
</aaa>


--- A.3) Grouped by language: translated and with several parts; the same parts are linked across languages through their ids:

<aaa id='id1'>
 <bbb xml:lang='en'>
  <ccc id='1'>Sentence one. </ccc>
  <ccc id='2'>Sentence two.</ccc>
 </bbb>
 <bbb xml:lang='fr'>
  <ccc id='1'>Phrase un. </ccc>
  <ccc id='2'>Phrase deux.</ccc>
 </bbb>
</aaa>



=== B) Now grouped by parts:

--- B.1) Grouped by part, un-translated, only one part (for example un-segmented):

<aaa id='id1'>
 <ccc>
  <bbb xml:lang='en'>Sentence one. Sentence two.</bbb>
 </ccc>
</aaa>


--- B.2) Grouped by part, un-translated but with several parts:

<aaa id='id1'>
 <ccc>
  <bbb xml:lang='en'>Sentence one. </bbb>
 </ccc>
 <ccc>
  <bbb xml:lang='en'>Sentence two.</bbb>
 </ccc>
</aaa>


--- B.3) Grouped by part, translated and with several parts: you don't need ids to link the parts since each <ccc> contains all languages for that part.

<aaa id='id1'>
 <ccc>
  <bbb xml:lang='en'>Sentence one. </bbb>
  <bbb xml:lang='fr'>Phrase un. </bbb>
 </ccc>
 <ccc>
  <bbb xml:lang='en'>Sentence two.</bbb>
  <bbb xml:lang='fr'>Phrase deux.</bbb>
 </ccc>
</aaa>


=== Now some actions and the pros and cons:

Obviously this is very abstract, as in most cases the tools would read a single <aaa> element and hold it in memory in their own structure, so they could access any part of <aaa> easily. The comparison matters more for XSL access or for very simple parsers that rely on the XLIFF structure itself to perform their actions.

--- Associating parts with various flags (for example state of translation):

No difference between the two solutions: We just have the relevant attributes in <ccc>, or associate <ccc> with other property-like elements using reference ids.


--- Accessing the whole content for one language (for example the un-segmented source):

A) looks easier because the selection of <bbb> gives the whole content immediately. If the content is broken into several parts we need to remove those tags, but it's a simple operation.

B) looks more difficult because you have to skip over other <bbb> elements to get the ones you want. This is not an issue with a DOM processor, but it's a lot more complex than A for a stream-based processor.
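
To make the difference concrete, here is a minimal Python sketch using the standard xml.etree.ElementTree module. It pulls the whole English content out of the A.3 and B.3 examples above. The function names whole_text_a and whole_text_b are hypothetical, and it assumes parts hold plain text only (no inline codes). It uses a DOM-like tree, so it only illustrates the extra skipping B needs, not the stream-based cost.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

MODEL_A = """<aaa id='id1'>
 <bbb xml:lang='en'>
  <ccc id='1'>Sentence one. </ccc>
  <ccc id='2'>Sentence two.</ccc>
 </bbb>
 <bbb xml:lang='fr'>
  <ccc id='1'>Phrase un. </ccc>
  <ccc id='2'>Phrase deux.</ccc>
 </bbb>
</aaa>"""

MODEL_B = """<aaa id='id1'>
 <ccc>
  <bbb xml:lang='en'>Sentence one. </bbb>
  <bbb xml:lang='fr'>Phrase un. </bbb>
 </ccc>
 <ccc>
  <bbb xml:lang='en'>Sentence two.</bbb>
  <bbb xml:lang='fr'>Phrase deux.</bbb>
 </ccc>
</aaa>"""


def whole_text_a(unit, lang):
    # Model A: one <bbb> holds the language; drop the <ccc> tags and join.
    for bbb in unit.findall("bbb"):
        if bbb.get(XML_LANG) == lang:
            return "".join(ccc.text or "" for ccc in bbb.findall("ccc"))
    return None


def whole_text_b(unit, lang):
    # Model B: visit every <ccc> and skip the <bbb> of the other languages.
    pieces = []
    for ccc in unit.findall("ccc"):
        for bbb in ccc.findall("bbb"):
            if bbb.get(XML_LANG) == lang:
                pieces.append(bbb.text or "")
    return "".join(pieces)


text_a = whole_text_a(ET.fromstring(MODEL_A), "en")
text_b = whole_text_b(ET.fromstring(MODEL_B), "en")
```

Both calls should give the same un-segmented string; the difference is only in how much of the entry has to be walked to get it.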


--- Segmenting a content:

A) Is easy: we just create <ccc> elements as needed inside the <bbb>.

B) Is almost as simple: each new part needs a <ccc> with a <bbb> inside it, so we create two elements instead of one, no problem.


--- Un-segmenting a content:

A) We just remove the <ccc> and keep only one for the whole content.

B) Is almost as easy: just a few more manipulations, but nothing very complicated.
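
As a rough illustration of segmenting and un-segmenting in model A, here is a small Python sketch with xml.etree.ElementTree. The helper names segment_a and unsegment_a are hypothetical. It assumes the segment boundaries come from some external segmenter, that parts contain plain text only, and that renumbering the ids from 1 is acceptable.

```python
import xml.etree.ElementTree as ET


def segment_a(bbb, pieces):
    """Replace the parts of a model-A <bbb> with one <ccc> per piece."""
    for old in list(bbb):
        bbb.remove(old)
    for number, piece in enumerate(pieces, start=1):
        # Assumption: ids are simply renumbered 1, 2, 3...
        ccc = ET.SubElement(bbb, "ccc", {"id": str(number)})
        ccc.text = piece


def unsegment_a(bbb):
    """Collapse all <ccc> parts of a model-A <bbb> into a single one."""
    whole = "".join(ccc.text or "" for ccc in bbb.findall("ccc"))
    segment_a(bbb, [whole])


unit = ET.fromstring(
    "<aaa id='id1'><bbb xml:lang='en'>"
    "<ccc id='1'>Sentence one. Sentence two.</ccc>"
    "</bbb></aaa>"
)
source = unit.find("bbb")

# The pieces would normally come from a segmenter; here they are given.
segment_a(source, ["Sentence one. ", "Sentence two."])
```

Calling unsegment_a(source) afterwards brings the entry back to a single <ccc>. In model B the same operations would also have to create or remove the <bbb> inside each <ccc>, for every language present.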


--- Accessing source/target pairs (for example to feed a TM):

B) looks obviously easier because the content is already grouped by pairs.

A) is a bit more difficult because you have to pair the two parts using their ids. Note that it's not too bad because you do have access to all the parts in one call in <bbb>.
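
Here is a minimal Python sketch of that pairing for both models, using xml.etree.ElementTree. The names pairs_a and pairs_b are hypothetical, the sample data are the A.3 and B.3 entries written on one line, and the 1-to-1 case with plain-text parts is assumed.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

MODEL_A = ET.fromstring(
    "<aaa id='id1'>"
    "<bbb xml:lang='en'><ccc id='1'>Sentence one. </ccc>"
    "<ccc id='2'>Sentence two.</ccc></bbb>"
    "<bbb xml:lang='fr'><ccc id='1'>Phrase un. </ccc>"
    "<ccc id='2'>Phrase deux.</ccc></bbb>"
    "</aaa>"
)

MODEL_B = ET.fromstring(
    "<aaa id='id1'>"
    "<ccc><bbb xml:lang='en'>Sentence one. </bbb>"
    "<bbb xml:lang='fr'>Phrase un. </bbb></ccc>"
    "<ccc><bbb xml:lang='en'>Sentence two.</bbb>"
    "<bbb xml:lang='fr'>Phrase deux.</bbb></ccc>"
    "</aaa>"
)


def pairs_a(unit, src, trg):
    # Model A: index the target parts by id, then look each source part up.
    by_lang = {bbb.get(XML_LANG): bbb for bbb in unit.findall("bbb")}
    targets = {c.get("id"): c.text for c in by_lang[trg].findall("ccc")}
    return [(c.text, targets.get(c.get("id")))
            for c in by_lang[src].findall("ccc")]


def pairs_b(unit, src, trg):
    # Model B: each <ccc> already holds the pair; no id lookup is needed.
    result = []
    for ccc in unit.findall("ccc"):
        texts = {bbb.get(XML_LANG): bbb.text for bbb in ccc.findall("bbb")}
        result.append((texts.get(src), texts.get(trg)))
    return result


tm_pairs_a = pairs_a(MODEL_A, "en", "fr")
tm_pairs_b = pairs_b(MODEL_B, "en", "fr")
```

The extra step in A is the id index over the target <bbb>, which is one pass over a single element.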


--- Alignment and parts manipulation (for example "Sentence 1" is translated by "Phrase 2"):

A) looks easier because the links by id do not force the parts to be in the same order in each <bbb>.

B) is actually much more troublesome for this: it cannot represent n-to-m alignments without empty <bbb> elements, and it cannot represent a first source sentence translated by the last target sentence without either changing the order of the <ccc> in the source or the target, or adding some extra attributes to indicate how things are linked.
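
A small Python sketch of the reordering case in model A, using xml.etree.ElementTree. The sample entry and the function name align_by_id are hypothetical; the sketch assumes the first <bbb> is the source and the second the target, and that linked parts share the same id.

```python
import xml.etree.ElementTree as ET

# Hypothetical entry: the target parts appear in the reverse order.
REORDERED_A = ET.fromstring(
    "<aaa id='id1'>"
    "<bbb xml:lang='en'><ccc id='1'>Sentence one. </ccc>"
    "<ccc id='2'>Sentence two.</ccc></bbb>"
    "<bbb xml:lang='fr'><ccc id='2'>Phrase deux. </ccc>"
    "<ccc id='1'>Phrase un.</ccc></bbb>"
    "</aaa>"
)


def align_by_id(unit):
    # Assumption: first <bbb> is the source, second is the target.
    source, target = unit.findall("bbb")
    target_by_id = {c.get("id"): c.text for c in target.findall("ccc")}
    return [(c.get("id"), c.text, target_by_id.get(c.get("id")))
            for c in source.findall("ccc")]


def part_order(bbb):
    # Document order of the parts inside one language.
    return [c.get("id") for c in bbb.findall("ccc")]


alignment = align_by_id(REORDERED_A)
target_order = part_order(REORDERED_A.findall("bbb")[1])
```

The alignment is recovered through the ids while each language keeps its own reading order. An n-to-m link would need something more than equal ids (for example a list of ids), which this sketch does not try to define.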


--- Merging (for example merging translated text into the skeleton):

A) has a slight advantage because we can access the whole translation in one call to a <bbb>.

B) requires re-constructing the translation from the different <ccc>.


--- Interstices: This will require us to make a choice on what exactly is inside a part: for example, does a segment contain the whitespace (and possibly inline codes or text) between segments or not? In other words, is a content like:

This [<ccc>Sentence one.</ccc> <ccc>Sentence two.</ccc>]

Or [<ccc>Sentence one. </ccc><ccc>Sentence two.</ccc>]

Or [<ccc>Sentence one.</ccc><ccc> Sentence two.</ccc>]

Or [<ccc>Sentence one.</ccc><ccc> </ccc><ccc>Sentence two.</ccc>]

I'm not making a judgment on which way is best, just saying that we will need to decide which representation must be used in XLIFF. And obviously, tools do not have to represent their own segments the way XLIFF will.

If content is allowed outside <ccc>, then this is much more awkward to represent with model B and potentially very verbose. A is more flexible for any of those representations.
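
For the first representation (whitespace left between the <ccc> elements), a minimal Python sketch with xml.etree.ElementTree shows how model A could be read. The function names are hypothetical, and the sketch assumes plain text only, with the interstice stored as the text that follows each <ccc>.

```python
import xml.etree.ElementTree as ET

# Model A, first representation: the space sits between the two parts.
bbb = ET.fromstring(
    "<bbb xml:lang='en'>"
    "<ccc>Sentence one.</ccc> <ccc>Sentence two.</ccc>"
    "</bbb>"
)


def whole_text(element):
    # All text in document order: parts and interstices together.
    return "".join(element.itertext())


def segments(element):
    # Only the text inside the <ccc> parts.
    return [ccc.text or "" for ccc in element.findall("ccc")]


def interstices(element):
    # Text found right after each <ccc> (the "tail" in ElementTree terms).
    return [ccc.tail or "" for ccc in element.findall("ccc")]


full = whole_text(bbb)
parts = segments(bbb)
gaps = interstices(bbb)
```

In model B there is no single <bbb> per language that could hold such in-between text, so each interstice would need its own <ccc>/<bbb> wrapper or some extra attribute.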


--- Overall both models have strengths and weaknesses.

In both cases:
- We don't force pre-segmentation.
- We don't forbid pre-segmentation.
- We have only one source content.
- We can attach info to the entry, the language, and the parts.

It seems to me that A is a slightly better representation because it allows more flexibility and an easier manipulation of the parts (especially n-to-m). Model B seems to have an advantage in not requiring ids to map the source and target parts, but in practice it's very likely that tools will have ids for those constructs anyway. B has an easier way to associate pairs of source/target parts when they are 1-to-1, but is more complicated for n-to-m.


Cheers,
-ys