
Subject: Generic mechanism for translation candidate elements and other annotations

From: Yves Savourel <ysavourel@enlaso.com>
To: <xliff@lists.oasis-open.org>
Date: Tue, 6 Mar 2012 20:06:44 -0700

Hi everyone,

The current proposal for translation candidates calls for a fairly simple structure: a <matches> element that holds a list of <match> elements and can be associated with a <segment> or a <unit>.

However, in light of the following two issues, I'm not sure this is enough:


a) The first issue is that segments can be re-segmented.

This means some <match> elements may become invalid after re-segmentation. They would have to be removed, flagged in some way, or have their score modified to indicate that they no longer correspond to the segment they are attached to.

For example, initial entry:

<unit id="1">
 <segment>
  <source>Some text: and more</source>
  <matches>
   <match>
    <source>Some text: and more</source>
    <target>Du texte : et plus</target>
   </match>
  </matches>
 </segment>
</unit>

Then after re-segmentation, we'll have to decide what to do with the match:

<unit id="1">
 <segment>
  <source>Some text: </source>
  <matches>
   <match> <!-- Not a good match any more -->
    <source>Some text: and more</source>
    <target>Du texte : et plus</target>
   </match>
  </matches>
 </segment>
 <segment>
  <source>and more</source>
 </segment>
</unit>


b) The second issue is that, nowadays, translation candidates are not just for segments.

More and more tools provide phrase-level matches (sub-sentence matches). The current mechanism does not handle such cases.


But more importantly, in addition to these two issues, the case of <match> is an illustration of a more general challenge that XLIFF 2.0 needs to tackle in a consistent way: annotations.

More and more processes are able to 'enrich' the extracted document with information pertaining to a span of the content (which may or may not correspond to a segment). Translation candidates are just one case among many. Attaching matches to source content is not very different from associating QA errors with a chunk of text, or labeling a phrase with a translator comment, etc.

I think we need a common pattern to implement such features. This may also allow us to have common processing expectations, and to address at the core level potential problems with modules/extensions that follow the same pattern.


To go back to our <match> example: One possible way to solve this could be to link a <match> not to a <segment> or a <unit> but to an <mrk>.

<unit id="1">
 <segment>
  <source><mrk id='1' type='match' ref='m1'>Some text: and more</mrk></source>
 </segment>
 <matches>
  <match id='m1'>
   <source>Some text: and more</source>
   <target>Du texte : et plus</target>
  </match>
 </matches>
</unit>

After re-segmentation the match is still valid.

<unit id="1">
 <segment>
  <source><sm id='1' type='match' ref='m1'/>Some text: </source>
 </segment>
 <segment>
  <source>and more<em rid='1'/></source>
 </segment>
 <matches>
  <match id='m1'>
   <source>Some text: and more</source>
   <target>Du texte : et plus</target>
  </match>
 </matches>
</unit>

[Note: <sm/> and <em/> are just a way to represent a broken <mrk> (like <sc/> and <ec/> for <pc>). The Inline SC has not completely worked out how to represent this, but the bottom line is that we'll have some representation of non-well-formed <mrk>.]
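
To make the "still valid" claim concrete, here is a minimal Python sketch (standard library only) that rebuilds the text covered by a split annotation and compares it with the source of the linked <match>. It is only an illustration, not part of any proposal: the element and attribute names are the provisional ones from the examples above, the function name annotated_text is hypothetical, and it assumes the <source> content is flat (text plus empty <sm/>/<em/> markers, no nested inline elements).

```python
import xml.etree.ElementTree as ET

# The re-segmented unit from the example above (provisional syntax).
UNIT_XML = """<unit id="1">
 <segment>
  <source><sm id="1" type="match" ref="m1"/>Some text: </source>
 </segment>
 <segment>
  <source>and more<em rid="1"/></source>
 </segment>
 <matches>
  <match id="m1">
   <source>Some text: and more</source>
   <target>Du texte : et plus</target>
  </match>
 </matches>
</unit>"""


def annotated_text(unit, marker_id):
    """Return the source text between <sm id=marker_id/> and its <em/>.

    Walks the <source> of each <segment> in document order, so the span
    may start in one segment and end in another. Returns None when the
    opening or closing marker is missing.
    """
    inside = False
    parts = []
    for source in unit.findall("segment/source"):
        # Text before the first child marker of this <source>.
        if inside and source.text:
            parts.append(source.text)
        for child in source:
            if child.tag == "sm" and child.get("id") == marker_id:
                inside = True
            elif inside and child.tag == "em" and child.get("rid") == marker_id:
                return "".join(parts)
            # Text that follows this marker, up to the next one.
            if inside and child.tail:
                parts.append(child.tail)
    return None


unit = ET.fromstring(UNIT_XML)
match_source = unit.find("matches/match[@id='m1']/source").text
span_text = annotated_text(unit, "1")
print(span_text == match_source)
```

Because the comparison is made against the annotated span rather than against a single segment, it gives the same answer before and after re-segmentation.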


The drawback of using <mrk> for <match> is obviously the added complexity when the span associated with the translation candidate corresponds to an entire segment (which is the case most of the time).

I suppose we could imagine some well-defined 'shortcut' way to declare that a <mrk> that spans the full content of a <segment> is linked to its <match> in some implicit way and can be omitted.

This could look like <match id='m1' segment='seg1'>...</match> when the match m1 is associated with the entire content of the segment seg1. For example:

<unit id="1">
 <segment id='seg1'>
  <source>Some text: and more</source>
 </segment>
 <matches>
  <match id='m1' segment='seg1'>
   <source>Some text: and more</source>
   <target>Du texte : et plus</target>
  </match>
 </matches>
</unit>

Such a shortcut could be used for all similar annotations.

Actually, another way to look at it is to say that <match> can apply to its <unit>, a <segment> or a <mrk> using some composite notation like this:

<match id='m1' scope='unit'>...</match>
<match id='m2' scope='segment:seg1'>...</match>
<match id='m3' scope='mrk:id1'>...</match>

Whatever the notation, the idea is to make <match> follow a pattern that we can re-use with other features. This should also simplify the implementation: a tool that supports such a representation for <match> would already have most of the code it needs to support similar annotation features.
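
As a rough illustration of that last point, a tool could resolve the composite scope value with one small routine and share it between <match> and any other annotation that follows the pattern. The Python sketch below is only that, a sketch: the scope attribute and its 'unit', 'segment:ID' and 'mrk:ID' values are the tentative notation from this message, the function names parse_scope and resolve_scope are hypothetical, and it assumes the <segment> elements are direct children of the <unit>.

```python
import xml.etree.ElementTree as ET

SCOPE_KINDS = ("unit", "segment", "mrk")


def parse_scope(scope):
    """Split a scope value such as 'segment:seg1' into (kind, target_id)."""
    kind, _, target_id = scope.partition(":")
    if kind not in SCOPE_KINDS:
        raise ValueError(f"unknown scope kind: {kind!r}")
    if kind == "unit":
        if target_id:
            raise ValueError("scope 'unit' takes no id")
        return kind, None
    if not target_id:
        raise ValueError(f"scope {kind!r} requires an id")
    return kind, target_id


def resolve_scope(unit, scope):
    """Return the element of `unit` that `scope` points to, or None."""
    kind, target_id = parse_scope(scope)
    if kind == "unit":
        return unit
    segments = unit.findall("segment")
    if kind == "segment":
        candidates = segments
    else:
        # Only look for <mrk> inside segments, not inside the matches.
        candidates = [m for seg in segments for m in seg.iter("mrk")]
    for elem in candidates:
        if elem.get("id") == target_id:
            return elem
    return None


unit = ET.fromstring(
    '<unit id="1">'
    '<segment id="seg1">'
    '<source><mrk id="id1" type="match">Some text</mrk>: and more</source>'
    '</segment>'
    '</unit>'
)
print(resolve_scope(unit, "unit").tag)
print(resolve_scope(unit, "segment:seg1").get("id"))
print(resolve_scope(unit, "mrk:id1").text)
```

Nothing in resolve_scope is specific to <match>; a QA-error or comment annotation carrying the same attribute could call it unchanged.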

Cheers,
-yves

Follow-Ups:
- RE: [xliff] Generic mechanism for translation candidate elements and other annotations
  - From: "Lieske, Christian" <christian.lieske@sap.com>
- RE: [xliff] Generic mechanism for translation candidate elements and other annotations
  - From: "Rodolfo M. Raya" <rmraya@maxprograms.com>
- RE: [xliff] Generic mechanism for translation candidate elements and other annotations
  - From: "Estreen, Fredrik" <Fredrik.Estreen@lionbridge.com>