xliff message

Subject: RE: [xliff] Generic mechanism for translation candidate elements and other annotations
From: "Rodolfo M. Raya" <rmraya@maxprograms.com>
To: <xliff@lists.oasis-open.org>
Date: Wed, 7 Mar 2012 08:19:43 -0200

Hi Yves,

The effect of re-segmentation over matches is not new. This time we have to add processing expectations that require updating matches according to the changes.

Adding matches to a segment should not alter source text. Putting <mrk> elements in <source> to indicate the section that has matches is really ugly. The complexity added to a <source> element that has multiple overlapping sub-segment matches would be annoying.

There may be a need to know what section of <source> is being matched and the relevant information should live in the corresponding <match> element, keeping the original <source> clean. This can be done, for example, by using 2 attributes: one attribute indicates the offset where the match starts and the other indicates the length of the text matched (in both cases ignoring tags).

For example:

<segment>
  <source>white, red, green, yellow, blue, black</source>
  <matches>
    <match mstart="7" mlength="3">
      <source>red</source>
      <target>rojo</target>
    </match>
    <match mstart="7" mlength="10">
      <source>red and green</source>
      <target>rojo y verde</target>
    </match>
    <match mstart="12" mlength="13">
      <source>green and yellow</source>
      <target>verde y amarillo</target>
    </match>
  </matches>
</segment>

The example above shows sub-segment matches that are overlapping and provide information on the regions that are matched, keeping source text clean.
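
A minimal sketch of how a tool might resolve those regions, assuming the offsets are 0-based (which is consistent with "red" starting at 7 in the example) and that they count plain-text characters only. The function name `matched_regions` is hypothetical, and `mstart`/`mlength` are only the attribute names proposed in this message, not part of any published schema.

```python
import xml.etree.ElementTree as ET

# The sample from this message, reduced to the attributes under discussion.
SEGMENT = """<segment>
  <source>white, red, green, yellow, blue, black</source>
  <matches>
    <match mstart="7" mlength="3">
      <source>red</source><target>rojo</target>
    </match>
    <match mstart="7" mlength="10">
      <source>red and green</source><target>rojo y verde</target>
    </match>
    <match mstart="12" mlength="13">
      <source>green and yellow</source><target>verde y amarillo</target>
    </match>
  </matches>
</segment>"""


def matched_regions(segment_xml):
    """Return (matched source fragment, candidate target) for each <match>.

    Assumption: offsets are 0-based and ignore inline tags.
    """
    segment = ET.fromstring(segment_xml)
    # find() only looks at direct children, so this is the segment's own
    # <source>; itertext() joins its text while skipping any inline tags.
    plain = "".join(segment.find("source").itertext())
    regions = []
    for match in segment.iter("match"):
        start = int(match.get("mstart"))
        length = int(match.get("mlength"))
        regions.append((plain[start:start + length], match.findtext("target")))
    return regions


if __name__ == "__main__":
    for fragment, target in matched_regions(SEGMENT):
        print(repr(fragment), "->", repr(target))
```

The segment's <source> is read but never modified, so overlapping matches need no extra markup in it.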

If the <segment> is partitioned, the tool doing so would have to adjust the starting point of the match and the length of the fragment being matched if necessary.
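
As a rough sketch of that adjustment, assuming 0-based plain-text offsets and assuming that a match straddling the split point is clipped into one piece per new segment (this is only one possible policy; a tool could instead drop or flag such matches). The name `partition_spans` is hypothetical.

```python
def partition_spans(spans, split_at):
    """Re-anchor (mstart, mlength) pairs after splitting a segment.

    spans    -- list of (mstart, mlength) relative to the original segment
    split_at -- plain-text offset where the second segment begins
    Returns (left, right): spans relative to each new segment.
    """
    left, right = [], []
    for start, length in spans:
        end = start + length
        if start < split_at:
            # Part (or all) of the region stays in the first segment.
            left.append((start, min(end, split_at) - start))
        if end > split_at:
            # Part (or all) of the region falls in the second segment;
            # its offset is now counted from the start of that segment.
            new_start = max(start, split_at)
            right.append((new_start - split_at, end - new_start))
    return left, right


if __name__ == "__main__":
    # Splitting "white, red, green, " | "yellow, blue, black" at offset 19.
    spans = [(7, 3), (7, 10), (12, 13)]
    print(partition_spans(spans, 19))
```

With the sample above, the first two matches keep their offsets in the first segment, while the "green, yellow" region is clipped to "green, " in the first segment and "yellow" at offset 0 of the second.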


Regards,
Rodolfo
--
Rodolfo M. Raya       rmraya@maxprograms.com
Maxprograms       http://www.maxprograms.com

> -----Original Message-----
> From: xliff@lists.oasis-open.org [mailto:xliff@lists.oasis-open.org] On Behalf
> Of Yves Savourel
> Sent: Wednesday, March 07, 2012 1:07 AM
> To: xliff@lists.oasis-open.org
> Subject: [xliff] Generic mechanism for translation candidate elements and
> other annotations
> 
> Hi everyone,
> 
> The current proposal for translation candidates calls for a fairly simple
> structure where a <matches> element holds a list of <match> elements and
> can be associated with a <segment> or a <unit>.
> 
> However, in light of the following two issues, I'm not sure if this is
> enough:
> 
> 
> a) The first issue is that segments can be re-segmented.
> 
> This means some <match> may become invalid and should be removed or
> somehow flagged or their score modified to indicate they don't correspond
> any more to the segment they are attached to.
> 
> For example, initial entry:
> 
> <unit id="1">
>  <segment>
>   <source>Some text: and more</source>
>   <matches>
>    <match>
>     <source>Some text: and more</source>
>     <target>Du texte : et plus</target>
>    </match>
>   </matches>
>  </segment>
> </unit>
> 
> Then after re-segmentation, we'll have to decide what to do with the match:
> 
> <unit id="1">
>  <segment>
>   <source>Some text: </source>
>   <matches>
>    <match> <!-- Not a good match any more -->
>     <source>Some text: and more</source>
>     <target>Du texte : et plus</target>
>    </match>
>   </matches>
>  </segment>
>  <segment>
>   <source>and more</source>
>  </segment>
> </unit>
> 
> 
> b) The second issue is that, nowadays, translation candidates are not just for
> segments.
> 
> More and more tools provide phrase-level matches (sub-sentence matches).
> The current mechanism does not handle such cases.
> 
> 
> But more importantly, in addition to these two issues, the case of <match> is
> an illustration of a more general challenge that XLIFF 2.0 needs to tackle in a
> consistent way: annotations.
> 
> More and more processes are able to 'enrich' the extracted document
> with information pertaining to a span of the content (which may or may not
> correspond to a segment). Translation candidates are just one case among
> many. Attaching matches to a source content is not very different from
> associating QA errors to a chunk of text, or labeling a phrase with a translator
> comment, etc.
> 
> I think we need to have a common pattern to implement such features. This
> may allow us to have also common processing expectations and address at
> the core level potential problems with modules/extensions that follow the
> same pattern.
> 
> 
> To go back to our <match> example: One possible way to solve this could be
> to link a <match> not to a <segment> or a <unit> but to an <mrk>.
> 
> <unit id="1">
>  <segment>
>   <source><mrk id='1' type='match' ref='m1'>Some text: and more</mrk></source>
>  </segment>
>  <matches>
>   <match id='m1'>
>    <source>Some text: and more</source>
>    <target>Du texte : et plus</target>
>   </match>
>  </matches>
> </unit>
> 
> After re-segmentation the match is still valid.
> 
> <unit id="1">
>  <segment>
>   <source><sm id='1' type='match' ref='m1'/>Some text: </source>
>  </segment>
>  <segment>
>   <source>and more<em rid='1'/></source>
>  </segment>
>  <matches>
>   <match id='m1'>
>    <source>Some text: and more</source>
>    <target>Du texte : et plus</target>
>   </match>
>  </matches>
> </unit>
> 
> [Note: <sm/> and <em/> are just a way to represent a broken <mrk>, (like
> <sc/> and <ec/> for <pc>). The Inline SC has not worked out completely how
> to represent this, but the bottom line is that we'll have some representation
> of non-well-formed <mrk>.]
> 
> 
> The drawback of using <mrk> for <match> is obviously the added complexity
> when the span associated with the translation candidate corresponds to an
> entire segment (which is most of the cases).
> 
> I suppose we could imagine some well-defined 'shortcut' way to declare that
> a <mrk> that spans the full content of a <segment> is linked to its <match> in
> some implicit way and can be omitted.
> 
> This could be written <match id='m1' segment='seg1'>...</match> when the
> match m1 is associated with the entire content of the segment seg1. For example:
> 
> <unit id="1">
>  <segment id='seg1'>
>   <source>Some text: and more</source>
>  </segment>
>  <matches>
>   <match id='m1' segment='seg1'>
>    <source>Some text: and more</source>
>    <target>Du texte : et plus</target>
>   </match>
>  </matches>
> </unit>
> 
> Such a shortcut could be used for all similar annotations.
> 
> Actually, another way to look at it is to say that <match> can apply to its
> <unit>, a <segment> or a <mrk> using some composite notation like this:
> 
> <match id='m1' scope='unit'>...</match>
> <match id='m2' scope='segment:seg1'>...</match>
> <match id='m3' scope='mrk:id1'>...</match>
> 
> Whatever the notation, the idea is to make <match> follow a pattern that we
> can re-use with other features. This should also simplify the implementation:
> A tool that would support such representation for <match> would have most
> of the code it needs to support similar annotation features.
> 
> Cheers,
> -yves
> 
References:
- Generic mechanism for translation candidate elements and other annotations
  - From: Yves Savourel <ysavourel@enlaso.com>