Subject: RE: [xliff] Re-segmentation
From: "Estreen, Fredrik" <Fredrik.Estreen@lionbridge.com>
To: Yves Savourel <ysavourel@enlaso.com>, 'XLIFF Main List' <xliff@lists.oasis-open.org>
Date: Mon, 17 Jun 2013 08:10:19 +0000

Hi Yves, Ryan,

After getting some more time to think about this, I'm no longer convinced that using <mrk> to mark up sections of text will work well for the many use cases where we also need to annotate <target> content. My fear is that it will be very hard to propagate markup from source to target in automatic processing. Likewise, it will be time-consuming for the translator to do manually, driving up the cost of translating such material.

Consider this example, where we go from sub-sentence segmentation to sentence segmentation:

<unit>
  <segment>
    <source><mrk id="1">Joe read the book,</mrk></source>
  </segment>
  <ignorable>
    <source> </source>
  </ignorable>
  <segment>
    <source><mrk id="2">but his friend saw the movie.</mrk></source>
  </segment>
</unit>

This is transformed into:

<unit>
  <segment>
    <source><mrk id="1">Joe read the book,</mrk> <mrk id="2">but his friend saw the movie.</mrk></source>
  </segment>
</unit>
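
The source side of this transformation is mechanical, which a minimal sketch can illustrate. It uses Python's standard xml.etree.ElementTree; the function names merge_sources and resegment are made up for illustration, and attributes, <target> content and namespaces are ignored.

```python
import xml.etree.ElementTree as ET

UNIT = """<unit>
  <segment>
    <source><mrk id="1">Joe read the book,</mrk></source>
  </segment>
  <ignorable>
    <source> </source>
  </ignorable>
  <segment>
    <source><mrk id="2">but his friend saw the movie.</mrk></source>
  </segment>
</unit>"""


def merge_sources(unit):
    """Join the <source> content of every <segment>/<ignorable> into one <source>."""
    merged = ET.Element("source")
    last = None  # last child appended to merged so far
    for part in unit:
        src = part.find("source")
        if src is None:
            continue
        # Leading text of this <source> follows whatever was appended last.
        text = src.text or ""
        if last is None:
            merged.text = (merged.text or "") + text
        else:
            last.tail = (last.tail or "") + text
        # Inline children (and their tails) are carried over unchanged.
        for child in list(src):
            merged.append(child)
            last = child
    return merged


def resegment(unit_xml):
    """Hypothetical helper: collapse all parts of a unit into one segment."""
    unit = ET.fromstring(unit_xml)
    new_unit = ET.Element("unit")
    segment = ET.SubElement(new_unit, "segment")
    segment.append(merge_sources(unit))
    return new_unit


merged_unit = resegment(UNIT)
print(ET.tostring(merged_unit, encoding="unicode"))
```

Nothing comparable exists for the target side, since the position of each <mrk> there depends on the translation.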

When translating the new segment, we need to somehow position the <mrk> elements around the appropriate subsets of the target. For some things it might not matter too much where we put them; for others it will be critical. For example, a validation rule would need to be around the target portion that actually corresponds to the marked-up source to be meaningful at all. For change tracking it might not be functionally important (all translatable text is tracked anyway), but the value of the tracking information could be reduced if it does not track the same semantic part of source and target. In the above example, the comma might have gone missing in some languages or in a lower-quality TM match, which complicates finding the right midpoint to use.

Here is a Swedish translation without the comma but with the <mrk> elements correctly placed. According to strict grammatical rules the comma should be present in Swedish, but comma usage is shifting away from that toward a looser set of rules based on general readability, so we assume a translator left it out. It is simple for a human to place the <mrk> elements correctly, but it takes extra time. For a TM matching system it would be impossible without one of the following: semantic knowledge about the source and target languages, <mrk> elements already in the TM, or additional sub-segment matches.

<unit>
  <segment>
    <source><mrk id="1">Joe read the book,</mrk> <mrk id="2">but his friend saw the movie.</mrk></source>
    <target><mrk id="1">Joe läste boken</mrk> <mrk id="2">men hans kompis såg filmen.</mrk></target>
  </segment>
</unit>
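
To make the difficulty concrete, here is a minimal sketch of the kind of punctuation-based heuristic an automatic process might try. Everything in it is an assumption for illustration: the function names are hypothetical, the text is handled as plain strings without XML escaping, and the with-comma Swedish sentence is simply the translation above with the comma restored.

```python
import re


def split_at_clause_boundary(text):
    """Hypothetical heuristic: split after the first comma followed by whitespace."""
    match = re.search(r",\s+", text)
    if match is None:
        return None
    # Keep the comma with the first part, drop the whitespace between the parts.
    return text[:match.start() + 1], text[match.end():]


def place_markers(target_text, ids=("1", "2")):
    """Wrap the two halves of the target in <mrk> elements, if a split point exists."""
    parts = split_at_clause_boundary(target_text)
    if parts is None:
        return None  # no anchor found: placement has to be done manually
    first, second = parts
    return f'<mrk id="{ids[0]}">{first}</mrk> <mrk id="{ids[1]}">{second}</mrk>'


# Works only while the target keeps the punctuation the heuristic relies on.
with_comma = place_markers("Joe läste boken, men hans kompis såg filmen.")
# The translation shown above has no comma, so the heuristic finds no split point.
without_comma = place_markers("Joe läste boken men hans kompis såg filmen.")

print(with_comma)
print(without_comma)
```

A heuristic like this also breaks when the target reorders the clauses or contains additional commas, which is why it cannot stand in for real alignment information.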

If we allow markup that needs to be linked between <source> and <target> at the segment level, moving the markup from <segment> to <mrk> makes it technically possible to re-segment. But it would still be somewhere between hard and impossible in practice for machine processes to get it right. Perhaps that is not a big issue, and in those instances we would just rely on manual placement after the automatic process, but this seems to go against the current trend of doing more automated processing.

Regards,
Fredrik Estreen

> -----Original Message-----
> From: xliff@lists.oasis-open.org [mailto:xliff@lists.oasis-open.org] On Behalf
> Of Yves Savourel
> Sent: den 13 juni 2013 07:15
> To: 'XLIFF Main List'
> Subject: RE: [xliff] Re-segmentation
> 
> Hi Ryan, all,
> 
> I'm trying to see any drawbacks to the proposal.
> As a transport/exchange format I don't see why this would not work.
> 
> Thinking about import/export from/to a tool: I suppose some tools will have
> to break down the unique marker into several if their internal annotation
> model supports only one annotation per marker, so that may make the code
> a bit more tricky (and for output too).
> But that is not a big issue.
> 
> As long as such a representation is not a must but just a possible notation,
> that should be ok.
> 
> So we would have to add an extra pre-defined type of annotation for mrk:
> 'ref' or 'references'.
> 
> The only issue I see is the redundancy with the normal ref attribute of mrk.
> When you have a single reference to place, what do you use?
> 
> <mrk id='1' type='ctr:changeTrack' ref='#c1'>
> or
> <mrk id='1' type='ref' ctr:changeTrackID="c1">
> 
> I would also use a name like ctr:ref rather than ctr:changeTrackID as the
> attribute value is a reference to the ID of the block of info rather than an ID.
> 
> Also: should the block of information have a reference to the marker? In the
> current proposal you have to be on the mrk to know where to get the info.
> But it's more complicated to find the marker from the block of info:
> you can't use the ID mechanism, since ctr:changeTrackID cannot be both a
> reference and an ID (you would have duplicated ID values). You can obviously
> always get to the mrk using XPath rather than the id() function, so maybe
> that is not an issue.
> 
> Just thinking aloud...
> -ys
> 
> 
> -----Original Message-----
> From: Ryan King [mailto:ryanki@microsoft.com]
> Sent: Wednesday, June 12, 2013 4:32 PM
> To: Yves Savourel; XLIFF Main List
> Subject: RE: [xliff] Re-segmentation
> 
> After our panel discussion today at the symposium and trying to visualize
> this, I think we may be over-complicating the structure using annotations to
> point to modules that contain segment-level metadata. For example, here is
> what we have defined today in the spec:
> 
> <unit>
>   <segment id="1">
>     <source>Hello World. Hello World 2.</source>
>     <target>Hello World. Hello World 2.</target>
>     <ctr:changeTrack>...</ctr:changeTrack>
>     <mda:metadata>...</mda:metadata>
>     <val:validation>...</val:validation>
>   </segment>
> </unit>
> 
> And the same thing using annotations after re-segmenting in the way I think
> we've been discussing it, where maybe the second segment needs
> validation, but the first doesn't, but they both need metadata and they both
> need change tracking.
> 
> <unit>
>   <segment id="1">
>     <source><mrk id="1" type="changeTrack" ref="#c1"><mrk id="2"
> type="metadata" ref="#m1"><mrk id="3" type="validation"
> ref="#v1">Hello World.</mrk></mrk></mrk></source>
>     <target><mrk id="1" type="changeTrack" ref="#c1"><mrk id="2"
> type="metadata" ref="#m1"><mrk id="3" type="validation"
> ref="#v1">Hello World.</mrk></mrk></mrk></target>
>   </segment>
>   <segment id="2">
>     <source><mrk id="1" type="changeTrack" ref="#c2"><mrk id="2"
> type="metadata" ref="#m2">Hello World 2.</mrk></mrk></source>
>     <target><mrk id="1" type="changeTrack" ref="#c2"><mrk id="2"
> type="metadata" ref="#m2">Hello World 2.</mrk></mrk></target>
>   </segment>
>   <ctr:changeTrack id="c1">...</ctr:changeTrack>
>   <mda:metadata id="m1">...</mda:metadata>
>   <val:validation id="v1">...</val:validation>
>   <ctr:changeTrack id="c2">...</ctr:changeTrack>
>   <mda:metadata id="m2">...</mda:metadata>
>   <val:validation id="v3">...</val:validation>
> </unit>
> 
> Right away, as Yves pointed out, that is a lot of <mrk> elements (and there
> would potentially be more with matches, etc.) surrounding the actual source
> and target text. Also, it is ambiguous, because it looks like I have <mrk>
> elements embedded in other <mrk> elements and this is technically not the
> case. Maybe it would make more sense to have each module, or extension,
> with segment-level metadata, define an attribute that could be used in a
> custom annotation for referencing. For example, something like a custom
> "reference" annotation:
> 
> <unit>
>   <segment id="1">
>     <source><mrk id="1" type="reference" ctr:changeTrackID="c1"
> mda:metadataID="m1" val:validationID="v1" translate="yes">Hello
> World</mrk></source>
>     <target><mrk id="1" type="reference" ctr:changeTrackID="c1"
> mda:metadataID="m1" val:validationID="v1" translate="yes">Hello
> World</mrk></target>
>   </segment>
>   <segment id="2">
>     <source><mrk id="2" type="reference" ctr:changeTrackID="c2"
> mda:metadataID="m2" translate="yes">Hello World 2</mrk></source>
>     <target><mrk id="2" type="reference" ctr:changeTrackID="c2"
> mda:metadataID="m2" translate="yes">Hello World 2</mrk></target>
>   </segment>
>   <ctr:changeTrack id="c1">...</ctr:changeTrack>
>   <mda:metadata id="m1">...</mda:metadata>
>   <val:validation id="v1">...</val:validation>
>   <ctr:changeTrack id="c2">...</ctr:changeTrack>
>   <mda:metadata id="m2">...</mda:metadata>
>   <val:validation id="v3">...</val:validation>
> </unit>
> 
> What do you think?
> 
> Ryan
> 
> -----Original Message-----
> From: xliff@lists.oasis-open.org [mailto:xliff@lists.oasis-open.org] On Behalf
> Of Yves Savourel
> Sent: Wednesday, June 12, 2013 5:48 AM
> To: XLIFF Main List
> Subject: [xliff] Re-segmentation
> 
> Hi all,
> 
> Thinking more about the different solutions for re-segmentation in 2.0,
> especially about solution #4:
> 
> - We would have to define PRs for the <segment> attributes like translate,
> approved, state, etc.
> Note that translate would logically become a <mrk translate='yes|no'>. Does
> that mean we should always have this info as an <mrk>?
> 
> - We would have to add an id in all top elements like <matches>,
> <changeTrack> and allow multiple of them at the <unit> level.
> 
> - The part that concerns me most is the paradigm shift for developers.
> Traditionally many tools are segment-based, and with solution
> #4 they would have to change how much of the metadata for the segments is
> stored, and decide what to do with the parts that don't correspond to a
> segment anymore (overlapping <mrk>s and sub-segment <mrk>).
> 
> - We may end up with <segment> elements containing a lot of <mrk> at both
> ends. It may take some effort to deal with those. They may have some side
> effects on functions like TM matching, etc.
> 
> I'm still relatively sure that #4 is the better representation in the
> long term, but it is a very big change. So the more feedback before we go
> that way the better. And we really need examples and working
> implementations for this.
> 
> Cheers,
> -yves
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe from this mail list, you must leave the OASIS TC that
> generates this mail.  Follow this link to all your TCs in OASIS at:
> https://www.oasis-open.org/apps/org/workgroup/portal/my_workgroups.php
> 
Follow-Ups:
- RE: [xliff] Re-segmentation
  - From: "Schnabel, Bryan S" <bryan.s.schnabel@tektronix.com>
References:
- Re-segmentation
  - From: Yves Savourel <ysavourel@enlaso.com>
- RE: [xliff] Re-segmentation
  - From: Ryan King <ryanki@microsoft.com>
- RE: [xliff] Re-segmentation
  - From: Yves Savourel <ysavourel@enlaso.com>