Subject: RE: [xliff] Generic mechanism for translation candidate elements and other annotations
Hi Yves,

The effect of re-segmentation on matches is not new. This time we have to add processing expectations that require updating matches according to the changes.

Adding matches to a segment should not alter the source text. Putting <mrk> elements in <source> to indicate the section that has matches is really ugly. The complexity added to a <source> element that has multiple overlapping sub-segment matches would be annoying.

There may be a need to know what section of <source> is being matched, and the relevant information should live in the corresponding <match> element, keeping the original <source> clean. This can be done, for example, by using 2 attributes: one attribute indicates the offset where the match starts and the other indicates the length of the text matched (in both cases ignoring tags).

For example:

<segment>
  <source>white, red, green, yellow, blue, black</source>
  <matches>
    <match mstart="7" mlength="3">
      <source>red</source>
      <target>rojo</target>
    </match>
    <match mstart="7" mlength="10">
      <source>red and green</source>
      <target>rojo y verde</target>
    </match>
    <match mstart="12" mlength="13">
      <source>green and yellow</source>
      <target>verde y amarillo</target>
    </match>
  </matches>
</segment>

The example above shows sub-segment matches that overlap and that carry information on the regions they match, keeping the source text clean.

If the <segment> is partitioned, the tool doing so would have to adjust the starting point of each match and, if necessary, the length of the fragment being matched.

Regards,
Rodolfo
--
Rodolfo M. Raya       rmraya@maxprograms.com
Maxprograms       http://www.maxprograms.com

> -----Original Message-----
> From: xliff@lists.oasis-open.org [mailto:xliff@lists.oasis-open.org] On Behalf Of Yves Savourel
> Sent: Wednesday, March 07, 2012 1:07 AM
> To: xliff@lists.oasis-open.org
> Subject: [xliff] Generic mechanism for translation candidate elements and other annotations
>
> Hi everyone,
>
> The current proposal for translation candidates calls for a fairly simple structure where a <matches> element holds a list of <match> elements and can be associated with a <segment> or a <unit>.
>
> However, in light of the following two issues, I'm not sure if this is enough:
>
>
> a) The first issue is that segments can be re-segmented.
>
> This means some <match> elements may become invalid and should be removed, or somehow flagged, or have their score modified to indicate they don't correspond any more to the segment they are attached to.
>
> For example, initial entry:
>
> <unit id="1">
>   <segment>
>     <source>Some text: and more</source>
>     <matches>
>       <match>
>         <source>Some text: and more</source>
>         <target>Du texte : et plus</target>
>       </match>
>     </matches>
>   </segment>
> </unit>
>
> Then after re-segmentation, we'll have to decide what to do with the match:
>
> <unit id="1">
>   <segment>
>     <source>Some text: </source>
>     <matches>
>       <match> <!-- Not a good match any more -->
>         <source>Some text: and more</source>
>         <target>Du texte : et plus</target>
>       </match>
>     </matches>
>   </segment>
>   <segment>
>     <source>and more</source>
>   </segment>
> </unit>
>
>
> b) The second issue is that, nowadays, translation candidates are not just for segments.
>
> More and more tools provide phrase-level matches (sub-sentence matches). The current mechanism does not handle such cases.
>
>
> But more importantly, in addition to these two issues, the case of <match> is an illustration of a more general challenge that XLIFF 2.0 needs to tackle in a consistent way: annotations.
>
> More and more processes are able to 'enrich' the extracted document with information pertaining to a span of the content (which may or may not correspond to a segment). Translation candidates are just one case among many. Attaching matches to source content is not very different from associating QA errors with a chunk of text, or labeling a phrase with a translator comment, etc.
>
> I think we need to have a common pattern to implement such features. This may also allow us to have common processing expectations and to address, at the core level, potential problems with modules/extensions that follow the same pattern.
>
>
> To go back to our <match> example: One possible way to solve this could be to link a <match> not to a <segment> or a <unit> but to an <mrk>.
>
> <unit id="1">
>   <segment>
>     <source><mrk id='1' type='match' ref='m1'>Some text: and more</mrk></source>
>   </segment>
>   <matches>
>     <match id='m1'>
>       <source>Some text: and more</source>
>       <target>Du texte : et plus</target>
>     </match>
>   </matches>
> </unit>
>
> After re-segmentation the match is still valid.
>
> <unit id="1">
>   <segment>
>     <source><sm id='1' type='match' ref='m1'/>Some text: </source>
>   </segment>
>   <segment>
>     <source>and more<em rid='1'/></source>
>   </segment>
>   <matches>
>     <match id='m1'>
>       <source>Some text: and more</source>
>       <target>Du texte : et plus</target>
>     </match>
>   </matches>
> </unit>
>
> [Note: <sm/> and <em/> are just a way to represent a broken <mrk> (like <sc/> and <ec/> for <pc>). The Inline SC has not worked out completely how to represent this, but the bottom line is that we'll have some representation of non-well-formed <mrk>.]
>
>
> The drawback of using <mrk> for <match> is obviously the added complexity when the span associated with the translation candidate corresponds to an entire segment (which is most of the cases).
>
> I suppose we could imagine some well-defined 'shortcut' way to declare that a <mrk> that spans the full content of a <segment> is linked to its <match> in some implicit way and can be omitted.
>
> For example, <match id='m1' segment='seg1'>...</match> when the match m1 is associated with the entire content of the segment seg1:
>
> <unit id="1">
>   <segment id='seg1'>
>     <source>Some text: and more</source>
>   </segment>
>   <matches>
>     <match id='m1' segment='seg1'>
>       <source>Some text: and more</source>
>       <target>Du texte : et plus</target>
>     </match>
>   </matches>
> </unit>
>
> Such a shortcut could be used for all similar annotations.
>
> Actually, another way to look at it is to say that <match> can apply to its <unit>, a <segment> or a <mrk> using some composite notation like this:
>
> <match id='m1' scope='unit'>...</match>
> <match id='m2' scope='segment:seg1'>...</match>
> <match id='m3' scope='mrk:id1'>...</match>
>
> Whatever the notation, the idea is to make <match> follow a pattern that we can re-use with other features. This should also simplify the implementation: a tool that supports such a representation for <match> would have most of the code it needs to support similar annotation features.
>
> Cheers,
> -yves
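
The mstart/mlength idea in the reply at the top of this message can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not something either author proposed in this form: offsets are taken to be zero-based (which is what the "red" example with mstart="7" implies), inline tags are stripped with a simple regular expression rather than a real XML parser, and the names plain_text, matched_text, split_matches and the Match class are hypothetical. How to treat a match that straddles the split point is not settled in the thread, so the sketch just returns such matches separately for the tool to decide.

```python
import re
from dataclasses import dataclass

# Assumption: inline codes look like ordinary XML tags and no entities are
# present; a real tool would use an XML parser instead of a regex.
TAG = re.compile(r"<[^>]+>")


@dataclass
class Match:
    mstart: int   # zero-based offset into the tag-free source text (assumed)
    mlength: int  # number of matched characters, tags ignored
    target: str   # translation candidate


def plain_text(source: str) -> str:
    """Return the source content with inline tags removed."""
    return TAG.sub("", source)


def matched_text(source: str, match: Match) -> str:
    """Return the region of the source that the match refers to."""
    text = plain_text(source)
    return text[match.mstart:match.mstart + match.mlength]


def split_matches(text: str, matches: list, split_at: int):
    """Split tag-free text at split_at and redistribute its matches.

    Matches entirely left of the split keep their offsets, matches entirely
    right of it are shifted by split_at, and matches that straddle the split
    are returned separately (hypothetical policy: the caller decides whether
    to drop, flag, or re-score them).
    """
    left, right, straddling = [], [], []
    for m in matches:
        end = m.mstart + m.mlength
        if end <= split_at:
            left.append(m)
        elif m.mstart >= split_at:
            right.append(Match(m.mstart - split_at, m.mlength, m.target))
        else:
            straddling.append(m)
    return (text[:split_at], left), (text[split_at:], right), straddling
```

With the source "white, red, green, yellow, blue, black", splitting after "white, red, " (offset 12) would leave the "red" match on the first segment, move the "green, yellow" match to offset 0 of the second segment, and report the "red, green" match as straddling the boundary.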