xliff-seg message

Subject: RE: [xliff-seg] Segmentation representation and scenario
From: Magnus Martikainen <magnus@trados.com>
To: Yves Savourel <ysavourel@translate.com>, XLIFF Segmentation <xliff-seg@lists.oasis-open.org>
Date: Fri, 14 May 2004 11:18:07 -0700

Hi Yves,

Thanks for your suggestion! There are definitely quite a few interesting
benefits with this implementation option. 
I have a few comments:

1) Logically there is a strong coupling between the source and target
versions of a segment (the <mrk> in the <source> and the corresponding <mrk>
in the <target>). In my opinion this connection is not explicit enough in
your suggested representation, since it cannot be expressed in the schema.
To my knowledge there is no way to define a schema or DTD that validates
that each segment in the <source> has a corresponding segment in the
<target> and vice versa.
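
Since a schema cannot express this pairing, a tool would have to check it in
code. A minimal sketch in Python, assuming un-namespaced element names and
that segments are paired through identical mid values (the helper name
unpaired_segments and the sample unit are made up for illustration):

```python
import xml.etree.ElementTree as ET

def unpaired_segments(trans_unit_xml):
    """Return (source_only, target_only) sets of mid values for one <trans-unit>."""
    unit = ET.fromstring(trans_unit_xml)

    def mids(parent_tag):
        # Collect the mid of every <mrk> below <source> or <target>.
        parent = unit.find(parent_tag)
        if parent is None:
            return set()
        return {m.get("mid") for m in parent.iter("mrk")}

    src, tgt = mids("source"), mids("target")
    return src - tgt, tgt - src

# Hypothetical unit: segment 2-2 exists in the source but not in the target.
SAMPLE = """<trans-unit id='2'>
 <source xml:lang='en'><mrk mid='2-1' mtype='phrase'>First.</mrk>
 <mrk mid='2-2' mtype='phrase'>Second.</mrk></source>
 <target xml:lang='fr'><mrk mid='2-1' mtype='phrase'>Premier.</mrk></target>
</trans-unit>"""

print(unpaired_segments(SAMPLE))  # ({'2-2'}, set())
```

This only compares sets of mid values; it does not catch a mid that is used
twice on the same side.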

2) There are a number of properties that apply to individual source and
target segment pairs, which cannot easily be represented in a natural way
that shows their connection to the segments. 
a) Of the current attributes available for <trans-unit> at least the
following would be useful to have also for segments: approved, translate,
phase-name.
b) Alternative translations, fuzzy matches, etc., as represented by
<alt-trans> for the <trans-unit>. Although it would be possible to use
<alt-trans> at the <trans-unit> level to store these matches, that would be
misleading, as they don't apply to the entire <trans-unit>. For example, an
<alt-trans> can be an exact match for a segment, but it cannot be marked as
an exact match because that would be interpreted as an exact match for the
whole <trans-unit>.
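
To make the point about ownership concrete, here is a rough data model in
Python in which these properties hang off the segment rather than the unit.
All class and field names are hypothetical; they only mirror the attributes
listed above and are not part of any XLIFF schema.

```python
from dataclasses import dataclass, field

@dataclass
class AltTrans:
    target: str
    match_quality: int  # percent; 100 means exact match for this segment only

@dataclass
class Segment:
    mid: str
    source: str
    target: str = ""
    approved: bool = False      # mirrors <trans-unit approved="...">
    translate: bool = True      # mirrors <trans-unit translate="...">
    phase_name: str = ""        # mirrors <trans-unit phase-name="...">
    alt_trans: list = field(default_factory=list)

@dataclass
class TransUnit:
    id: str
    segments: list = field(default_factory=list)

    def fully_matched(self):
        # The unit counts as exactly matched only if every segment has an exact match.
        return bool(self.segments) and all(
            any(a.match_quality == 100 for a in s.alt_trans) for s in self.segments
        )

unit = TransUnit("2", [
    Segment("2-1", "First sentence.", alt_trans=[AltTrans("Première phrase.", 100)]),
    Segment("2-2", "Second sentence."),
])
print(unit.fully_matched())  # False: only segment 2-1 has an exact match
```

With the match stored on segment 2-1, nothing suggests that the whole
<trans-unit> was matched.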

3) If <g> is used inside the <trans-unit> and spans a segment boundary we
cannot use this implementation, as the <g> and <mrk> elements would overlap
and the XML would no longer be well-formed. Instead we would need to resort
to using empty <mrk/> elements to show the start and end of the segments,
perhaps something like this:

<trans-unit id="9">
  <source xml:lang="en"><mrk mid="9-1" mtype="segment-start"/>This is <g>the
first sentence. <mrk mid="9-1" mtype="segment-end"/> <mrk mid="9-2"
mtype="segment-start"/>Second part</g> of the segment. <mrk mid="9-2"
mtype="segment-end"/></source>
  ...
</trans-unit>

This works, but it is not pretty. It makes the coupling between segment
markers even looser, both in source and target. With this representation it
is not even possible to use a schema to validate that each segment start has
a corresponding segment end.
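
The start/end pairing would therefore also have to be checked by the tool
itself. A small sketch in Python, assuming the "segment-start" and
"segment-end" mtype values from the example above (they are not defined by
the specification) and a made-up helper name marker_errors:

```python
import xml.etree.ElementTree as ET

def marker_errors(source_xml):
    """List pairing problems among empty <mrk/> segment markers, in document order."""
    root = ET.fromstring(source_xml)
    open_mids = []   # segments that have started but not yet ended
    errors = []
    for mrk in root.iter("mrk"):
        mid, mtype = mrk.get("mid"), mrk.get("mtype")
        if mtype == "segment-start":
            if mid in open_mids:
                errors.append(f"duplicate start for {mid}")
            else:
                open_mids.append(mid)
        elif mtype == "segment-end":
            if mid in open_mids:
                open_mids.remove(mid)
            else:
                errors.append(f"end without start for {mid}")
    errors.extend(f"start without end for {mid}" for mid in open_mids)
    return errors

GOOD = ('<source xml:lang="en"><mrk mid="9-1" mtype="segment-start"/>This is '
        '<g id="1">the first sentence. <mrk mid="9-1" mtype="segment-end"/> '
        '<mrk mid="9-2" mtype="segment-start"/>Second part</g> of the segment. '
        '<mrk mid="9-2" mtype="segment-end"/></source>')
# Same content with the last end marker dropped.
BAD = GOOD.replace('<mrk mid="9-2" mtype="segment-end"/>', '')

print(marker_errors(GOOD))  # []
print(marker_errors(BAD))   # ['start without end for 9-2']
```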

The use of <g> will cause problems for segmentation whichever representation
we choose, and it may in fact turn out that this representation is one of
the few that can handle it in some way at all...

Cheers,
Magnus

-----Original Message-----
From: Yves Savourel [mailto:ysavourel@translate.com] 
Sent: Tuesday, May 11, 2004 7:17 AM
To: XLIFF Segmentation
Subject: [xliff-seg] Segmentation representation and scenario

Some ideas on segmentation representation:

For representing the segmentation inside a <trans-unit> I would use the
<mrk> element:

<trans-unit id='2'>
 <source xml:lang='en'><mrk mid='2-1' mtype='phrase'>This is the second
entry of the file.</mrk>
<mrk mid='2-2' mtype='phrase'>This is the second sentence of the second
entry.</mrk></source>
 <target xml:lang='fr'><mrk mid='2-1' mtype='phrase'>Ceci est la deuxième
entrée du fichier.</mrk>
<mrk mid='2-2' mtype='phrase'>Ceci est la seconde phrase de la deuxième
entrée.</mrk></target>
</trans-unit>

- It's part of the existing specifications.
- It's unobtrusive: mergers are supposed to ignore it.
- We can have a set of specific extended attributes if we want to store
sentence-level information.
- We would probably need to add an mtype value specific to a 'segment'
('phrase' is not good enough).

I agree that translation tools should be able to provide their own
segmentation within a <trans-unit>, and to do so during the translation
itself (by the translator).

I also think that a translation tool should be able to use any existing
match at the <trans-unit> level as well: there is no reason to go to a finer
granularity if a match is already available at the <trans-unit> level. That
said, there is obviously a threshold of usability for fuzzy matches at the
<trans-unit> level, and that threshold most likely depends on the size of
the text in the <trans-unit> (for large units the differences between the
new source and the old one may be more difficult to see).

I think a translation process should be able to take advantage of such high
matches obtained without the translation tool and without segmentation of
the <trans-unit> content. Translation tools should allow the verification of
such matches during the translation.

For example: one can imagine a project where version 2 of a software product
is to be localized. A version 1 with translation exists, but no TM. One can
easily create a "TM" for <trans-unit> level entries without complex tools.
One should be able to re-use high matches from that "TM" regardless of what
segmentation is used by the translation tools.
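
A rough sketch of such a <trans-unit>-level "TM" in Python, using only the
standard library. It assumes un-namespaced element names and plain-text
comparison (inline codes are flattened); the function names, the 0.85 cutoff
and the sample strings are illustrative assumptions.

```python
import difflib
import xml.etree.ElementTree as ET

def plain_text(elem):
    """Text content of an element with inline markup flattened and whitespace normalized."""
    return " ".join("".join(elem.itertext()).split()) if elem is not None else ""

def build_tm(xliff_v1):
    """Map each translated source string of version 1 to its target string."""
    tm = {}
    for unit in ET.fromstring(xliff_v1).iter("trans-unit"):
        src, tgt = plain_text(unit.find("source")), plain_text(unit.find("target"))
        if src and tgt:
            tm[src] = tgt
    return tm

def leverage(xliff_v2, tm, cutoff=0.85):
    """Map trans-unit id -> reusable translation for exact or high fuzzy matches."""
    hits = {}
    for unit in ET.fromstring(xliff_v2).iter("trans-unit"):
        src = plain_text(unit.find("source"))
        best = difflib.get_close_matches(src, list(tm), n=1, cutoff=cutoff)
        if best:
            hits[unit.get("id")] = tm[best[0]]
    return hits

V1 = ("<body><trans-unit id='1'><source>Open the file</source>"
      "<target>Ouvrir le fichier</target></trans-unit>"
      "<trans-unit id='2'><source>Close the door</source>"
      "<target>Fermer la porte</target></trans-unit></body>")
V2 = ("<body><trans-unit id='1'><source>Open the file.</source></trans-unit>"
      "<trans-unit id='7'><source>Delete the selected item</source></trans-unit></body>")

print(leverage(V2, build_tm(V1)))  # {'1': 'Ouvrir le fichier'}
```

No segmentation is involved at any point: the whole <trans-unit> content is
the lookup key, so the result is independent of how a translation tool later
segments the text.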


Cheers,
-yves