xliff-seg message

Subject: RE: [xliff-seg] Segmentation and filters
From: Magnus Martikainen <magnus@trados.com>
To: Andrzej Zydron <azydron@xml-intl.com>, xliff-seg@lists.oasis-open.org
Date: Wed, 24 Mar 2004 18:34:51 -0800
Hi all,

Thanks very much, Andrzej, for your clear and structured arguments and
examples; they are very useful for further discussion of this topic.

Here are my comments on this thread:

1) I agree that segmentation should not be mandatory.
However, I am also of the opinion that segmentation should always be allowed,
whether or not the original content was supposedly segmented during
extraction. The reason is that the segment boundaries may still need to be
adapted to fit the tools and the translation memories used during
localisation, which could use slightly different segmentation. (A common
example would be the handling of tags at sentence boundaries.) Since the
user's goal must be to achieve maximum leverage from their translation memory
resources, it may be necessary to adjust segmentation in such cases as well.
As a side effect, if we agree that we always want to "allow" segmentation of
the content, I see no need for an explicit segmented="true/false" attribute.


2) I can think of a couple of situations where a "double conversion" as you
are suggesting would cause problems:

a) It applies new "hard" boundaries to the segments. Thus segmentation
cannot be changed later, e.g. during interactive translation. Sometimes it
is necessary or desirable to change the default segmentation to accommodate
translation needs while working on the document. Examples include:
- the need to adjust segmentation that has been incorrectly applied (e.g. an
abbreviation in the middle of a sentence that has been wrongly interpreted
by the segmentation tool as the end of that sentence).
- the occasional need to translate two or more source sentences into one
target language sentence for it to be a meaningful translation


b) During backward conversion of the "doubly converted" XLIFF file to its
original XLIFF format the segment boundaries are lost.
If changes are made to the content of the XLIFF file after it has been
converted back to its original XLIFF format, it is no longer possible to get
those changes back into the "doubly converted" XLIFF document, e.g. in order
to update a translation memory with those changes. Once converted back, the
segment boundaries in both the source and target segments are gone. (The
source segmentation can perhaps be re-created, but it is no longer possible
to determine the corresponding target segments with certainty.)

Example: If the "working" XLIFF file (the segmented version which is used to
interact with the translation memory during translation) after translation
contains this:

<trans-unit id="1.1">
  <source xml:lang="en-US">Long sentence.</source>
  <target xml:lang="sv-SE" state="translated">En lång mening.</target>
</trans-unit>
<trans-unit id="1.2">
  <source xml:lang="en-US">Short sentence.</source>
  <target xml:lang="sv-SE" state="translated">Kort mening.</target>
</trans-unit>

Both of these trans-units belong to the same <trans-unit> in the original
XLIFF file, and when the XLIFF file is converted back to its original XLIFF
format it could look like this (depending on the content of the skeleton
file):

<trans-unit id="1">
  <source xml:lang="en-US">Long sentence. Short sentence.</source>
  <target xml:lang="sv-SE" state="translated">En lång mening. Kort mening.</target>
</trans-unit>

Now someone decides in the last minute that the translation needs to be
changed - the long sentence is for some reason better translated as two
sentences. This change is approved and signed-off. The XLIFF file is changed
into:

<trans-unit id="1">
  <source xml:lang="en-US">Long sentence. Short sentence.</source>
  <target xml:lang="sv-SE" state="final">Lång mening. Mer mening. Kort mening.</target>
</trans-unit>

Unfortunately there is no easy way to update the translation memory
with these changes, since the original segment boundaries that were used
during translation were lost.
Tools can of course try to automatically "guess" the segment boundaries in
the source and target and somehow match them up, but this is not a trivial
task as can be seen from this example. There is no way an automatic tool can
determine if the middle sentence in the modified target should be paired
with the first or the last sentence (if either). Thus there is no way to
safely update the translation memory automatically with these changes.
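
To make the ambiguity concrete, here is a minimal Python sketch. It is only
an illustration: the naive splitter and the helper names split_sentences and
candidate_alignments are assumptions made for this example, not part of any
XLIFF tool. It re-segments the source and the modified target from the
example above and lists every way of pairing them while keeping sentence
order.

```python
import re
from itertools import combinations


def split_sentences(text):
    """Naive splitter (assumption): break after '.', '!' or '?' plus whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def candidate_alignments(src, tgt):
    """Every way to group tgt into len(src) contiguous, non-empty runs."""
    n, m = len(src), len(tgt)
    if n == 0 or m < n:
        return []
    results = []
    # Choose n-1 cut points between the m target sentences.
    for cuts in combinations(range(1, m), n - 1):
        bounds = (0, *cuts, m)
        groups = [" ".join(tgt[bounds[i]:bounds[i + 1]]) for i in range(n)]
        results.append(list(zip(src, groups)))
    return results


source = "Long sentence. Short sentence."
target = "Lång mening. Mer mening. Kort mening."

src_segments = split_sentences(source)
tgt_segments = split_sentences(target)

for pairing in candidate_alignments(src_segments, tgt_segments):
    print(pairing)
```

The loop prints two pairings, one attaching "Mer mening." to the first source
sentence and one attaching it to the second. Nothing in the unsegmented file
says which one the translator intended.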


c) The converted XLIFF file loses its "identity", or its direct connection
with the underlying data format.
Tools that have been developed to specifically process a particular file
type wrapped in XLIFF cannot be used on the "converted" XLIFF file since:
- the original skeleton is no longer available/usable
- some of the content in the original XLIFF file (in particular tags between
sentences) has moved into the new skeleton.
- the new skeleton has been created with a tool and process unknown to any
other XLIFF tools.

A typical example of a tool that would be useful to be able to run during
the localisation process is a verification/validation tool to ascertain that
the translated content can be converted back to a valid original format.
Examples of validation tools:
- Tag verification, validation against the schema, DTD, or other rules that
the content must adhere to.
- Length verification to ensure that translated content does not exceed
length limitations (which may be specified explicitly in the XLIFF file).
Both of these tasks require access to the underlying native data that the
XLIFF file wraps. For the reasons stated above, the "doubly converted" XLIFF
files cannot be used for this.
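
As a rough illustration of the kind of check meant here, the following Python
sketch covers only the format-independent part: it compares the inline codes
of source and target and optionally checks a length limit. The function names
and the max_chars parameter are assumptions made for this example. A real
validator would also need the native format and the original skeleton, which
is exactly what the "doubly converted" file no longer gives it.

```python
import xml.etree.ElementTree as ET


def inline_tags(element):
    """Sorted (tag name, id) pairs of all inline elements below element."""
    return sorted(
        (child.tag, child.get("id", ""))
        for child in element.iter()
        if child is not element
    )


def visible_text(element):
    """Text outside inline codes: the element's own text plus child tails."""
    return (element.text or "") + "".join(child.tail or "" for child in element)


def check_unit(unit_xml, max_chars=None):
    """Return a list of problems found in one <trans-unit> (empty if none)."""
    unit = ET.fromstring(unit_xml)
    source, target = unit.find("source"), unit.find("target")
    problems = []
    if inline_tags(source) != inline_tags(target):
        problems.append("inline tags differ between source and target")
    # max_chars is a hypothetical limit supplied by the caller.
    if max_chars is not None and len(visible_text(target)) > max_chars:
        problems.append("target exceeds length limit")
    return problems


good = """<trans-unit id="3.2">
  <source xml:lang="en-US">More <bpt id="1">[LINK-to-toc:</bpt>content<ept id="1">]</ept>.</source>
  <target xml:lang="sv-SE">Ytterligare <bpt id="1">[LINK-to-toc:</bpt>innehåll<ept id="1">]</ept>.</target>
</trans-unit>"""

# Same unit, but the closing <ept> code was lost during translation.
broken = good.replace('innehåll<ept id="1">]</ept>.', "innehåll.")

print(check_unit(good))
print(check_unit(broken))
```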


d) An additional level of unnecessary complexity is introduced, since it is
necessary to do an additional transformation/conversion of the XLIFF
document before it can be processed by the filter that created it.
In something as complex as a typical localisation project this is not a
factor to be neglected. If an average project has 100 files translated into
10 languages, that means an additional 1000 file conversions are necessary
to complete the project. If the workflow for this is not entirely automated it
could mean that someone may need to use a tool to manually check each
individual file to determine which state it is in before the files can be
delivered or further processed.
If other tools used in the localisation process use the same approach of
converting the XLIFF file to a new XLIFF format the complexity is
multiplied... All this can be avoided if we support the notion of segments
directly in XLIFF - then the very same XLIFF file can be used in all stages
of the process.


Looking forward to further comments and discussions on this topic!

Best regards,
Magnus Martikainen
TRADOS Inc.

-----Original Message-----
From: Andrzej Zydron [mailto:azydron@xml-intl.com] 
Sent: Tuesday, March 23, 2004 2:07 PM
To: xliff-seg@lists.oasis-open.org
Subject: [xliff-seg] Segmentation and filters

Hi,

First of all I would like to thank Magnus for the hard work he has put 
in so far and the detailed document that he has prepared. This has 
provided a clear starting point for further discussions.

To kick off this thread I would like to state my views on the 
segmentation issue:

1) Segmentation within XLIFF should not be mandated. It should be 
optional. There are implementations such as xml:tm where segmentation is 
done before extraction. It is also quite easy to envisage situations 
where XLIFF is the output of an existing translation workbench system 
that has already segmented and pre-matched data for sending out to a 
translator who will import it into an XLIFF aware editing environment.

I can also see Magnus' point that quite often XLIFF will contain 
unsegmented data.

One solution would be to provide an optional "segmented" attribute at 
the <file> element level which states that the data has already been
segmented, with a default value of "false". If the data has been 
segmented then an xlink attribute to the SRX URL could also be provided.

2) One way of handling segmentation within XLIFF is to create a 
secondary XLIFF document from the current XLIFF document that has a 
separate <trans-unit> element for each segment. This would effectively 
be a segmentation extraction of the original XLIFF document. This has 
the one significant advantage that no further extensions are required to 
the XLIFF standard. It does away with all the potential complexity of 
trying to nest <trans-unit> elements or add workable syntax to cope with 
multiple source and target segments within a <trans-unit>.

Because XLIFF is a well defined XML format it is very easy to write an 
extraction + segmentation filter for it to provide an XLIFF file where 
the <trans-unit> elements are at the segment level, along with a 
skeleton file for merging back.

After translation you can elect to store leveraged memory at both the 
segmented and unsegmented levels.

Here is an example based on Magnus' data:

Step 1: Original XLIFF file:

<body>
   <trans-unit id="1">
     <source xml:lang="en-US">The Document Title</source>
   </trans-unit>
   <trans-unit id="2">
     <source xml:lang="en-US">First sentence. <bpt 
id="1">[ITALIC:</bpt>This is an important sentence.<ept 
id="1">]</ept></source>
   </trans-unit>
   <trans-unit id="3">
     <source xml:lang="en-US">Ambiguous sentence. More <bpt 
id="1">[LINK-to-toc:</bpt>content<ept id="1">]</ept>.</source>
   </trans-unit>
</body>

Step 2: Introduce namespace segmentation into XLIFF file

<body xmlns:tm="http://www.xml-intl.com/dtd/tm.xsd">
   <trans-unit id="1">
     <source xml:lang="en-US"><tm:tu id="1.1">The Document 
Title</tm:tu></source>
   </trans-unit>
   <trans-unit id="2">
     <source xml:lang="en-US"><tm:tu id="2.1">First sentence.</tm:tu> 
<bpt id="1">[ITALIC:</bpt><tm:tu id="2.2">This is an important 
sentence.</tm:tu><ept id="1">]</ept></source>
   </trans-unit>
   <trans-unit id="3">
     <source xml:lang="en-US"><tm:tu id="3.1">Ambiguous 
sentence.</tm:tu> <tm:tu id="3.2">More <bpt 
id="1">[LINK-to-toc:</bpt>content<ept id="1">]</ept>.</tm:tu></source>
   </trans-unit>
</body>

Step 3: Using a simple XSLT transformation create new segmented XLIFF file:

<body segmented="true" srx="http://www.xml-intl.com/srx/en-US.srx">
   <trans-unit id="1.1">
     <source xml:lang="en-US">The Document Title</source>
   </trans-unit>
   <trans-unit id="2.1">
     <source xml:lang="en-US">First sentence.</source>
   </trans-unit>
   <trans-unit id="2.2">
     <source xml:lang="en-US">This is an important sentence.</source>
   </trans-unit>
   <trans-unit id="3.1">
     <source xml:lang="en-US">Ambiguous sentence.</source>
   </trans-unit>
   <trans-unit id="3.2">
     <source xml:lang="en-US">More <bpt 
id="1">[LINK-to-toc:</bpt>content<ept id="1">]</ept>.</source>
   </trans-unit>
</body>

And Skeleton file:

<body xmlns:tm="http://www.xml-intl.com/dtd/tm.xsd">
   <trans-unit id="1">
     <source xml:lang="en-US"><tm:tu id="1.1"><ext 
id="1.1"/></tm:tu></source>
   </trans-unit>
   <trans-unit id="2">
     <source xml:lang="en-US"><tm:tu id="2.1"><ext id="2.1"/></tm:tu> 
<bpt id="1">[ITALIC:</bpt><tm:tu id="2.2"><ext id="2.2"/></tm:tu><ept 
id="1">]</ept></source>
   </trans-unit>
   <trans-unit id="3">
     <source xml:lang="en-US"><tm:tu id="3.1"><ext id="3.1"/></tm:tu> 
<tm:tu id="3.2"><ext id="3.2"/></tm:tu></source>
   </trans-unit>
</body>
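
The transformation itself is meant to be done in XSLT. Purely as an
illustration of the same idea, here is a rough Python sketch using the
standard xml.etree.ElementTree module. The function name extract_segments is
hypothetical, and the sketch assumes the input is a <body> marked up with
tm:tu elements as in Step 2. It returns the segment-level body and the
skeleton with <ext> placeholders.

```python
import copy
import xml.etree.ElementTree as ET

TM_NS = "http://www.xml-intl.com/dtd/tm.xsd"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
ET.register_namespace("tm", TM_NS)  # keep the tm: prefix when serialising


def extract_segments(marked_body_xml):
    """Split a tm:tu-marked <body> into (segmented body, skeleton body)."""
    skeleton = ET.fromstring(marked_body_xml)
    segmented = ET.Element("body")
    for unit in skeleton.iter("trans-unit"):
        source = unit.find("source")
        lang = source.get(XML_LANG)
        for tu in list(source.iter(f"{{{TM_NS}}}tu")):
            # One new <trans-unit> per segment, carrying the segment id.
            seg_unit = ET.SubElement(segmented, "trans-unit", id=tu.get("id"))
            seg_source = ET.SubElement(seg_unit, "source")
            if lang:
                seg_source.set(XML_LANG, lang)
            seg_source.text = tu.text
            seg_source.extend([copy.deepcopy(child) for child in tu])
            # In the skeleton, replace the segment content by a placeholder.
            for child in list(tu):
                tu.remove(child)
            tu.text = None
            ET.SubElement(tu, "ext", id=tu.get("id"))
    return segmented, skeleton


MARKED = """<body xmlns:tm="http://www.xml-intl.com/dtd/tm.xsd">
  <trans-unit id="1">
    <source xml:lang="en-US"><tm:tu id="1.1">The Document Title</tm:tu></source>
  </trans-unit>
  <trans-unit id="2">
    <source xml:lang="en-US"><tm:tu id="2.1">First sentence.</tm:tu> <bpt id="1">[ITALIC:</bpt><tm:tu id="2.2">This is an important sentence.</tm:tu><ept id="1">]</ept></source>
  </trans-unit>
  <trans-unit id="3">
    <source xml:lang="en-US"><tm:tu id="3.1">Ambiguous sentence.</tm:tu> <tm:tu id="3.2">More <bpt id="1">[LINK-to-toc:</bpt>content<ept id="1">]</ept>.</tm:tu></source>
  </trans-unit>
</body>"""

segmented, skeleton = extract_segments(MARKED)
print(ET.tostring(segmented, encoding="unicode"))
print(ET.tostring(skeleton, encoding="unicode"))
```

Inline codes that sit between segments (the ITALIC pair in unit 2) stay in
the skeleton, while codes inside a segment (the LINK-to-toc pair) travel with
it.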

Step 4: Put segmented XLIFF file through whatever matching process you 
want to, to produce:

<body segmented="true" srx="http://www.xml-intl.com/srx/en-US.srx">
   <trans-unit id="1.1">
     <source xml:lang="en-US">The Document Title</source>
     <target xml:lang="sv-SE" state="translated" 
state-qualifier="leveraged-tm">Dokumentrubriken</target>
   </trans-unit>
   <trans-unit id="2.1">
     <source xml:lang="en-US">First sentence.</source>
     <target xml:lang="sv-SE" state="translated" 
state-qualifier="leveraged-tm">Första meningen.</target>
   </trans-unit>
   <trans-unit id="2.2">
     <source xml:lang="en-US">This is an important sentence.</source>
       <alt-trans origin="translation memory" match-quality="80%">
         <source xml:lang="en-US">This is an extremely important 
sentence.</source>
         <target xml:lang="sv-SE">En mycket viktig mening.</target>
       </alt-trans>
   </trans-unit>
   <trans-unit id="3.1">
     <source xml:lang="en-US">Ambiguous sentence.</source>
     <target xml:lang="sv-SE" state="needs-review-translation">Omstridd 
mening.</target>
       <note annotates="target" from="Swedish Translator">This 
translation may not be appropriate. Please evaluate it carefully!</note>
   </trans-unit>
   <trans-unit id="3.2">
     <source xml:lang="en-US">More <bpt 
id="1">[LINK-to-toc:</bpt>content<ept id="1">]</ept>.</source>
     <target xml:lang="sv-SE" state="translated">Ytterligare <bpt 
id="1">[LINK-to-toc:</bpt>innehåll<ept id="1">]</ept>.</target>
   </trans-unit>
</body>


Step 5: Using nothing more than XSLT, merge the translated document 
back, then strip out the segmented namespace elements using another 
simple XSLT transformation and you arrive at a translated XLIFF file 
that has the same unsegmented structure as the original source language 
file.
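
Again the merge is meant to be done in XSLT. As a rough illustration only,
here is a Python sketch of the merge-and-strip idea. The names merge_unit and
merge_back are hypothetical. The sketch assumes that every segment has a
<target>, that the tm:tu elements are direct children of <source> in the
skeleton, and that the merged result is wanted as one unsegmented <target>
per original trans-unit.

```python
import copy
import xml.etree.ElementTree as ET

TM_NS = "http://www.xml-intl.com/dtd/tm.xsd"
TU = f"{{{TM_NS}}}tu"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def append_text(parent, text):
    """Append text at the current end of parent's mixed content."""
    if not text:
        return
    if len(parent):
        parent[-1].tail = (parent[-1].tail or "") + text
    else:
        parent.text = (parent.text or "") + text


def merge_unit(skel_source, targets, lang):
    """Build one unsegmented <target> from a skeleton <source>."""
    target = ET.Element("target", {XML_LANG: lang})
    append_text(target, skel_source.text)
    for node in skel_source:
        if node.tag == TU:
            # Fill the placeholder with the translated segment content,
            # dropping the tm:tu wrapper itself.
            seg = targets[node.find("ext").get("id")]
            append_text(target, seg.text)
            target.extend([copy.deepcopy(child) for child in seg])
            append_text(target, node.tail)
        else:
            target.append(copy.deepcopy(node))  # inline code, tail included
    return target


def merge_back(skeleton_xml, translated_xml, lang):
    """Map each original trans-unit id to its merged <target> element."""
    skeleton = ET.fromstring(skeleton_xml)
    translated = ET.fromstring(translated_xml)
    targets = {u.get("id"): u.find("target")
               for u in translated.iter("trans-unit")}
    return {unit.get("id"): merge_unit(unit.find("source"), targets, lang)
            for unit in skeleton.iter("trans-unit")}


SKELETON = """<body xmlns:tm="http://www.xml-intl.com/dtd/tm.xsd">
  <trans-unit id="3">
    <source xml:lang="en-US"><tm:tu id="3.1"><ext id="3.1"/></tm:tu> <tm:tu id="3.2"><ext id="3.2"/></tm:tu></source>
  </trans-unit>
</body>"""

TRANSLATED = """<body>
  <trans-unit id="3.1">
    <source xml:lang="en-US">Ambiguous sentence.</source>
    <target xml:lang="sv-SE">Omstridd mening.</target>
  </trans-unit>
  <trans-unit id="3.2">
    <source xml:lang="en-US">More <bpt id="1">[LINK-to-toc:</bpt>content<ept id="1">]</ept>.</source>
    <target xml:lang="sv-SE">Ytterligare <bpt id="1">[LINK-to-toc:</bpt>innehåll<ept id="1">]</ept>.</target>
  </trans-unit>
</body>"""

merged = merge_back(SKELETON, TRANSLATED, "sv-SE")
print(ET.tostring(merged["3"], encoding="unicode"))
```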

This approach has the benefit of requiring minimal or possibly no change 
to the existing excellent XLIFF specification.

Hope this helps kick off the thread.

Regards,

AZ

-- 


email - azydron@xml-intl.com
smail - Mr. A.Zydron
         24 Maybrook Gardens,
         High Wycombe,
         Bucks HP13 6PJ
Mobile +(44) 7966 477181
FAX    +(44) 870 831 8868
www - http://www.xml-intl.com
