Subject: Segmentation and filters
From: Andrzej Zydron <azydron@xml-intl.com>
To: xliff-seg@lists.oasis-open.org
Date: Tue, 23 Mar 2004 22:06:39 +0000
Hi,

First of all I would like to thank Magnus for the hard work he has put 
in so far and the detailed document that he has prepared. This has 
provided a clear starting point for further discussions.

To kick off this thread I would like to state my views on the 
segmentation issue:

1) Segmentation within XLIFF should not be mandated. It should be 
optional. There are implementations such as xml:tm where segmentation is 
done before extraction. It is also quite easy to envisage situations 
where XLIFF is the output of an existing translation workbench system 
that has already segmented and pre-matched data for sending out to a 
translator who will import it into an XLIFF-aware editing environment.

I can also see Magnus' point that quite often XLIFF will contain 
unsegmented data.

One solution would be to provide an optional "segmented" attribute at 
the <file> element level which states that the data has already been 
segmented, with a default value of "false". If the data has been 
segmented, then an xlink attribute pointing to the SRX URL could also be 
provided.
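
A minimal sketch of how a consuming tool might read such a flag, in Python 
with the standard library. The "segmented" attribute is the one proposed 
above and is not part of the current XLIFF specification; using xlink:href 
as the pointer to the SRX file, the file name "doc.xml", and the function 
name segmentation_info are assumptions made for illustration only. The 
XLIFF default namespace is omitted to match the other examples in this 
message.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: "segmented" and "xlink:href" on <file> are the
# proposed attributes, not existing XLIFF attributes.
SAMPLE = """<xliff version="1.1" xmlns:xlink="http://www.w3.org/1999/xlink">
  <file original="doc.xml" source-language="en-US" datatype="xml"
        segmented="true"
        xlink:href="http://www.xml-intl.com/srx/en-US.srx">
    <body/>
  </file>
</xliff>"""

# ElementTree exposes namespaced attributes in {uri}local form.
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"


def segmentation_info(file_elem):
    """Return (is_segmented, srx_url); a missing attribute means "false"."""
    is_segmented = file_elem.get("segmented", "false") == "true"
    srx_url = file_elem.get(XLINK_HREF) if is_segmented else None
    return is_segmented, srx_url


file_elem = ET.fromstring(SAMPLE).find("file")
print(segmentation_info(file_elem))
```

Because the default is "false", existing XLIFF files that know nothing 
about the attribute would simply be treated as unsegmented.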

2) One way of handling segmentation within XLIFF is to create a 
secondary XLIFF document from the current XLIFF document that has a 
separate <trans-unit> element for each segment. This would effectively 
be a segmentation extraction of the original XLIFF document. This has 
the one significant advantage that no further extensions are required to 
the XLIFF standard. It does away with all the potential complexity of 
trying to nest <trans-unit> elements or add workable syntax to cope with 
multiple source and target segments within a <trans-unit>.

Because XLIFF is a well-defined XML format it is very easy to write an 
extraction + segmentation filter for it to provide an XLIFF file where 
the <trans-unit> elements are at the segment level, along with a 
skeleton file for merging back.

After translation you can elect to store leveraged memory at both the 
segmented and unsegmented levels.

Here is an example based on Magnus' data:

Step 1: Original XLIFF file:

<body>
   <trans-unit id="1">
     <source xml:lang="en-US">The Document Title</source>
   </trans-unit>
   <trans-unit id="2">
     <source xml:lang="en-US">First sentence. <bpt 
id="1">[ITALIC:</bpt>This is an important sentence.<ept 
id="1">]</ept></source>
   </trans-unit>
   <trans-unit id="3">
     <source xml:lang="en-US">Ambiguous sentence. More <bpt 
id="1">[LINK-to-toc:</bpt>content<ept id="1">]</ept>.</source>
   </trans-unit>
</body>

Step 2: Introduce segmentation markup, in its own namespace, into the 
XLIFF file:

<body xmlns:tm="http://www.xml-intl.com/dtd/tm.xsd">
   <trans-unit id="1">
     <source xml:lang="en-US"><tm:tu id="1.1">The Document 
Title</tm:tu></source>
   </trans-unit>
   <trans-unit id="2">
     <source xml:lang="en-US"><tm:tu id="2.1">First sentence.</tm:tu> 
<bpt id="1">[ITALIC:</bpt><tm:tu id="2.2">This is an important 
sentence.</tm:tu><ept id="1">]</ept></source>
   </trans-unit>
   <trans-unit id="3">
     <source xml:lang="en-US"><tm:tu id="3.1">Ambiguous 
sentence.</tm:tu> <tm:tu id="3.2">More <bpt 
id="1">[LINK-to-toc:</bpt>content<ept id="1">]</ept>.</tm:tu></source>
   </trans-unit>
</body>
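
As a rough illustration of this marking step, here is a small Python 
sketch. It is not an SRX implementation: it splits on whitespace that 
follows ".", "!" or "?", which is a deliberately naive stand-in for real 
segmentation rules. It also only handles <source> elements that contain 
plain text; units with inline elements such as <bpt> and <ept> are left 
untouched. The function name mark_segments and the sample unit are 
hypothetical.

```python
import re
import xml.etree.ElementTree as ET

TM_NS = "http://www.xml-intl.com/dtd/tm.xsd"
ET.register_namespace("tm", TM_NS)  # serialize with the tm: prefix
TU = f"{{{TM_NS}}}tu"


def mark_segments(unit):
    """Wrap each sentence of a plain-text <source> in a tm:tu element."""
    source = unit.find("source")
    if len(source):
        return  # inline elements present: out of scope for this sketch
    # The capturing group keeps the separating whitespace in the result:
    # [segment, whitespace, segment, whitespace, ..., segment]
    parts = re.split(r"(?<=[.!?])(\s+)", source.text or "")
    source.text = None
    for number, i in enumerate(range(0, len(parts), 2), start=1):
        if not parts[i]:
            continue  # trailing empty piece after final whitespace
        tu = ET.SubElement(source, TU, {"id": f"{unit.get('id')}.{number}"})
        tu.text = parts[i]
        # Whitespace between sentences stays outside the segment.
        tu.tail = parts[i + 1] if i + 1 < len(parts) else None


unit = ET.fromstring(
    '<trans-unit id="7"><source xml:lang="en-US">'
    "First sentence. Second one.</source></trans-unit>"
)
mark_segments(unit)
print(ET.tostring(unit, encoding="unicode"))
```

The segment ids follow the same "unit.segment" pattern as in the example 
above, so each segment can be traced back to its parent <trans-unit>.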

Step 3: Using a simple XSLT transformation, create a new segmented XLIFF 
file:

<body segmented="true" srx="http://www.xml-intl.com/srx/en-US.srx">
   <trans-unit id="1.1">
     <source xml:lang="en-US">The Document Title</source>
   </trans-unit>
   <trans-unit id="2.1">
     <source xml:lang="en-US">First sentence.</source>
   </trans-unit>
   <trans-unit id="2.2">
     <source xml:lang="en-US">This is an important sentence.</source>
   </trans-unit>
   <trans-unit id="3.1">
     <source xml:lang="en-US">Ambiguous sentence.</source>
   </trans-unit>
   <trans-unit id="3.2">
     <source xml:lang="en-US">More <bpt 
id="1">[LINK-to-toc:</bpt>content<ept id="1">]</ept>.</source>
   </trans-unit>
</body>

And the skeleton file:

<body xmlns:tm="http://www.xml-intl.com/dtd/tm.xsd">
   <trans-unit id="1">
     <source xml:lang="en-US"><tm:tu id="1.1"><ext 
id="1.1"/></tm:tu></source>
   </trans-unit>
   <trans-unit id="2">
     <source xml:lang="en-US"><tm:tu id="2.1"><ext id="2.1"/></tm:tu> 
<bpt id="1">[ITALIC:</bpt><tm:tu id="2.2"><ext id="2.2"/></tm:tu><ept 
id="1">]</ept></source>
   </trans-unit>
   <trans-unit id="3">
     <source xml:lang="en-US"><tm:tu id="3.1"><ext id="3.1"/></tm:tu> 
<tm:tu id="3.2"><ext id="3.2"/></tm:tu></source>
   </trans-unit>
</body>
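
The extraction into a segmented file plus skeleton can be sketched as 
follows. This uses Python rather than the XSLT mentioned above, purely to 
keep the example self-contained and runnable; it is an illustration under 
assumptions, not a reference implementation. It assumes every tm:tu 
carries a unique id, and it omits the srx attribute. The function name 
extract is hypothetical, and the input is trans-unit 3 from the example.

```python
import copy
import xml.etree.ElementTree as ET

TM_NS = "http://www.xml-intl.com/dtd/tm.xsd"
ET.register_namespace("tm", TM_NS)  # serialize with the tm: prefix
TU = f"{{{TM_NS}}}tu"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

MARKED = """<body xmlns:tm="http://www.xml-intl.com/dtd/tm.xsd">
<trans-unit id="3">
<source xml:lang="en-US"><tm:tu id="3.1">Ambiguous sentence.</tm:tu> <tm:tu id="3.2">More <bpt id="1">[LINK-to-toc:</bpt>content<ept id="1">]</ept>.</tm:tu></source>
</trans-unit>
</body>"""


def extract(body):
    """Split a tm:tu-marked <body> into (segmented_body, skeleton_body)."""
    skeleton = copy.deepcopy(body)
    segmented = ET.Element("body", {"segmented": "true"})
    for source in list(skeleton.iter("source")):
        lang = source.get(XML_LANG)
        for tu in list(source.iter(TU)):
            seg_id = tu.get("id")
            unit = ET.SubElement(segmented, "trans-unit", {"id": seg_id})
            seg_source = ET.SubElement(unit, "source")
            if lang:
                seg_source.set(XML_LANG, lang)
            # Move the segment's text and inline children to the new unit.
            seg_source.text = tu.text
            for child in list(tu):
                tu.remove(child)
                seg_source.append(child)
            # Leave a placeholder with the same id in the skeleton.
            tu.text = None
            ET.SubElement(tu, "ext", {"id": seg_id})
    return segmented, skeleton


segmented, skeleton = extract(ET.fromstring(MARKED))
print(ET.tostring(segmented, encoding="unicode"))
print(ET.tostring(skeleton, encoding="unicode"))
```

Anything that sits outside a tm:tu, such as the whitespace between two 
segments, stays in the skeleton, which is what allows the original 
structure to be restored later.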

Step 4: Put the segmented XLIFF file through whatever matching process 
you want, to produce:

<body segmented="true" srx="http://www.xml-intl.com/srx/en-US.srx">
   <trans-unit id="1.1">
     <source xml:lang="en-US">The Document Title</source>
     <target xml:lang="sv-SE" state="translated" 
state-qualifier="leveraged-tm">Dokumentrubriken</target>
   </trans-unit>
   <trans-unit id="2.1">
     <source xml:lang="en-US">First sentence.</source>
     <target xml:lang="sv-SE" state="translated" 
state-qualifier="leveraged-tm">Första meningen.</target>
   </trans-unit>
   <trans-unit id="2.2">
     <source xml:lang="en-US">This is an important sentence.</source>
     <alt-trans origin="translation memory" match-quality="80%">
       <source xml:lang="en-US">This is an extremely important 
sentence.</source>
       <target xml:lang="sv-SE">En mycket viktig mening.</target>
     </alt-trans>
   </trans-unit>
   <trans-unit id="3.1">
     <source xml:lang="en-US">Ambiguous sentence.</source>
     <target xml:lang="sv-SE" state="needs-review-translation">Omstridd 
mening.</target>
     <note annotates="target" from="Swedish Translator">This 
translation may not be appropriate. Please evaluate it carefully!</note>
   </trans-unit>
   <trans-unit id="3.2">
     <source xml:lang="en-US">More <bpt 
id="1">[LINK-to-toc:</bpt>content<ept id="1">]</ept>.</source>
     <target xml:lang="sv-SE" state="translated">Ytterligare <bpt 
id="1">[LINK-to-toc:</bpt>innehåll<ept id="1">]</ept>.</target>
   </trans-unit>
</body>

Step 5: Using nothing more than XSLT, merge the translated segments back 
into the skeleton, then strip out the segmentation namespace elements 
using another simple XSLT transformation. You arrive at a translated 
XLIFF file with the same unsegmented structure as the original 
source-language file.
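
To make the merge concrete, here is a sketch in Python (again instead of 
XSLT, only so that it runs on its own). It fills each <ext> placeholder 
in the skeleton from the translated segments and then removes the tm:tu 
wrappers. Several things here are assumptions rather than part of the 
proposal: tm:tu elements are direct children of <source>, the merged 
unit gets both a restored <source> and a new <target>, and a segment 
with no <target> falls back to its source text. All function names are 
hypothetical.

```python
import copy
import xml.etree.ElementTree as ET

TM_NS = "http://www.xml-intl.com/dtd/tm.xsd"
TU = f"{{{TM_NS}}}tu"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

SKELETON = """<body xmlns:tm="http://www.xml-intl.com/dtd/tm.xsd">
<trans-unit id="3">
<source xml:lang="en-US"><tm:tu id="3.1"><ext id="3.1"/></tm:tu> <tm:tu id="3.2"><ext id="3.2"/></tm:tu></source>
</trans-unit>
</body>"""

TRANSLATED = """<body>
<trans-unit id="3.1">
<source xml:lang="en-US">Ambiguous sentence.</source>
<target xml:lang="sv-SE">Omstridd mening.</target>
</trans-unit>
<trans-unit id="3.2">
<source xml:lang="en-US">More <bpt id="1">[LINK-to-toc:</bpt>content<ept id="1">]</ept>.</source>
<target xml:lang="sv-SE">Ytterligare <bpt id="1">[LINK-to-toc:</bpt>innehåll<ept id="1">]</ept>.</target>
</trans-unit>
</body>"""


def replace_with_content(parent, elem, donor):
    """Replace elem inside parent with donor's text and children."""
    index = list(parent).index(elem)
    children = [copy.deepcopy(child) for child in donor]
    lead, trail = donor.text or "", elem.tail or ""
    parent.remove(elem)

    def add_before_index(text):
        # Text before position `index` lives in parent.text or in the
        # tail of the preceding sibling.
        if index == 0:
            parent.text = (parent.text or "") + text
        else:
            prev = parent[index - 1]
            prev.tail = (prev.tail or "") + text

    add_before_index(lead)
    for offset, child in enumerate(children):
        parent.insert(index + offset, child)
    if children:
        children[-1].tail = (children[-1].tail or "") + trail
    else:
        add_before_index(trail)


def fill(skeleton_source, segments, tag):
    """Copy a skeleton <source> as <tag> with placeholders resolved."""
    filled = copy.deepcopy(skeleton_source)
    filled.tag = tag
    for tu in filled.findall(TU):  # assumes tm:tu are direct children
        ext = tu.find("ext")
        replace_with_content(tu, ext, segments[ext.get("id")])
        replace_with_content(filled, tu, tu)  # strip the tm:tu wrapper
    return filled


def merge(skeleton, translated, target_lang):
    """Rebuild unsegmented trans-units from skeleton plus segments."""
    sources, targets = {}, {}
    for unit in translated.iter("trans-unit"):
        source, target = unit.find("source"), unit.find("target")
        sources[unit.get("id")] = source
        # Assumption: untranslated segments fall back to the source.
        targets[unit.get("id")] = target if target is not None else source
    merged = copy.deepcopy(skeleton)
    for unit in list(merged.iter("trans-unit")):
        skeleton_source = unit.find("source")
        unit.remove(skeleton_source)
        unit.append(fill(skeleton_source, sources, "source"))
        target = fill(skeleton_source, targets, "target")
        target.set(XML_LANG, target_lang)
        unit.append(target)
    return merged


merged = merge(ET.fromstring(SKELETON), ET.fromstring(TRANSLATED), "sv-SE")
print(ET.tostring(merged, encoding="unicode"))
```

Segment-level attributes such as state, and elements such as <alt-trans> 
and <note>, are ignored in this sketch; how they should be carried back 
to the unsegmented level would need to be decided separately.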

This approach has the benefit of requiring minimal or possibly no change 
to the existing excellent XLIFF specification.

Hope this helps kick off the thread.

Regards,

AZ

-- 


email - azydron@xml-intl.com
smail - Mr. A.Zydron
         24 Maybrook Gardens,
         High Wycombe,
         Bucks HP13 6PJ
Mobile +(44) 7966 477181
FAX    +(44) 870 831 8868
www - http://www.xml-intl.com
