xliff-seg message

Subject: Re: [xliff-seg] Segmentation and filters
From: Andrzej Zydron <azydron@xml-intl.com>
To: xliff-seg@lists.oasis-open.org
Date: Sat, 27 Mar 2004 19:07:22 +0000

Hi Magnus,

Thank you for your reply. It looks as if we are in total agreement on 
point 1).

As to point 2) thank you for pointing out the potential problems involved with
segmentation which can arise where:

a) The initial segmentation was incorrect.
b) The translation requires that more than one segment is rendered in a single 
segment in the target language.

It would be very helpful if we could agree that these two cases cover this 
type of problem completely. In my experience they do.

From the semantic point of view a common term such as "merged segments" would 
allow us to put these into a single common category.

There is another category which I have come across, where no equivalent 
translation is possible for a given segment because, for example, the segment 
relates to the address of a particular subsidiary that just does not exist. In 
xml:tm these are termed "no-equivalent translation available" or "no-equiv" 
for short.

Neither instance need pose any problem during the merging process. The target 
element for the first trans-unit contains the translation, while the following 
trans-units are flagged as being merged. Using your example, the effect is as 
follows:

<trans-unit id="1.1">
   <source xml:lang="en-US">Long sentence.</source>
   <target xml:lang="sv-SE" state="translated">Lång mening. Mer mening. Kort
mening.</target>
</trans-unit>
<trans-unit id="1.2">
   <source xml:lang="en-US">Short sentence.</source>
   <target xml:lang="sv-SE" state="merged"/>
</trans-unit>

The XLIFF target "state" attribute would require the addition of a "merged" 
value, or alternatively an attribute other than state could be used. You can now 
use this information when loading your leveraged memory, which will then know 
that if "Long sentence." is followed by "Short sentence." the equivalent 
translated text is "Lång mening. Mer mening. Kort mening." and requires merging. 
There is no restriction on the number of merged segments. A merged segment 
always relates to the last non-merged segment that precedes it.
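
As a rough illustration of the loading step, here is a minimal Python sketch. 
It assumes the proposed "merged" state value (which is not part of the current 
XLIFF specification), and the function name tm_entries and the sample data are 
purely hypothetical; it is not how any particular translation memory tool works.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; "merged" is the state value proposed in this message.
SEGMENTED = """<body>
<trans-unit id="1.1">
  <source xml:lang="en-US">Long sentence.</source>
  <target xml:lang="sv-SE" state="translated">Lång mening. Mer mening. Kort mening.</target>
</trans-unit>
<trans-unit id="1.2">
  <source xml:lang="en-US">Short sentence.</source>
  <target xml:lang="sv-SE" state="merged"/>
</trans-unit>
</body>"""


def tm_entries(body_xml):
    """Return a list of (source_segments, target_text) pairs.

    A trans-unit whose target is flagged state="merged" contributes its
    source to the preceding non-merged entry instead of starting a new one.
    """
    entries = []
    for unit in ET.fromstring(body_xml).iter("trans-unit"):
        source = unit.findtext("source", default="").strip()
        target = unit.find("target")
        if target is not None and target.get("state") == "merged" and entries:
            # Fold this source segment into the preceding non-merged entry.
            entries[-1][0].append(source)
        else:
            text = (target.text or "").strip() if target is not None else ""
            entries.append(([source], text))
    return entries


if __name__ == "__main__":
    for sources, target in tm_entries(SEGMENTED):
        print(sources, "=>", target)
```

Each resulting entry pairs the whole run of source segments with the single 
target text, which is the hint a memory would need to repeat the merge later.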

When the segmented XLIFF is merged back into the original XLIFF file the effect 
is as follows:

<trans-unit id="1">
   <source xml:lang="en-US"><tm:tu id="1.1">Long sentence.</tm:tu>
                            <tm:tu id="1.2">Short sentence.</tm:tu>
   </source>
   <target xml:lang="sv-SE"><tm:tu id="1.1">Lång mening. Mer mening. Kort
mening.</tm:tu>
                            <tm:tu id="1.2" flag="merged"/>
   </target>
</trans-unit>

and the stripped out namespace version will look like this:

<trans-unit id="1">
   <source xml:lang="en-US">Long sentence. Short sentence.</source>
   <target xml:lang="sv-SE">Lång mening. Mer mening. Kort mening.</target>
</trans-unit>

Which you can also load into memory.
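
To make the stripping step concrete, here is a small Python sketch of it. It 
is only an approximation under stated assumptions: the namespace URI is the one 
from the quoted example further down, the function name strip_segments is 
hypothetical, and inline elements such as bpt/ept are assumed to be absent 
(a real implementation, for instance in XSLT, would have to preserve them).

```python
import xml.etree.ElementTree as ET

# Namespace URI taken from the quoted example in this thread.
TM_NS = "http://www.xml-intl.com/dtd/tm.xsd"

MERGED = f"""<trans-unit id="1" xmlns:tm="{TM_NS}">
  <source xml:lang="en-US"><tm:tu id="1.1">Long sentence.</tm:tu>
                           <tm:tu id="1.2">Short sentence.</tm:tu>
  </source>
  <target xml:lang="sv-SE"><tm:tu id="1.1">Lång mening. Mer mening. Kort mening.</tm:tu>
                           <tm:tu id="1.2" flag="merged"/>
  </target>
</trans-unit>"""


def strip_segments(unit_xml):
    """Replace the tm:tu wrappers in <source>/<target> with their plain text.

    Assumes the segments hold text only (no inline XLIFF elements).
    """
    unit = ET.fromstring(unit_xml)
    for child in unit:
        if child.tag in ("source", "target"):
            # Collect all text in document order and normalise whitespace.
            text = " ".join("".join(child.itertext()).split())
            for seg in list(child):
                child.remove(seg)
            child.text = text
    return unit


if __name__ == "__main__":
    stripped = strip_segments(MERGED)
    print(stripped.findtext("source"))
    print(stripped.findtext("target"))
```

The empty merged segment contributes no text, so the target collapses to the 
single merged translation.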

This approach does not restrict or hinder the translation or loading of 
leveraged memories in any way. In fact it can supply the translation memory 
software with additional hints that can automatically compensate for incorrect 
segmentation should it occur again.

In a similar vein we can also handle non-equivalent segments as in:

<trans-unit id="1">
   <source xml:lang="en-US">The address of our Florida branch is 1 Manhattan 
drive, Orlando FLA123</source>
   <target xml:lang="sv-SE" state="no-equiv"/>
</trans-unit>

Although I suspect that there may well be another mechanism already built into 
XLIFF to handle this.
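
For what it is worth, a loader could treat such units along these lines. This 
is a hedged Python sketch only: "no-equiv" is the state value suggested above 
rather than an existing XLIFF value, and partition_units and the second sample 
unit are made up for the illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; "no-equiv" is the value suggested in this message.
UNITS = """<body>
<trans-unit id="1">
  <source xml:lang="en-US">The address of our Florida branch is 1 Manhattan drive, Orlando FLA123</source>
  <target xml:lang="sv-SE" state="no-equiv"/>
</trans-unit>
<trans-unit id="2">
  <source xml:lang="en-US">Short sentence.</source>
  <target xml:lang="sv-SE" state="translated">Kort mening.</target>
</trans-unit>
</body>"""


def partition_units(body_xml):
    """Split trans-units into (source, target) pairs and no-equiv unit ids."""
    pairs, skipped = [], []
    for unit in ET.fromstring(body_xml).iter("trans-unit"):
        target = unit.find("target")
        if target is not None and target.get("state") == "no-equiv":
            # Nothing to store in memory; remember the id for reporting.
            skipped.append(unit.get("id"))
            continue
        source = unit.findtext("source", default="").strip()
        text = (target.text or "").strip() if target is not None else ""
        pairs.append((source, text))
    return pairs, skipped


if __name__ == "__main__":
    print(partition_units(UNITS))
```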

XML provides such a rich vocabulary and syntax that it is very easy to overcome 
any segmentation issues in XLIFF.

Best regards,

AZ



Magnus Martikainen wrote:

> Hi all,
> 
> Thanks very much Andrzej for your clear and structured arguments and
> examples, this is very useful for further discussions on this topic.
> 
> Here are my comments to this thread:
> 
> 1) I agree that segmentation should not be mandatory. 
> However I am also of the opinion that segmentation should always be allowed,
> whether the original content was supposedly segmented during extraction or
> not. The reason is that detection of best possible segment boundaries may
> still need to be adapted to best fit the tools and the translation memories
> used during the localisation, which could use slightly different
> segmentation. (A common example would be handling of tags at sentence
> boundaries etc.) Since the goal for the user must be to achieve maximum
> leverage from their translation memory resources it may be necessary to
> adjust segmentation also in such cases.
> As a side effect of this, if we agree that we always want to "allow"
> segmentation of the content I see no need for an explicit
> segmented="true/false" attribute.
> 
> 
> 2) I can think of a couple of situations where a "double conversion" as you
> are suggesting would cause problems:
> 
> a) It applies new "hard" boundaries to the segments. Thus segmentation
> cannot be changed later, e.g. during interactive translation. Sometimes it
> is necessary or desirable to change the default segmentation to accommodate
> translation needs while working on the document. Examples include:
> - the need to adjust segmentation that has been incorrectly applied (e.g. an
> abbreviation in the middle of a sentence that has been wrongly interpreted
> by the segmentation tool as the end of that sentence).
> - the occasional need to translate two or more source sentences into one
> target language sentence for it to be a meaningful translation
> 
> 
> b) During backward conversion of the "doubly converted" XLIFF file to its
> original XLIFF format the segment boundaries are lost.
> If changes are made to the content of the XLIFF file after it has been
> converted back to its original XLIFF format it is no longer possible to get
> those changes back into the "double converted" XLIFF document, e.g. in order
> to update a translation memory with those changes. Once converted back, the
> segment boundaries in both the source and target segments are gone. (The
> source segmentation can perhaps be re-created, but it is no longer possible
> to determine the correct corresponding target segments with certainty.)
> 
> Example: If the "working" XLIFF file (the segmented version which is used to
> interact with the translation memory during translation) after translation
> contains this:
> 
> <trans-unit id="1.1">
>   <source xml:lang="en-US">Long sentence.</source>
>   <target xml:lang="sv-SE" state="translated">En lång mening.</target>
> </trans-unit>
> <trans-unit id="1.2">
>   <source xml:lang="en-US">Short sentence.</source>
>   <target xml:lang="sv-SE" state="translated">Kort mening.</target>
> </trans-unit>
> 
> Both of these trans-units belong to the same <trans-unit> in the original
> XLIFF file, and when the XLIFF file is converted back to its original XLIFF
> format it could look like this (depending on the content of the skeleton
> file):
> 
> <trans-unit id="1">
>   <source xml:lang="en-US">Long sentence. Short sentence.</source>
>   <target xml:lang="sv-SE" state="translated">En lång mening. Kort
> mening.</target>
> </trans-unit>
> 
> Now someone decides at the last minute that the translation needs to be
> changed - the long sentence is for some reason better translated as two
> sentences. This change is approved and signed-off. The XLIFF file is changed
> into:
> 
> <trans-unit id="1">
>   <source xml:lang="en-US">Long sentence. Short sentence.</source>
>   <target xml:lang="sv-SE" state="final">Lång mening. Mer mening. Kort
> mening.</target>
> </trans-unit>
> 
> Unfortunately there is no easy way to update the translation memory
> with these changes since the original segment boundaries that were used
> during translation were lost. 
> Tools can of course try to automatically "guess" the segment boundaries in
> the source and target and somehow match them up, but this is not a trivial
> task as can be seen from this example. There is no way an automatic tool can
> determine if the middle sentence in the modified target should be paired
> with the first or the last sentence (if either). Thus there is no way to
> safely update the translation memory automatically with these changes.
> 
> 
> c) The converted XLIFF file loses its "identity" or its direct connection
> with the underlying data format.
> Tools that have been developed to specifically process a particular file
> type wrapped in XLIFF cannot be used on the "converted" XLIFF file since:
> - the original skeleton is no longer available/usable
> - some of the content in the original XLIFF file (in particular tags between
> sentences) has moved into the new skeleton.
> - the new skeleton has been created with a tool and process unknown to any
> other XLIFF tools.
> 
> A typical example of a tool that would be useful to be able to run during
> the localisation process is a verification/validation tool to ascertain that
> the translated content can be converted back to a valid original format.
> Examples of validation tools:
> - Tag verification, validation against the schema, DTD, or other rules that
> the content must adhere to.
> - Length verification to ensure that translated content does not exceed
> length limitations (which may be specified explicitly in the XLIFF file).
> Both of these tasks require dealing with the underlying native data that the
> XLIFF file wraps in order to perform their jobs. Due to the reasons stated
> above the "doubly converted" XLIFF files cannot be used for this.
> 
> 
> d) An additional level of unnecessary complexity is introduced, since it is
> necessary to do an additional transformation/conversion of the XLIFF
> document before it can be processed by the filter that created it.
> In something as complex as a typical localisation project this is not a
> factor to be neglected. If an average project has 100 files translated into
> 10 languages that means an additional 1000 file conversions necessary to
> complete the project. If the workflow for this is not entirely automated it
> could mean that someone may need to use a tool to manually check each
> individual file to determine which state it is in before the files can be
> delivered or further processed.
> If other tools used in the localisation process use the same approach of
> converting the XLIFF file to a new XLIFF format the complexity is
> multiplied... All this can be avoided if we support the notion of segments
> directly in XLIFF - then the very same XLIFF file can be used in all stages
> of the process.
> 
> 
> Looking forward to further comments and discussions on this topic!
> 
> Best regards,
> Magnus Martikainen
> TRADOS Inc.
> 
> -----Original Message-----
> From: Andrzej Zydron [mailto:azydron@xml-intl.com] 
> Sent: Tuesday, March 23, 2004 2:07 PM
> To: xliff-seg@lists.oasis-open.org
> Subject: [xliff-seg] Segmentation and filters
> 
> Hi,
> 
> First of all I would like to thank Magnus for the hard work he has put 
> in so far and the detailed document that he has prepared. This has 
> provided a clear starting point for further discussions.
> 
> To kick off this thread I would like to state my views on the 
> segmentation issue:
> 
> 1) Segmentation within XLIFF should not be mandated. It should be 
> optional. There are implementations such as xml:tm where segmentation is 
> done before extraction. It is also quite easy to envisage situations 
> where XLIFF is the output of an existing translation workbench system 
> that has already segmented and pre-matched data for sending out to a 
> translator who will import it into an XLIFF aware editing environment.
> 
> I can also see Magnus' point that quite often XLIFF will contain 
> unsegmented data.
> 
> One solution would be to provide an optional "segmented" attribute at 
> the <file> element level which states that the data has already been 
> segmented, with a default value of "false". If the data has been 
> segmented then an xlink attribute to the SRX URL could also be provided.
> 
> 2) One way of handling segmentation within XLIFF is to create a 
> secondary XLIFF document from the current XLIFF document that has a 
> separate <trans-unit> element for each segment. This would effectively 
> be a segmentation extraction of the original XLIFF document. This has 
> the one significant advantage that no further extensions are required to 
> the XLIFF standard. It does away with all the potential complexity of 
> trying to nest <trans-unit> elements or add workable syntax to cope with 
> multiple source and target segments within a <trans-unit>.
> 
> Because XLIFF is a well defined XML format it is very easy to write an 
> extraction + segmentation filter for it to provide an XLIFF file where 
> the <trans-unit> elements are at the segment level, along with a 
> skeleton file for merging back.
> 
> After translation you can elect to store leveraged memory at both the 
> segmented and unsegmented levels.
> 
> Here is an example based on Magnus' data:
> 
> Step 1: Original XLIFF file:
> 
> <body>
>    <trans-unit id="1">
>      <source xml:lang="en-US">The Document Title</source>
>    </trans-unit>
>    <trans-unit id="2">
>      <source xml:lang="en-US">First sentence. <bpt 
> id="1">[ITALIC:</bpt>This is an important sentence.<ept 
> id="1">]</ept></source>
>    </trans-unit>
>    <trans-unit id="3">
>      <source xml:lang="en-US">Ambiguous sentence. More <bpt 
> id="1">[LINK-to-toc:</bpt>content<ept id="1">]</ept>.</source>
>    </trans-unit>
> </body>
> 
> Step 2: Introduce namespace segmentation into XLIFF file
> 
> <body xmlns:tm="http://www.xml-intl.com/dtd/tm.xsd">
>    <trans-unit id="1">
>      <source xml:lang="en-US"><tm:tu id="1.1">The Document 
> Title</tm:tu></source>
>    </trans-unit>
>    <trans-unit id="2">
>      <source xml:lang="en-US"><tm:tu id="2.1">First sentence.</tm:tu> 
> <bpt id="1">[ITALIC:</bpt><tm:tu id="2.2">This is an important 
> sentence.</tm:tu><ept id="1">]</ept></source>
>    </trans-unit>
>    <trans-unit id="3">
>      <source xml:lang="en-US"><tm:tu id="3.1">Ambiguous 
> sentence.</tm:tu> <tm:tu id="3.2">More <bpt 
> id="1">[LINK-to-toc:</bpt>content<ept id="1">]</ept>.</tm:tu></source>
>    </trans-unit>
> </body>
> 
> Step 3: Using a simple XSLT transformation create new segmented XLIFF file:
> 
> <body segmented="true" srx="http://www.xml-intl.com/srx/en-US.srx">
>    <trans-unit id="1.1">
>      <source xml:lang="en-US">The Document Title</source>
>    </trans-unit>
>    <trans-unit id="2.1">
>      <source xml:lang="en-US">First sentence.</source>
>    </trans-unit>
>    <trans-unit id="2.2">
>      <source xml:lang="en-US">This is an important sentence.</source>
>    </trans-unit>
>    <trans-unit id="3.1">
>      <source xml:lang="en-US">Ambiguous sentence.</source>
>    </trans-unit>
>    <trans-unit id="3.2">
>      <source xml:lang="en-US">More <bpt 
> id="1">[LINK-to-toc:</bpt>content<ept id="1">]</ept>.</source>
>    </trans-unit>
> </body>
> 
> And Skeleton file:
> 
> <body xmlns:tm="http://www.xml-intl.com/dtd/tm.xsd">
>    <trans-unit id="1">
>      <source xml:lang="en-US"><tm:tu id="1.1"><ext 
> id="1.1"/></tm:tu></source>
>    </trans-unit>
>    <trans-unit id="2">
>      <source xml:lang="en-US"><tm:tu id="2.1"><ext id="2.1"/></tm:tu> 
> <bpt id="1">[ITALIC:</bpt><tm:tu id="2.2"><ext id="2.2"/></tm:tu><ept 
> id="1">]</ept></source>
>    </trans-unit>
>    <trans-unit id="3">
>      <source xml:lang="en-US"><tm:tu id="3.1"><ext id="3.1"/></tm:tu> 
> <tm:tu id="3.2"><ext id="3.2"/></tm:tu></source>
>    </trans-unit>
> </body>
> 
> Step 3: Put segmented XLIFF file through whatever matching process you 
> want to, to produce:
> 
> <body segmented="true" srx="http://www.xml-intl.com/srx/en-US.srx">
>    <trans-unit id="1.1">
>      <source xml:lang="en-US">The Document Title</source>
>      <target xml:lang="sv-SE" state="translated" 
> state-qualifier="leveraged-tm">Dokumentrubriken</target>
>    </trans-unit>
>    <trans-unit id="2.1">
>      <source xml:lang="en-US">First sentence.</source>
>      <target xml:lang="sv-SE" state="translated" 
> state-qualifier="leveraged-tm">Första meningen.</target>
>    </trans-unit>
>    <trans-unit id="2.2">
>      <source xml:lang="en-US">This is an important sentence.</source>
>        <alt-trans origin="translation memory" match-quality="80%">
>          <source xml:lang="en-US">This is an extremely important 
> sentence.</source>
>          <target xml:lang="sv-SE">En mycket viktig mening.</target>
>        </alt-trans>
>    </trans-unit>
>    <trans-unit id="3.1">
>      <source xml:lang="en-US">Ambiguous sentence.</source>
>      <target xml:lang="sv-SE" state="needs-review-translation">Omstridd 
> mening.</target>
>        <note annotates="target" from="Swedish Translator">This 
> translation may not be appropriate. Please evaluate it carefully!</note>
>    </trans-unit>
>    <trans-unit id="3.2">
>      <source xml:lang="en-US">More <bpt 
> id="1">[LINK-to-toc:</bpt>content<ept id="1">]</ept>.</source>
>      <target xml:lang="sv-SE" state="translated">Ytterligare <bpt 
> id="1">[LINK-to-toc:</bpt>innehåll<ept id="1">]</ept>.</target>
>    </trans-unit>
> </body>
> 
> 
> Step 4: Using nothing more than XSLT, merge the translated document 
> back, then strip out the segmented namespace elements using another 
> simple XSLT transformation and you arrive at a translated XLIFF file 
> that is equal to the original source language unsegmented file.
> 
> This approach has the benefit of requiring minimal or possibly no change 
> to the existing excellent XLIFF specification.
> 
> Hope this helps kick off the thread.
> 
> Regards,
> 
> AZ
> 

-- 


email - azydron@xml-intl.com
smail - Mr. A.Zydron
         24 Maybrook Gardens,
         High Wycombe,
         Bucks HP13 6PJ
Mobile +(44) 7966 477181
FAX    +(44) 870 831 8868
www - http://www.xml-intl.com

References:
- RE: [xliff-seg] Segmentation and filters
  - From: Magnus Martikainen <magnus@trados.com>