

Subject: RE: [xliff-seg] Segmentation and filters


Hi Andrzej,

Thanks for your reply. Perhaps we are starting to reach a better
understanding of segmentation now. Let me explain:

Regarding your request for a "merged" attribute on translation units, I am
afraid that in the current version of XLIFF this would introduce
difficulties.
The problem stems from the fact that the division of the translatable
content into <trans-unit>s and skeleton is a process that XLIFF leaves
entirely up to the filter that produces the XLIFF file.
No tool that processes XLIFF files (other than the filter) can make any
assumptions about the relationship between two subsequent <trans-unit>
elements in the XLIFF file.

Thus in your example there is in fact no way any XLIFF editing tool can
determine that <trans-unit id="1.1"> and <trans-unit id="1.2"> can be merged
and translated as one piece of text. They could be totally unrelated. E.g.
the first <trans-unit> could be a document heading and the second one could
be the content of a table cell, or a call-out for an image that appears
somewhere under the heading.

Since the relationship between <trans-unit> elements is undefined by XLIFF
and hidden by the filter (in the skeleton), we can never permit two
<trans-unit> elements to be translated as one: they may in fact not be
related at all.
That is why I refer to the <trans-unit> as a "hard" segmentation boundary.
These boundaries are "set in stone" by the filter that produces the XLIFF,
because the use of the skeleton mechanism is undefined.

For it to be possible to merge two <trans-unit>s we would need to introduce
a mechanism to indicate that the boundaries between certain subsequent
<trans-unit> elements can be treated differently (i.e. as "softer"
boundaries). This is to a large extent what segmentation support in XLIFF is
all about...
In fact, an alternative to the representation of segmentation I used in my
document could be based on such a mechanism. (That is a different
discussion - let's leave it for now, until we have reached a consensus about
the need for segmentation and the different scenarios and use cases for it.)
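To make the idea concrete, here is a minimal sketch of what such a mechanism could enable. This is purely illustrative: the `soft_boundary` flag is hypothetical and exists in no XLIFF version; the point is only that a tool would merge adjacent units solely where the filter has explicitly marked the boundary as soft.

```python
# Hypothetical "soft boundary" mechanism (NOT part of XLIFF 1.x):
# the filter marks which boundaries to the previous unit are soft,
# i.e. where two units may safely be translated as one piece of text.

def merge_soft_units(units):
    """units: list of dicts with 'id', 'source', 'soft_boundary' keys.
    Merges each unit flagged soft_boundary into the preceding unit."""
    merged = []
    for unit in units:
        if merged and unit.get("soft_boundary"):
            prev = merged[-1]
            prev["id"] += "+" + unit["id"]
            prev["source"] += " " + unit["source"]
        else:
            merged.append(dict(unit))
    return merged

units = [
    {"id": "1.1", "source": "Long sentence.", "soft_boundary": False},
    {"id": "1.2", "source": "Short sentence.", "soft_boundary": True},
    # An unrelated unit (e.g. a table cell) carries no soft flag,
    # so no tool may merge it with the heading above it.
    {"id": "2", "source": "A heading.", "soft_boundary": False},
]
print(merge_soft_units(units))
```

Without the explicit flag, a conforming tool has no licence to merge anything, which is exactly the "hard boundary" situation described above.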

I hope this also clarifies a bit more why I distinguish between "hard" and
"soft" segmentation boundaries, as you mention in B) below.


A) (i) To clarify that we are indeed talking about the same issue here,
could you perhaps explain the mechanism you use to automatically transfer
<target> element specific changes in one version of XLIFF (X) into a
differently segmented, translated XLIFF file (Y) to produce an updated
version of that segmented XLIFF file (Z) as in the example I described:

(X) Updated XLIFF file in original format (this is what the localisation
agency received as update from content owner in my example):

<trans-unit id="1">
   <source xml:lang="en-US">Long sentence. Short sentence.</source>
   <target xml:lang="sv-SE" state="final">Lång mening. Mer mening. Kort
 mening.</target>
 </trans-unit>


(Y) Differently segmented XLIFF, original version, in working format (the
translation agency's working file, with segmentation corresponding to the
translation memory, as it was before it was converted to original format
and delivered to the content owner):

<trans-unit id="1.1">
   <source xml:lang="en-US">Long sentence.</source>
   <target xml:lang="sv-SE" state="translated">Lång mening.</target>
 </trans-unit>
 <trans-unit id="1.2">
   <source xml:lang="en-US">Short sentence.</source>
   <target xml:lang="sv-SE" state="translated">Kort mening.</target>
 </trans-unit>


I can see no way that this can be correctly and safely handled in an
automated fashion. The file (Z) could be either:

(Z alternative 1) The new sentence in <target> belongs to the first
translation unit:

<trans-unit id="1.1">
   <source xml:lang="en-US">Long sentence.</source>
   <target xml:lang="sv-SE" state="translated">Lång mening. Mer
mening.</target>
 </trans-unit>
 <trans-unit id="1.2">
   <source xml:lang="en-US">Short sentence.</source>
   <target xml:lang="sv-SE" state="translated">Kort mening.</target>
 </trans-unit>


Or:

(Z alternative 2) The new sentence in <target> belongs to the second
translation unit:

<trans-unit id="1.1">
   <source xml:lang="en-US">Long sentence.</source>
   <target xml:lang="sv-SE" state="translated">Lång mening. </target>
 </trans-unit>
 <trans-unit id="1.2">
   <source xml:lang="en-US">Short sentence.</source>
   <target xml:lang="sv-SE" state="translated">Kort mening. Mer
mening.</target>
 </trans-unit>

Or even something completely different. As far as I can see, neither a human
nor a computer can determine with absolute certainty how this should be
handled.
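The ambiguity can be shown mechanically. The sketch below (illustrative only: `candidate_alignments` is an invented helper, and real segmenters use SRX-style rules rather than this naive regex) enumerates every way to distribute the three target sentences, in order, over the two source segments; both of the alternatives (Z1) and (Z2) above come out as equally valid candidates.

```python
import re
from itertools import combinations

def sentences(text):
    # Naive sentence split on ". " boundaries; real tools use SRX rules.
    return re.split(r"(?<=\.)\s+", text.strip())

def candidate_alignments(target_text, n_source_segments):
    """All ways to split the target sentences, in order, into
    n_source_segments contiguous groups."""
    sents = sentences(target_text)
    cands = []
    for cut in combinations(range(1, len(sents)), n_source_segments - 1):
        bounds = (0, *cut, len(sents))
        cands.append(tuple(" ".join(sents[bounds[i]:bounds[i + 1]])
                           for i in range(n_source_segments)))
    return cands

for cand in candidate_alignments("Lång mening. Mer mening. Kort mening.", 2):
    print(cand)
```

Nothing in the data distinguishes the two candidates; only linguistic understanding of source and target could.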


A) (ii) If you convert the file to a format that is supported by the
validation tool, how would you easily handle the situation where the
validation tool also alters the file (e.g. it may have an "auto-fix" feature
for certain common problems)?


Looking forward to your comments!

Best regards,
Magnus

-----Original Message-----
From: Andrzej Zydron [mailto:azydron@xml-intl.com] 
Sent: Sunday, April 11, 2004 2:13 PM
Cc: xliff-seg@lists.oasis-open.org
Subject: Re: [xliff-seg] Segmentation and filters

Hi Magnus,

Many thanks for your reply.

The discussions to date have been very useful and have allowed me to explore
some of the issues involved more thoroughly in my own mind.

As a consequence I have one firm request of the XLIFF TC, and that is for
clarification regarding a mechanism for signifying that a translator
requires the translation of a <trans-unit> to be "merged" with one or more
preceding <trans-unit> elements.

In the following example I have extended the "translate" attribute merely to
show the type of effect required:

<trans-unit id="1.1">
<source>Badly worded source text first sentence</source>
<target>Zle wyslowione zdanie pierwsze oraz drugie jako jedno
zdanie</target>
</trans-unit>
<trans-unit id="1.2" translate="merged">
<source>Second part of badly worded sentence</source>
<target/>
</trans-unit>
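A sketch of how a tool might consume such a marker when loading a translation memory (illustrative only; the `translate="merged"` value is the extension proposed above, not part of XLIFF 1.1): a run of merged units contributes one source/target pair, whose source is the concatenation of all sources in the run and whose target sits on the first unit.

```python
# Collapse runs of merged trans-units into single TM pairs.
# The "merged" continuation flag is the proposed extension, not XLIFF 1.1.

def tm_pairs(units):
    """units: list of (source, target, is_merged_continuation) triples.
    Returns one (source, target) pair per run of merged units."""
    pairs = []
    for source, target, is_merged in units:
        if is_merged and pairs:
            prev_src, prev_tgt = pairs[-1]
            # Continuation: extend the previous pair's source text.
            pairs[-1] = (prev_src + " " + source, prev_tgt)
        else:
            pairs.append((source, target))
    return pairs

units = [
    ("Badly worded source text first sentence",
     "Zle wyslowione zdanie pierwsze oraz drugie jako jedno zdanie", False),
    ("Second part of badly worded sentence", "", True),
]
print(tm_pairs(units))
```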



A) Regarding Magnus' comments on the use of multiple converted XLIFF files:

i. I am surprised by your statement that "there is no way to automatically
transfer the changes from the XLIFF file received from the content owner
into the previously translated segmented XLIFF file". I have been doing just
this on a daily basis for the past 4 years.

ii. I must also take issue with the statement "the use of specialized tools,
e.g. to validate or adapt the underlying data in the original XLIFF format,
cannot be used directly on the segmented version of the file". It is only a
problem if you want it to be a problem. Please remember that XML is
extremely flexible. If the presence of a namespace precludes certain
activities, then you just remove it in the version of the file that is used
for validation. Alternatively, the XLIFF specification could allow for the
specific presence of the required namespace.
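Removing a foreign namespace before validation can indeed be a few lines of standard XML processing. A sketch under stated assumptions: the `urn:example:tm` namespace and `tm:tu` elements are made up for illustration, and `xml.etree.ElementTree` is used rather than any particular XLIFF toolkit.

```python
import xml.etree.ElementTree as ET

TM_NS = "{urn:example:tm}"

def strip_namespace(root, ns=TM_NS):
    """Drop elements and attributes in the given namespace, promoting the
    text of removed elements into the parent. (Assumes no tail text or
    mixed content around the namespaced children, as in this example.)"""
    for parent in root.iter():
        for child in list(parent):
            if child.tag.startswith(ns):
                parent.text = (parent.text or "") + (child.text or "")
                parent.remove(child)
        for attr in list(parent.attrib):
            if attr.startswith(ns):
                del parent.attrib[attr]
    return root

doc = ET.fromstring(
    '<source xmlns:tm="urn:example:tm">'
    '<tm:tu id="1.1">Long sentence. </tm:tu>'
    '<tm:tu id="1.2">Short sentence.</tm:tu>'
    '</source>'
)
strip_namespace(doc)
print(ET.tostring(doc, encoding="unicode"))
# The namespaced markup is gone; the plain text content remains.
```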


B) Regarding this point:

I would like to posit that there is no such thing as "soft" or "hard"
segmentation. This artificial taxonomy merely serves to complicate the
discussion. There is just segmentation. You may wish to disregard the
original segmentation or preserve it, but in the end it is all segmentation
and there is nothing "soft" or "hard" about it.

Best Regards,

AZ



Magnus Martikainen wrote:

> Hi Andrzej,
> 
> Thank you for your email - I really appreciate you taking time on this.
> 
> A) Regarding the use of multiple converted XLIFF files:
> 
> I am not sure I fully understand what you are saying, so please correct me
> if I'm wrong.
> 
> I assume that what you are suggesting is that the segmented (i.e. double
> XLIFF converted) file should be "the" XLIFF file from this point on, and
> that it is never again necessary to work on the original XLIFF format?
> 
> I can see a couple of important issues with that. For example consider
> this scenario:
> - The original XLIFF file is produced by the content owner and handed off
> to a localisation agency for translation.
> - The localisation agency segments the XLIFF file in order to achieve
> maximum reuse from the translation memory.
> - The segmented XLIFF file is processed and translated, and is ready to be
> handed off to the content owner.
> 
> The content owner would clearly expect to receive an XLIFF file of the
> same type that they handed off to translation. If they receive the
> segmented XLIFF file they may not even have tools to be able to get their
> content back into the system it originates from.
> 
> Thus, in order to deliver the XLIFF file to the content owner the
> localisation agency must convert it back to its original format.
> 
> We now have a situation where for the content owner "the" XLIFF file is
> not the segmented XLIFF file, but rather the file they received back from
> the localisation agency, while for the localisation agency the segmented
> XLIFF file is still "the" XLIFF file (since it corresponds to the
> segmentation used in the translation memory).
> 
> Assume that the content owner needs to make changes to the translated
> XLIFF file for one reason or another. It is clearly desirable for the
> localisation agency to update their linguistic assets with these changes
> so that the same changes never need to be made again. For this reason the
> content owner sends the updated XLIFF file to the localisation agency.
> 
> In this situation there is no way to automatically transfer the changes
> from the XLIFF file received from the content owner into the previously
> translated segmented XLIFF file (assuming the localisation agency has kept
> it). This fact is independent of whether the data is in XML or any other
> format. The problem is that linguistic knowledge is required to safely and
> correctly identify source and target translation pairs from two bodies of
> text. This we can never achieve with XML transformations...
> 
> 
> Another important issue that I previously pointed out is that the use of
> specialised tools, e.g. to validate or adapt the underlying data in the
> original XLIFF format, cannot be used directly on the segmented version of
> the file. If the content owner provides such a tool to the localisation
> agency it may not be easy for them to use it.
> 
> 
> All of these problems go away if we can keep the data in one XLIFF format
> throughout the localisation process. This reduces complexity immensely for
> the whole process.
> 
> 
> B) Regarding your additional comments:
> 
> 1) I fully and completely agree with you that there will always be cases
> where segmentation must be adjusted or adapted afterwards.
> In fact that is the major reason that I suggest treating the segment
> boundaries as "soft" boundaries inside a <trans-unit>, as opposed to the
> "hard" boundaries of the <trans-unit>.
> As you correctly observe, the segment boundaries can be reconfigured by
> any tool that processes the files. That is the whole point, and that is
> why I refer to them as "soft" boundaries.
> The difference between changing segmentation for <trans-unit> and "soft"
> segments is that the latter does not require the creation of a new XLIFF
> format. Any tools for processing the original XLIFF format can still be
> used, even though the linguistic content has been segmented.
> 
> 2) The document introduces the <segment> element as one suggestion on how
> the segmentation mechanism could be implemented in XLIFF - simply because
> it would be nearly impossible to give any examples if I did not choose one
> way to express the segmentation. However I hope I also made it clear that
> this is just one of the possibilities we have for implementation; there
> are many other options which we should carefully consider and discuss once
> we have all agreed that there is an actual need for it.
> 
> My suggestion would be to focus the discussion on the "why" until we reach
> consensus and then look closer into the "how". 
> 
> Thanks again for taking your time on this.
> 
> Cheers,
> Magnus
> 
> 
> -----Original Message-----
> From: Andrzej Zydron [mailto:azydron@xml-intl.com] 
> Sent: Thursday, April 01, 2004 12:32 PM
> To: xliff-seg@lists.oasis-open.org
> Subject: Re: [xliff-seg] Segmentation and filters
> 
> Hi Magnus,
> 
> Thank you for your reply. Sorry it has taken me a few days to reply, but I
> have been very busy and the issues raised are quite complex.
> 
> Having spent a long time analyzing the contents of your email I still do
> not quite understand your statement "- It is NOT possible to automatically
> identify the translations for each of the source language segments in a
> way that is guaranteed to be correct.": We have the alignment in detail in
> the segmented XLIFF file. We also have it in the merged segmented
> namespace version of the original XLIFF file.
> 
> I think (correct me if I am wrong) that you are disregarding the segmented
> XLIFF file completely from the equation as concerns building leveraged
> memory. This would be wrong - the segmented XLIFF file is in fact "the"
> XLIFF file. The original version is merely an XML document that we have
> extracted from.
> 
> What you must not lose sight of is that an XLIFF file is an XML file and
> can be treated just as any XML file can.
> 
> I have also reread your original HTML document in detail and have the
> following observations:
> 
> 1) I fail to see any real difference between the so-called "hard
> boundaries" and what is being proposed. I think there is a danger here of
> confusing a standard with a methodology. This confusion exists at two
> levels:
> 
> i. At the standard level any element that is permissible at a given point
> in the DOM can be used. So if you allow <segment> as a child of
> <trans-unit> there is no control over who will use it and how.
> ii. At the application level the document seems to imply that your way of
> doing segmentation is always going to be without any need for correction.
> This is not stated in so many words, which is what makes matters more
> confusing, but it is what I have inferred from the fact that no corrective
> mechanism for segmentation errors has been provided. Segmentation errors
> (just like death and taxes) are one of the few certainties in life
> because:
> 
> a) No segmentation algorithm, however well conceived, can account for all
> eventualities. There is always a possibility of a new set of circumstances
> which will result in incorrect segmentation. It is the nature of the
> infinite possible combinations of words/acronyms/abbreviations etc. that
> occur in translatable text.
> b) No segmentation algorithm can account for badly authored source text
> that defies simple segment-for-segment translation.
> c) There is no control over how the <segment> element will be used. You
> cannot mandate any single algorithm, which in any event will by its very
> nature never be totally correct.
> d) Once <segment> elements are introduced they will always become "hard
> boundaries", because the segmented XLIFF file may be sent to another
> supplier. One cannot assume that this will always occur within one system.
> 
> So at the very least the <segment> element will require a corrective
> mechanism for merging <segment> elements.
> 
> The problem with the introduction of the <segment> element is that it
> imposes a compatibility problem for companies who have already written
> software to handle XLIFF files. The relationship of the <trans-unit> to
> <source> and <target> elements is an important one which many companies
> have written into their software. With the advent of <segment> elements
> this relationship becomes much more complex.
> 
> For me personally it is not such an issue as I would just do a secondary
> extraction anyway.
> 
> 
> Best Regards,
> 
> AZ
> 
> 
>>Hi Andrzej,
>>
>>Thank you for your reply. I'm afraid my explanation of the issue with
>>identifying segment boundaries after backward conversion of a "double
>>converted" XLIFF file may not have been clear enough. I believe I confused
>>matters by introducing unnecessary complexity where a single source
>>language sentence is translated as more than one sentence.
>>
>>My intention was to illustrate the fact that once the "double converted"
>>XLIFF file has been converted back to its initial XLIFF format there is no
>>way to safely bring it back to its state as a "double converted" XLIFF
>>file again. This may be necessary e.g. in order to update a translation
>>memory with changes made after the backward conversion.
>>
>>Let me simplify my original example to demonstrate the issue.
>>
>>Assume that the translated "double converted" XLIFF file looks like in the
>>original example:
>>
>> <trans-unit id="1.1">
>>   <source xml:lang="en-US">Long sentence.</source>
>>   <target xml:lang="sv-SE" state="translated">En lång mening.</target>
>> </trans-unit>
>> <trans-unit id="1.2">
>>   <source xml:lang="en-US">Short sentence.</source>
>>   <target xml:lang="sv-SE" state="translated">Kort mening.</target>
>> </trans-unit>
>>
>>The reason we segmented the file was to achieve the best possible
>>recycling of previous translations from a translation memory, so each
>>trans-unit now perfectly matches the segmentation used in the TM.
>>When translation is finished the translation memory will contain one
>>translation unit for each of these segments. If exported to TMX it could
>>look like this:
>>
>><tmx ...>
>>...
>><body>
>>  <tu id="1">
>>    <tuv lang="EN-US">
>>      <seg>Long sentence.</seg>
>>    </tuv>
>>    <tuv lang="SV-SE">
>>      <seg>En lång mening.</seg>
>>    </tuv>
>>  </tu>
>>
>>  <tu id="2">
>>    <tuv lang="EN-US">
>>      <seg>Short sentence.</seg>
>>    </tuv>
>>    <tuv lang="SV-SE">
>>      <seg>Kort mening.</seg>
>>    </tuv>
>>  </tu>
>></body>
>></tmx>
>>
>>The translations can be changed as part of the editing and proof-reading
>>process, and it is obviously desirable to update the translation memory
>>with any such changes. This is easy as long as the file remains in this
>>"double converted" format, since each <trans-unit> corresponds to a single
>>translation unit in the translation memory. A tool can simply iterate over
>>each segment in the XLIFF file, find it in the translation memory and
>>update the corresponding translation.
>>
>>At some point the "double converted" XLIFF file must be converted back to
>>its original XLIFF format. In this example it will look like this:
>>
>> <trans-unit id="1">
>>   <source xml:lang="en-US">Long sentence. Short sentence.</source>
>>   <target xml:lang="sv-SE" state="translated">En lång mening. Kort
>>mening.</target>
>> </trans-unit>
>>
>>Assuming that changes to the translation are needed, and the file is
>>updated to look like this:
>>
>><trans-unit id="1">
>>   <source xml:lang="en-US">Long sentence. Short sentence.</source>
>>   <target xml:lang="sv-SE" state="translated">Lång mening. Kort
>>mening.</target>
>> </trans-unit>
>>
>>Obviously it is still desirable to also update the translation memory with
>>these changes, to avoid having to correct the same translation again in
>>the future. This is where we run into problems.
>>- In order to update the translation memory we need to convert the source
>>and target of the <trans-unit> into text segmented according to the rules
>>used for the translation memory.
>>
>>Unfortunately this is no longer possible.
>>- It is possible to reconstruct the segmented source content, by simply
>>applying the same algorithm that was used to create the original version
>>of the "double converted" XLIFF file.
>>- It is NOT possible to automatically identify the translations for each
>>of the source language segments in a way that is guaranteed to be correct.
>>This requires an alignment process, which in order to succeed requires a
>>linguistic understanding of both the source and the target languages.
>>There may be more than one way to divide the target content into
>>translations matching the source segments, e.g. if a sentence with new
>>information has been introduced in the target. In such a case it would
>>even be impossible for a human to ensure that the source and target
>>content is correctly matched up in segments that correspond to the
>>translation memory.
>>
>>In the example above it looks easy to identify the corresponding target
>>language segments as the individual sentences:
>>
>><trans-unit id="1.1">
>>   <source xml:lang="en-US">Long sentence.</source>
>>   <target xml:lang="sv-SE" state="translated">Lång mening.</target>
>> </trans-unit>
>> <trans-unit id="1.2">
>>   <source xml:lang="en-US">Short sentence.</source>
>>   <target xml:lang="sv-SE" state="translated">Kort mening.</target>
>> </trans-unit>
>>
>>However we must not forget that this involves a process of "guessing"
>>which parts of the source and target belong together, as explained above.
>>It was to illustrate the problem with the "guessing" that in my original
>>example I introduced a change where the long sentence had been translated
>>into two target language sentences:
>>
>> <trans-unit id="1">
>>   <source xml:lang="en-US">Long sentence. Short sentence.</source>
>>   <target xml:lang="sv-SE" state="final">Lång mening. Mer mening. Kort
>> mening.</target>
>> </trans-unit>
>>
>>Here a tool that relies on matching up sentences between the source and
>>the target would have several options. From the tool's perspective the
>>correct segmentation could be either:
>>
>><trans-unit id="1.1">
>>   <source xml:lang="en-US">Long sentence.</source>
>>   <target xml:lang="sv-SE" state="translated">Lång mening. Mer
>>mening.</target>
>> </trans-unit>
>> <trans-unit id="1.2">
>>   <source xml:lang="en-US">Short sentence.</source>
>>   <target xml:lang="sv-SE" state="translated">Kort mening.</target>
>> </trans-unit>
>>
>>Or:
>>
>><trans-unit id="1.1">
>>   <source xml:lang="en-US">Long sentence.</source>
>>   <target xml:lang="sv-SE" state="translated">Lång mening. </target>
>> </trans-unit>
>> <trans-unit id="1.2">
>>   <source xml:lang="en-US">Short sentence.</source>
>>   <target xml:lang="sv-SE" state="translated">Kort mening. Mer
>>mening.</target>
>> </trans-unit>
>>
>>Without understanding the source and target languages there is no way the
>>tool can know which of these options is correct.
>>
>>I hope this clarifies matters a bit.
>>
>>Best regards,
>>Magnus
>>
>>-----Original Message-----
>>From: Andrzej Zydron [mailto:azydron@xml-intl.com] 
>>Sent: Saturday, March 27, 2004 11:07 AM
>>To: xliff-seg@lists.oasis-open.org
>>Subject: Re: [xliff-seg] Segmentation and filters
>>
>>Hi Magnus,
>>
>>Thank you for your reply. It looks as if we are in total agreement on
>>point 1).
>>
>>As to point 2), thank you for pointing out the potential problems involved
>>with segmentation, which can arise where:
>>
>>a) The initial segmentation was incorrect.
>>b) The translation requires that more than one segment is rendered as a
>>single segment in the target language.
>>
>>It would be very positive if we could agree that these two occurrences are
>>definitive regarding this type of problem. In my experience they are.
>>
>> From the semantic point of view a common term such as "merged segments"
>>would allow us to put these into a single common category.
>>
>>There is another category which I have come across, where there is no
>>equivalent translation possible for a given segment, because for example
>>the segment relates to a particular address of a subsidiary that just does
>>not exist. In xml:tm these are termed "no-equivalent translation
>>available" or "no-equiv" (for short).
>>
>>Both instances need not pose any problem during the merging process. The
>>target element for the first trans-unit contains the translation, while
>>the following trans-units are flagged as being merged. Using your example
>>the effect is as follows:
>>
>><trans-unit id="1.1">
>>   <source xml:lang="en-US">Long sentence.</source>
>>   <target xml:lang="sv-SE" state="translated">Lång mening. Mer mening.
>>Kort mening.</target>
>></trans-unit>
>><trans-unit id="1.2">
>>   <source xml:lang="en-US">Short sentence.</source>
>>   <target xml:lang="sv-SE" state="merged"/>
>></trans-unit>
>>
>>The XLIFF target "state" attribute would require the addition of the
>>"merged" 
>>value, or another attribute other than state can be used. You can now use
>>this 
>>intelligence to load your leveraged memory that will know that if "Long 
>>sentence." is followed by "Short sentence." the equivalent translated text
>>is 
>>"Lång mening. Mer mening. Kort mening." and requires merging. There is no 
>>restriction on the number of merged segments. A merged segment always
>>relates to 
>>the final non-merged segment.
>>
>>When the segmented XLIFF is merged back into the original XLIFF file the
>>effect 
>>is as follows:
>>
>><trans-unit id="1">
>>   <source xml:lang="en-US"><tm:tu id="1.1">Long sentence.</tm:tu>
>>                            <tm:tu id="1.2">Short sentence.</tm:tu>
>>   </source>
>>   <target xml:lang="sv-SE"><tm:tu id="1.1">Lång mening. Mer mening. Kort
>>mening.</tm:tu>
>>                            <tm:tu id="1.2" flag="merged"/>
>>   </target>
>></trans-unit>
>>
>>and the stripped out namespace version will look like this:
>>
>><trans-unit id="1">
>>   <source xml:lang="en-US">Long sentence. Short sentence.</source>
>>   <target xml:lang="sv-SE">Lång mening. Mer mening. Kort mening.</target>
>></trans-unit>
>>
>>Which you can also load into memory.
>>
>>This approach does not restrict or hinder the translation or loading of
>>leveraged memories in any way. In fact it can supply the translation
>>memory software with additional hints that can automatically compensate
>>for incorrect segmentation should it occur again.
>>
>>In a similar vein we can also handle non-equivalent segments as in:
>>
>><trans-unit id="1">
>>   <source xml:lang="en-US">The address of our Florida branch is 1
>>Manhattan drive, Orlando FLA123</source>
>>   <target xml:lang="sv-SE" state="no-equiv"/>
>></trans-unit>
>>
>>Although I suspect that there may well be another mechanism already built
>>into XLIFF to handle this.
>>
>>XML provides such a rich vocabulary and syntax that it is very easy to
>>overcome any segmentation issues in XLIFF.
>>
>>Best regards,
>>
>>AZ
>>
>>
>>
>>Magnus Martikainen wrote:
>>
>>
>>
>>>Hi all,
>>>
>>>Thanks very much Andrzej for your clear and structured arguments and
>>>examples, this is very useful for further discussions on this topic.
>>>
>>>Here are my comments to this thread:
>>>
>>>1) I agree that segmentation should not be mandatory. 
>>>However I am also of the opinion that segmentation should always be
>>>allowed, whether the original content was supposedly segmented during
>>>extraction or not. The reason is that detection of the best possible
>>>segment boundaries may still need to be adapted to best fit the tools and
>>>the translation memories used during the localisation, which could use
>>>slightly different segmentation. (A common example would be the handling
>>>of tags at sentence boundaries etc.) Since the goal for the user must be
>>>to achieve maximum leverage from their translation memory resources it
>>>may be necessary to adjust segmentation in such cases as well.
>>>As a side effect of this, if we agree that we always want to "allow"
>>>segmentation of the content I see no need for an explicit
>>>segmented="true/false" attribute.
>>>
>>>
>>>2) I can think of a couple of situations where a "double conversion" as
>>>you are suggesting would cause problems:
>>>
>>>a) It applies new "hard" boundaries to the segments. Thus segmentation
>>>cannot be changed later, e.g. during interactive translation. Sometimes
>>>it is necessary or desirable to change the default segmentation to
>>>accommodate translation needs while working on the document. Examples
>>>include:
>>>- the need to adjust segmentation that has been incorrectly applied (e.g.
>>>an abbreviation in the middle of a sentence that has been wrongly
>>>interpreted by the segmentation tool as the end of that sentence).
>>>- the occasional need to translate two or more source sentences into one
>>>target language sentence for it to be a meaningful translation
>>>
>>>
>>>b) During backward conversion of the "doubly converted" XLIFF file to its
>>>original XLIFF format the segment boundaries are lost.
>>>If changes are made to the content of the XLIFF file after it has been
>>>converted back to its original XLIFF format it is no longer possible to
>>>get those changes back into the "double converted" XLIFF document, e.g.
>>>in order to update a translation memory with those changes. Once
>>>converted back, the segment boundaries in both the source and target
>>>segments are gone. (The source segmentation can perhaps be re-created,
>>>but it is no longer possible to determine with certainty the correct
>>>corresponding target segments.)
>>>
>>>Example: If the "working" XLIFF file (the segmented version which is used
>>>to interact with the translation memory during translation) after
>>>translation contains this:
>>>
>>><trans-unit id="1.1">
>>> <source xml:lang="en-US">Long sentence.</source>
>>> <target xml:lang="sv-SE" state="translated">En lång mening.</target>
>>></trans-unit>
>>><trans-unit id="1.2">
>>> <source xml:lang="en-US">Short sentence.</source>
>>> <target xml:lang="sv-SE" state="translated">Kort mening.</target>
>>></trans-unit>
>>>
>>>Both of these trans-units belong to the same <trans-unit> in the original
>>>XLIFF file, and when the XLIFF file is converted back to its original
>>>XLIFF format it could look like this (depending on the content of the
>>>skeleton file):
>>>
>>><trans-unit id="1">
>>> <source xml:lang="en-US">Long sentence. Short sentence.</source>
>>> <target xml:lang="sv-SE" state="translated">En lång mening. Kort
>>>mening.</target>
>>></trans-unit>
>>>
>>>Now someone decides at the last minute that the translation needs to be
>>>changed - the long sentence is for some reason better translated as two
>>>sentences. This change is approved and signed off. The XLIFF file is
>>>changed into:
>>>
>>><trans-unit id="1">
>>> <source xml:lang="en-US">Long sentence. Short sentence.</source>
>>> <target xml:lang="sv-SE" state="final">Lång mening. Mer mening. Kort
>>>mening.</target>
>>></trans-unit>
>>>
>>>Unfortunately there is no easy way to update the translation memory with
>>>these changes, since the original segment boundaries that were used
>>>during translation were lost.
>>>Tools can of course try to automatically "guess" the segment boundaries
>>>in the source and target and somehow match them up, but this is not a
>>>trivial task, as can be seen from this example. There is no way an
>>>automatic tool can determine if the middle sentence in the modified
>>>target should be paired with the first or the last sentence (if either).
>>>Thus there is no way to safely update the translation memory
>>>automatically with these changes.
>>>
>>>
>>>c) The converted XLIFF file loses its "identity", or its direct
>>>connection with the underlying data format.
>>>Tools that have been developed specifically to process a particular file
>>>type wrapped in XLIFF cannot be used on the "converted" XLIFF file since:
>>>- the original skeleton is no longer available/usable
>>>- some of the content in the original XLIFF file (in particular tags
>>>between sentences) has moved into the new skeleton.
>>>- the new skeleton has been created with a tool and process unknown to
>>>any other XLIFF tools.
>>>
>>>A typical example of a tool that would be useful to run during
>>>the localisation process is a verification/validation tool to ascertain that
>>>the translated content can be converted back to a valid original format.
>>>Examples of validation tools:
>>>- Tag verification, validation against the schema, DTD, or other rules that
>>>the content must adhere to.
>>>- Length verification to ensure that translated content does not exceed
>>>length limitations (which may be specified explicitly in the XLIFF file).
>>>Both of these tasks require dealing with the underlying native data that the
>>>XLIFF file wraps in order to perform their jobs. Due to the reasons stated
>>>above the "doubly converted" XLIFF files cannot be used for this.
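[Editor's note: as an illustration of the length-verification case, here is a sketch assuming the limit is carried in a maxwidth attribute on <trans-unit> (XLIFF defines maxwidth/size-unit attributes; a real tool would also honour size-unit instead of assuming characters, and would handle the XLIFF namespace). Uses only Python's standard library XML parser.]

```python
import xml.etree.ElementTree as ET

def check_lengths(xliff_body):
    """Report trans-units whose <target> text exceeds the unit's
    maxwidth attribute (interpreted here as a character count)."""
    root = ET.fromstring(xliff_body)
    violations = []
    for tu in root.iter('trans-unit'):
        maxwidth = tu.get('maxwidth')
        target = tu.find('target')
        if maxwidth is None or target is None:
            continue
        text = ''.join(target.itertext())  # include text inside inline tags
        if len(text) > int(maxwidth):
            violations.append((tu.get('id'), len(text), int(maxwidth)))
    return violations

body = """<body>
  <trans-unit id="1" maxwidth="10" size-unit="char">
    <source>Title</source>
    <target>Dokumentrubriken</target>
  </trans-unit>
</body>"""
print(check_lengths(body))  # [('1', 16, 10)]
```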
>>>
>>>
>>>d) An additional level of unnecessary complexity is introduced, since it is
>>>necessary to do an additional transformation/conversion of the XLIFF
>>>document before it can be processed by the filter that created it.
>>>In something as complex as a typical localisation project this is not a
>>>factor to be neglected. If an average project has 100 files translated into
>>>10 languages that means an additional 1000 file conversions necessary to
>>>complete the project. If the workflow for this is not entirely automated it
>>>could mean that someone may need to use a tool to manually check each
>>>individual file to determine which state it is in before the files can be
>>>delivered or further processed.
>>>If other tools used in the localisation process use the same approach of
>>>converting the XLIFF file to a new XLIFF format the complexity is
>>>multiplied... All this can be avoided if we support the notion of segments
>>>directly in XLIFF - then the very same XLIFF file can be used in all stages
>>>of the process.
>>>
>>>
>>>Looking forward to further comments and discussions on this topic!
>>>
>>>Best regards,
>>>Magnus Martikainen
>>>TRADOS Inc.
>>>
>>>-----Original Message-----
>>>From: Andrzej Zydron [mailto:azydron@xml-intl.com] 
>>>Sent: Tuesday, March 23, 2004 2:07 PM
>>>To: xliff-seg@lists.oasis-open.org
>>>Subject: [xliff-seg] Segmentation and filters
>>>
>>>Hi,
>>>
>>>First of all I would like to thank Magnus for the hard work he has put 
>>>in so far and the detailed document that he has prepared. This has 
>>>provided a clear starting point for further discussions.
>>>
>>>To kick off this thread I would like to state my views on the 
>>>segmentation issue:
>>>
>>>1) Segmentation within XLIFF should not be mandated. It should be 
>>>optional. There are implementations such as xml:tm where segmentation is 
>>>done before extraction. It is also quite easy to envisage situations 
>>>where XLIFF is the output of an existing translation workbench system 
>>>that has already segmented and pre-matched data for sending out to a 
>>>translator who will import it into an XLIFF aware editing environment.
>>>
>>>I can also see Magnus' point that quite often XLIFF will contain 
>>>unsegmented data.
>>>
>>>One solution would be to provide an optional "segmented" attribute at 
>>>the <file> element level which states that the data has already been 
>>>segmented, with a default value of "false". If the data has been 
>>>segmented, then an xlink attribute to the SRX URL could also be provided.
>>>
>>>2) One way of handling segmentation within XLIFF is to create a 
>>>secondary XLIFF document from the current XLIFF document that has a 
>>>separate <trans-unit> element for each segment. This would effectively 
>>>be a segmentation extraction of the original XLIFF document. This has 
>>>the one significant advantage that no further extensions are required to 
>>>the XLIFF standard. It does away with all the potential complexity of 
>>>trying to nest <trans-unit> elements or add workable syntax to cope with 
>>>multiple source and target segments within a <trans-unit>.
>>>
>>>Because XLIFF is a well defined XML format it is very easy to write an 
>>>extraction + segmentation filter for it to provide an XLIFF file where 
>>>the <trans-unit> elements are at the segment level, along with a 
>>>skeleton file for merging back.
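[Editor's note: the extraction + segmentation step can be sketched as follows. This is a hypothetical text-only illustration: a real filter would keep inline tags such as <bpt>/<ept> intact, segment according to SRX rules rather than this naive punctuation split, and emit actual XLIFF and skeleton files rather than Python data structures.]

```python
import re

def segment_trans_units(units):
    """Given (id, source_text) pairs from an original XLIFF file,
    return segment-level units with dotted ids ("2.1", "2.2", ...)
    plus a skeleton map recording which segments rebuild each unit."""
    segmented, skeleton = [], {}
    for uid, text in units:
        # Naive sentence split; real tools apply SRX segmentation rules.
        sentences = [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]
        seg_ids = []
        for n, sentence in enumerate(sentences, start=1):
            seg_id = f"{uid}.{n}"
            segmented.append((seg_id, sentence))
            seg_ids.append(seg_id)
        skeleton[uid] = seg_ids
    return segmented, skeleton

units = [("1", "The Document Title"),
         ("3", "Ambiguous sentence. More content.")]
segments, skeleton = segment_trans_units(units)
print(segments)  # [('1.1', 'The Document Title'), ('3.1', 'Ambiguous sentence.'), ('3.2', 'More content.')]
print(skeleton)  # {'1': ['1.1'], '3': ['3.1', '3.2']}
```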
>>>
>>>After translation you can elect to store leveraged memory at both the 
>>>segmented and unsegmented levels.
>>>
>>>Here is an example based on Magnus' data:
>>>
>>>Step 1: Original XLIFF file:
>>>
>>><body>
>>>  <trans-unit id="1">
>>>    <source xml:lang="en-US">The Document Title</source>
>>>  </trans-unit>
>>>  <trans-unit id="2">
>>>    <source xml:lang="en-US">First sentence. <bpt 
>>>id="1">[ITALIC:</bpt>This is an important sentence.<ept 
>>>id="1">]</ept></source>
>>>  </trans-unit>
>>>  <trans-unit id="3">
>>>    <source xml:lang="en-US">Ambiguous sentence. More <bpt 
>>>id="1">[LINK-to-toc:</bpt>content<ept id="1">]</ept>.</source>
>>>  </trans-unit>
>>></body>
>>>
>>>Step 2: Introduce namespace segmentation into XLIFF file
>>>
>>><body xmlns:tm="http://www.xml-intl.com/dtd/tm.xsd">
>>>  <trans-unit id="1">
>>>    <source xml:lang="en-US"><tm:tu id="1.1">The Document 
>>>Title</tm:tu></source>
>>>  </trans-unit>
>>>  <trans-unit id="2">
>>>    <source xml:lang="en-US"><tm:tu id="2.1">First sentence.</tm:tu> 
>>><bpt id="1">[ITALIC:</bpt><tm:tu id="2.2">This is an important 
>>>sentence.</tm:tu><ept id="1">]</ept></source>
>>>  </trans-unit>
>>>  <trans-unit id="3">
>>>    <source xml:lang="en-US"><tm:tu id="3.1">Ambiguous 
>>>sentence.</tm:tu> <tm:tu id="3.2">More <bpt 
>>>id="1">[LINK-to-toc:</bpt>content<ept id="1">]</ept>.</tm:tu></source>
>>>  </trans-unit>
>>></body>
>>>
>>>Step 3: Using a simple XSLT transformation create a new segmented XLIFF
>>>file:
>>>
>>><body segmented="true" srx="http://www.xml-intl.com/srx/en-US.srx">
>>>  <trans-unit id="1.1">
>>>    <source xml:lang="en-US">The Document Title</source>
>>>  </trans-unit>
>>>  <trans-unit id="2.1">
>>>    <source xml:lang="en-US">First sentence.</source>
>>>  </trans-unit>
>>>  <trans-unit id="2.2">
>>>    <source xml:lang="en-US">This is an important sentence.</source>
>>>  </trans-unit>
>>>  <trans-unit id="3.1">
>>>    <source xml:lang="en-US">Ambiguous sentence.</source>
>>>  </trans-unit>
>>>  <trans-unit id="3.2">
>>>    <source xml:lang="en-US">More <bpt 
>>>id="1">[LINK-to-toc:</bpt>content<ept id="1">]</ept>.</source>
>>>  </trans-unit>
>>></body>
>>>
>>>And Skeleton file:
>>>
>>><body xmlns:tm="http://www.xml-intl.com/dtd/tm.xsd">
>>>  <trans-unit id="1">
>>>    <source xml:lang="en-US"><tm:tu id="1.1"><ext 
>>>id="1.1"/></tm:tu></source>
>>>  </trans-unit>
>>>  <trans-unit id="2">
>>>    <source xml:lang="en-US"><tm:tu id="2.1"><ext id="2.1"/></tm:tu> 
>>><bpt id="1">[ITALIC:</bpt><tm:tu id="2.2"><ext id="2.2"/></tm:tu><ept 
>>>id="1">]</ept></source>
>>>  </trans-unit>
>>>  <trans-unit id="3">
>>>    <source xml:lang="en-US"><tm:tu id="3.1"><ext id="3.1"/></tm:tu> 
>>><tm:tu id="3.2"><ext id="3.2"/></tm:tu></source>
>>>  </trans-unit>
>>></body>
>>>
>>>Step 4: Put the segmented XLIFF file through whatever matching process you
>>>want, to produce:
>>>
>>><body segmented="true" srx="http://www.xml-intl.com/srx/en-US.srx">
>>>  <trans-unit id="1.1">
>>>    <source xml:lang="en-US">The Document Title</source>
>>>    <target xml:lang="sv-SE" state="translated" 
>>>state-qualifier="leveraged-tm">Dokumentrubriken</target>
>>>  </trans-unit>
>>>  <trans-unit id="2.1">
>>>    <source xml:lang="en-US">First sentence.</source>
>>>    <target xml:lang="sv-SE" state="translated" 
>>>state-qualifier="leveraged-tm">Första meningen.</target>
>>>  </trans-unit>
>>>  <trans-unit id="2.2">
>>>    <source xml:lang="en-US">This is an important sentence.</source>
>>>      <alt-trans origin="translation memory" match-quality="80%">
>>>        <source xml:lang="en-US">This is an extremely important 
>>>sentence.</source>
>>>        <target xml:lang="sv-SE">En mycket viktig mening.</target>
>>>      </alt-trans>
>>>  </trans-unit>
>>>  <trans-unit id="3.1">
>>>    <source xml:lang="en-US">Ambiguous sentence.</source>
>>>    <target xml:lang="sv-SE" state="needs-review-translation">Omstridd 
>>>mening.</target>
>>>      <note annotates="target" from="Swedish Translator">This 
>>>translation may not be appropriate. Please evaluate it carefully!</note>
>>>  </trans-unit>
>>>  <trans-unit id="3.2">
>>>    <source xml:lang="en-US">More <bpt 
>>>id="1">[LINK-to-toc:</bpt>content<ept id="1">]</ept>.</source>
>>>    <target xml:lang="sv-SE" state="translated">Ytterligare <bpt 
>>>id="1">[LINK-to-toc:</bpt>innehåll<ept id="1">]</ept>.</target>
>>>  </trans-unit>
>>></body>
>>>
>>>
>>>Step 5: Using nothing more than XSLT, merge the translated document 
>>>back, then strip out the segmented namespace elements using another 
>>>simple XSLT transformation and you arrive at a translated XLIFF file 
>>>that is equal to the original source language unsegmented file.
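[Editor's note: the merge-back step can also be sketched without XSLT. A hypothetical illustration: each <ext id="..."/> placeholder in the skeleton is replaced with the translated target of the segment-level trans-unit carrying the same id; real code would operate on the full XLIFF DOM and then strip the tm:tu wrappers in a second pass.]

```python
import re

def merge_back(skeleton, targets):
    """Replace each <ext id="..."/> placeholder in the skeleton with
    the translated segment of the same id, reconstructing the
    unsegmented target content."""
    def substitute(match):
        return targets[match.group(1)]
    return re.sub(r'<ext id="([^"]+)"/>', substitute, skeleton)

skeleton = ('<tm:tu id="3.1"><ext id="3.1"/></tm:tu> '
            '<tm:tu id="3.2"><ext id="3.2"/></tm:tu>')
targets = {"3.1": "Omstridd mening.", "3.2": "Ytterligare innehåll."}
print(merge_back(skeleton, targets))
# <tm:tu id="3.1">Omstridd mening.</tm:tu> <tm:tu id="3.2">Ytterligare innehåll.</tm:tu>
```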
>>>
>>>This approach has the benefit of requiring minimal or possibly no change 
>>>to the existing excellent XLIFF specification.
>>>
>>>Hope this helps kick off the thread.
>>>
>>>Regards,
>>>
>>>AZ
>>>
-- 


email - azydron@xml-intl.com
smail - Mr. A.Zydron
         24 Maybrook Gardens,
         High Wycombe,
         Bucks HP13 6PJ
Mobile +(44) 7966 477181
FAX    +(44) 870 831 8868
www - http://www.xml-intl.com




