Subject: RE: [xliff-seg] Segmentation and filters


Hi Andrzej,

Thank you for your email - I really appreciate you taking time on this.

A) Regarding the use of multiple converted XLIFF files:

I am not sure I fully understand what you are saying, so please correct me
if I'm wrong.

I assume that what you are suggesting is that the segmented (i.e. double
XLIFF converted) file should be "the" XLIFF file from this point on, and
that it is never again necessary to work on the original XLIFF format?

I can see a couple of important issues with that. For example, consider this
scenario:
- The original XLIFF file is produced by the content owner and handed off to
a localisation agency for translation. 
- The localisation agency segments the XLIFF file in order to achieve
maximum reuse from the translation memory.
- The segmented XLIFF file is processed and translated, and is ready to be
handed off to the content owner.

The content owner would clearly expect to receive an XLIFF file of the same
type that they handed off to translation. If they receive the segmented
XLIFF file they may not even have tools to be able to get their content back
into the system it originates from. 

Thus, in order to deliver the XLIFF file to the content owner the
localisation agency must convert it back to its original format.

We now have a situation where for the content owner "the" XLIFF file is not
the segmented XLIFF file, but rather the file they received back from the
localisation agency, while for the localisation agency the segmented XLIFF
file is still "the" XLIFF file (since it corresponds to the segmentation
used in the translation memory).

Suppose that the content owner needs to make changes to the translated XLIFF
file for one reason or another. It is clearly desirable for the
localisation agency to update their linguistic assets with these changes so
that the same changes never need to be made again. For this reason the
content owner sends the updated XLIFF file to the localisation agency.

In this situation there is no way to automatically transfer the changes from
the XLIFF file received from the content owner into the previously
translated segmented XLIFF file (assuming the localisation agency has kept
it). This fact is independent of whether the data is in XML or any other
format. The problem is that linguistic knowledge is required to safely and
correctly identify source and target translation pairs from two bodies of
text. This we can never achieve with XML transformations...


Another important issue that I previously pointed out is that specialised
tools, e.g. tools to validate or adapt the underlying data in the original
XLIFF format, cannot be used directly on the segmented version of
the file. If the content owner provides such a tool to the localisation
agency it may not be easy for them to use it.


All of these problems go away if we can keep the data in one XLIFF format
throughout the localisation process. This reduces complexity immensely for
the whole process.


B) Regarding your additional comments:

1) I fully and completely agree with you that there will always be cases
where segmentation must be adjusted or adapted afterwards.
In fact that is the major reason that I suggest treating the segment
boundaries as "soft" boundaries inside a <trans-unit>, as opposed to the
"hard" boundaries of the <trans-unit>.
As you correctly observe, the segment boundaries can be reconfigured by any
tool that processes the files. That is the whole point, and that is why I
refer to them as "soft" boundaries.
The difference between changing segmentation for <trans-unit> and "soft"
segments is that the latter does not require the creation of a new XLIFF
format. Any tools for processing the original XLIFF format can still be
used, even though the linguistic content has been segmented.

2) The document introduces the <segment> element as one suggestion for how
the segmentation mechanism could be implemented in XLIFF - simply because it
would be nearly impossible to give any examples if I did not choose one way
to express the segmentation. However I hope I also made it clear that this
is just one of the possibilities we have for implementation; there are many
other options which we should carefully consider and discuss once we have
all agreed that there is an actual need for it.

My suggestion would be to focus the discussion on the "why" until we reach
consensus and then look closer into the "how". 

Thanks again for taking your time on this.

Cheers,
Magnus


-----Original Message-----
From: Andrzej Zydron [mailto:azydron@xml-intl.com] 
Sent: Thursday, April 01, 2004 12:32 PM
To: xliff-seg@lists.oasis-open.org
Subject: Re: [xliff-seg] Segmentation and filters

Hi Magnus,

Thank you for your reply. Sorry it has taken me a few days to reply, but I
have been very busy and the issues raised are quite complex.

Having spent a long time analyzing the contents of your email I still do not
quite understand your statement "- It is NOT possible to automatically
identify the translations for each of the source language segments in a way
that is guaranteed to be correct.": We have the alignment in detail in the
segmented XLIFF file. We also have it in the merged segmented namespace
version of the original XLIFF file.

I think (correct me if I am wrong) that you are disregarding the segmented
XLIFF file completely from the equation as concerns building leveraged
memory. This would be wrong - the segmented XLIFF file is in fact "the"
XLIFF file. The original version is merely an XML document that we have
extracted from.

What you must not lose sight of is that an XLIFF file is an XML file and
can be treated just as any other XML file.

I have also reread your original HTML document in detail and have the
following observations:

1) I fail to see any real difference between the so-called "hard boundaries"
and what is being proposed. I think there is a danger here of confusing a
standard with a methodology. This confusion exists at two levels:

i. At the standard level any element that is permissible at a given point
in the DOM can be used. So if you allow <segment> as a child of <trans-unit>
there is no control over who will use it, and how.
ii. At the application level the document seems to imply that your way of
doing segmentation is always going to be without any need for correction.
This is not stated in so many words, which is what makes matters more
confusing, but it is what I have inferred from the fact that no corrective
mechanism for segmentation errors has been provided. Segmentation errors
(just like death and taxes) are one of the few certainties in life because:

a) No segmentation algorithm, however well conceived, can account for all
eventualities. There is always a possibility of a new set of circumstances
which will result in incorrect segmentation. That is the nature of the
infinite possible combinations of words/acronyms/abbreviations etc. that
can occur in translatable text.
b) No segmentation algorithm can account for badly authored source text that
defies simple segment-for-segment translation.
c) There is no control over how the <segment> element will be used. You
cannot mandate any single algorithm, which in any event will by its very
nature never be totally correct.
d) Once <segment> elements are introduced they will always become "hard
boundaries", because the segmented XLIFF file may be sent to another
supplier. One cannot assume that this will always occur within one system.

So at the very least the <segment> element will require a corrective
mechanism for merging <segment> elements.
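
As a small aside illustrating point (a): even a sensible-looking splitting
rule fails on ordinary abbreviations. A hypothetical Python sketch (mine, not
part of either proposal), using a deliberately simple regular-expression rule:

```python
import re

def naive_segment(text):
    """Split after sentence-final punctuation followed by whitespace -
    a deliberately simple rule, to show how easily it breaks."""
    return re.split(r'(?<=[.!?])\s+', text.strip())

segments = naive_segment("Send the file, e.g. the XLIFF version, to the agency.")
print(segments)
# The single sentence is wrongly cut after "e.g." - a corrective
# (merge) mechanism is needed precisely for cases like this.
```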

The problem with the introduction of the <segment> element is that it
imposes a compatibility problem for companies who have already written
software to handle XLIFF files. The relationship of the <trans-unit> to
<source> and <target> elements is an important one which many companies have
written into their software. With the advent of <segment> elements this
relationship becomes much more complex.

For me personally it is not such an issue as I would just do a secondary
extraction anyway.


Best Regards,

AZ

> Hi Andrzej,
> 
> Thank you for your reply. I'm afraid my explanation of the issue with
> identifying segment boundaries after backward conversion of a "double
> converted" XLIFF file may not have been clear enough. I believe I confused
> matters by introducing unnecessary complexity where a single source
> language sentence is translated as more than one sentence.
> 
> My intention was to illustrate the fact that once the "double converted"
> XLIFF file has been converted back to its initial XLIFF format there is no
> way to safely bring it back to its state as a "double converted" XLIFF
> file again. This may be necessary e.g. in order to update a translation memory
> with changes made after the backward conversion.
> 
> Let me simplify my original example to demonstrate the issue.
> 
> Assume that the translated "double converted" XLIFF file looks like in the
> original example:
> 
>  <trans-unit id="1.1">
>    <source xml:lang="en-US">Long sentence.</source>
>    <target xml:lang="sv-SE" state="translated">En lång mening.</target>
>  </trans-unit>
>  <trans-unit id="1.2">
>    <source xml:lang="en-US">Short sentence.</source>
>    <target xml:lang="sv-SE" state="translated">Kort mening.</target>
>  </trans-unit>
> 
> The reason we segmented the file was to achieve the best possible
> recycling of previous translations from a translation memory, so each
> trans-unit now perfectly matches the segmentation used in the TM.
> When translation is finished the translation memory will contain one
> translation unit for each of these segments. If exported to TMX it could
> look like this:
> 
> <tmx ...>
> ...
> <body>
>   <tu id="1">
>     <tuv lang="EN-US">
>       <seg>Long sentence.</seg>
>     </tuv>
>     <tuv lang="SV-SE">
>       <seg>En lång mening.</seg>
>     </tuv>
>   </tu>
> 
>   <tu id="2">
>     <tuv lang="EN-US">
>       <seg>Short sentence.</seg>
>     </tuv>
>     <tuv lang="SV-SE">
>       <seg>Kort mening.</seg>
>     </tuv>
>   </tu>
> </body>
> </tmx>
> 
> The translations can be changed as part of the editing and proof-reading
> process, and it is obviously desirable to update the translation memory
> with any such changes. This is easy as long as the file remains in this
> "double converted" format, since each <trans-unit> corresponds to a single
> translation unit in the translation memory. A tool can simply iterate over
> each segment in the XLIFF file, find it in the translation memory and
> update the corresponding translation.
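
That iterate-and-update step can be sketched as follows (a hypothetical
Python illustration; the plain dict stands in for a real translation memory):

```python
import xml.etree.ElementTree as ET

# Hypothetical in-memory TM (source segment -> target segment), holding
# the translations as they were before proof-reading.
tm = {"Long sentence.": "En lång mening.",
      "Short sentence.": "Kort mening."}

xliff_body = """<body>
  <trans-unit id="1.1">
    <source xml:lang="en-US">Long sentence.</source>
    <target xml:lang="sv-SE" state="translated">Lång mening.</target>
  </trans-unit>
  <trans-unit id="1.2">
    <source xml:lang="en-US">Short sentence.</source>
    <target xml:lang="sv-SE" state="translated">Kort mening.</target>
  </trans-unit>
</body>"""

# One TM update per trans-unit: look the source up, overwrite the target.
for unit in ET.fromstring(xliff_body).iter("trans-unit"):
    source = unit.find("source").text
    target = unit.find("target").text
    if source in tm and tm[source] != target:
        tm[source] = target  # the proof-read translation wins

print(tm["Long sentence."])  # the edited "Lång mening." replaced "En lång mening."
```

Nothing more than a lookup per segment is needed, which is exactly why the
"double converted" form is so convenient to keep.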
> 
> At some point the "double converted" XLIFF file must be converted back to
> its original XLIFF format. In this example it will look like this:
> 
>  <trans-unit id="1">
>    <source xml:lang="en-US">Long sentence. Short sentence.</source>
>    <target xml:lang="sv-SE" state="translated">En lång mening. Kort
> mening.</target>
>  </trans-unit>
> 
> Assume that changes to the translation are needed, and the file is
> updated to look like this:
> 
>  <trans-unit id="1">
>    <source xml:lang="en-US">Long sentence. Short sentence.</source>
>    <target xml:lang="sv-SE" state="translated">Lång mening. Kort
> mening.</target>
>  </trans-unit>
> 
> Obviously it is still desirable to also update the translation memory with
> these changes, to avoid having to correct the same translation again in
> the future. This is where we run into problems.
> - In order to update the translation memory we need to convert the source
> and target of the <trans-unit> into text segmented according to the rules
> used for the translation memory.
> 
> Unfortunately this is no longer possible.
> - It is possible to reconstruct the segmented source content, by simply
> applying the same algorithm that was used to create the original version
> of the "double converted" XLIFF file.
> - It is NOT possible to automatically identify the translations for each
> of the source language segments in a way that is guaranteed to be correct.
> This requires an alignment process, which in order to succeed requires a
> linguistic understanding of both the source and the target languages.
> There may be more than one way to divide the target content into
> translations matching the source segments, e.g. if a sentence with new
> information has been introduced in the target. In such a case it would
> even be impossible for a human to ensure that the source and target
> content is correctly matched up in segments that correspond to the
> translation memory.
> 
> In the example above it looks easy to identify the corresponding target
> language segments as the individual sentences:
> 
> <trans-unit id="1.1">
>    <source xml:lang="en-US">Long sentence.</source>
>    <target xml:lang="sv-SE" state="translated">Lång mening.</target>
>  </trans-unit>
>  <trans-unit id="1.2">
>    <source xml:lang="en-US">Short sentence.</source>
>    <target xml:lang="sv-SE" state="translated">Kort mening.</target>
>  </trans-unit>
> 
> However we must not forget that this involves a process of "guessing"
> which parts of the source and target belong together, as explained above.
> It was to illustrate the problem with this "guessing" that my original
> example introduced a change where the long sentence had been translated
> into two target language sentences:
> 
>  <trans-unit id="1">
>    <source xml:lang="en-US">Long sentence. Short sentence.</source>
>    <target xml:lang="sv-SE" state="final">Lång mening. Mer mening. Kort
>  mening.</target>
>  </trans-unit>
> 
> Here a tool that relies on matching up sentences between the source and
> the target would have several options. From the tool's perspective the
> correct segmentation could be either:
> 
> <trans-unit id="1.1">
>    <source xml:lang="en-US">Long sentence.</source>
>    <target xml:lang="sv-SE" state="translated">Lång mening. Mer
> mening.</target>
>  </trans-unit>
>  <trans-unit id="1.2">
>    <source xml:lang="en-US">Short sentence.</source>
>    <target xml:lang="sv-SE" state="translated">Kort mening.</target>
>  </trans-unit>
> 
> Or:
> 
> <trans-unit id="1.1">
>    <source xml:lang="en-US">Long sentence.</source>
>    <target xml:lang="sv-SE" state="translated">Lång mening. </target>
>  </trans-unit>
>  <trans-unit id="1.2">
>    <source xml:lang="en-US">Short sentence.</source>
>    <target xml:lang="sv-SE" state="translated">Kort mening. Mer
> mening.</target>
>  </trans-unit>
> 
> Without understanding the source and target languages there is no way the
> tool can know which of these options is correct.
> 
> I hope this clarifies matters a bit.
> 
> Best regards,
> Magnus
> 
> -----Original Message-----
> From: Andrzej Zydron [mailto:azydron@xml-intl.com] 
> Sent: Saturday, March 27, 2004 11:07 AM
> To: xliff-seg@lists.oasis-open.org
> Subject: Re: [xliff-seg] Segmentation and filters
> 
> Hi Magnus,
> 
> Thank you for your reply. It looks as if we are in total agreement on
> point 1).
> 
> As to point 2), thank you for pointing out the potential problems involved
> with segmentation, which can arise where:
> 
> a) The initial segmentation was incorrect.
> b) The translation requires that more than one segment is rendered as a
> single segment in the target language.
> 
> It would be very positive if we could agree that these two occurrences
> are definitive regarding this type of problem. In my experience they are.
> 
> From the semantic point of view a common term such as "merged segments"
> would allow us to put these into a single common category.
> 
> There is another category which I have come across, where there is no
> equivalent translation possible for a given segment, because for example
> the segment relates to a particular address of a subsidiary that just does
> not exist. In xml:tm these are termed "no-equivalent translation
> available", or "no-equiv" for short.
> 
> Neither instance need pose any problem during the merging process. The
> target element for the first trans-unit contains the translation, while
> the following trans-units are flagged as being merged. Using your example
> the effect is as follows:
> 
> <trans-unit id="1.1">
>    <source xml:lang="en-US">Long sentence.</source>
>    <target xml:lang="sv-SE" state="translated">Lång mening. Mer mening.
> Kort mening.</target>
> </trans-unit>
> <trans-unit id="1.2">
>    <source xml:lang="en-US">Short sentence.</source>
>    <target xml:lang="sv-SE" state="merged"/>
> </trans-unit>
> 
> The XLIFF target "state" attribute would require the addition of a
> "merged" value, or another attribute than "state" can be used. You can now
> use this intelligence to load your leveraged memory, which will know that
> if "Long sentence." is followed by "Short sentence." the equivalent
> translated text is "Lång mening. Mer mening. Kort mening." and requires
> merging. There is no restriction on the number of merged segments. A
> merged segment always relates to the final non-merged segment.
> 
> When the segmented XLIFF is merged back into the original XLIFF file the
> effect is as follows:
> 
> <trans-unit id="1">
>    <source xml:lang="en-US"><tm:tu id="1.1">Long sentence.</tm:tu>
>                             <tm:tu id="1.2">Short sentence.</tm:tu>
>    </source>
>    <target xml:lang="sv-SE"><tm:tu id="1.1">Lång mening. Mer mening. Kort
> mening.</tm:tu>
>                             <tm:tu id="1.2" flag="merged"/>
>    </target>
> </trans-unit>
> 
> and the stripped out namespace version will look like this:
> 
> <trans-unit id="1">
>    <source xml:lang="en-US">Long sentence. Short sentence.</source>
>    <target xml:lang="sv-SE">Lång mening. Mer mening. Kort mening.</target>
> </trans-unit>
> 
> Which you can also load into memory.
> 
> This approach does not restrict or hinder the translation or loading of 
> leveraged memories in any way. In fact it can supply the translation
memory 
> software with additional hints that can automatically compensate for
> incorrect 
> segmentation should it occur again.
> 
> In a similar vein we can also handle non-equivalent segments as in:
> 
> <trans-unit id="1">
>    <source xml:lang="en-US">The address of our Florida branch is
> 1 Manhattan drive, Orlando FLA123</source>
>    <target xml:lang="sv-SE" state="no-equiv"/>
> </trans-unit>
> 
> Although I suspect that there may well be another mechanism already built
> into XLIFF to handle this.
> 
> XML provides such a rich vocabulary and syntax that it is very easy to
> overcome any segmentation issues in XLIFF.
> 
> Best regards,
> 
> AZ
> 
> 
> 
> Magnus Martikainen wrote:
> 
> 
>>Hi all,
>>
>>Thanks very much Andrzej for your clear and structured arguments and
>>examples, this is very useful for further discussions on this topic.
>>
>>Here are my comments to this thread:
>>
>>1) I agree that segmentation should not be mandatory. 
>>However I am also of the opinion that segmentation should always be
>>allowed, whether the original content was supposedly segmented during
>>extraction or not. The reason is that detection of the best possible
>>segment boundaries may still need to be adapted to best fit the tools and
>>the translation memories used during the localisation, which could use
>>slightly different segmentation. (A common example would be the handling
>>of tags at sentence boundaries etc.) Since the goal for the user must be
>>to achieve maximum leverage from their translation memory resources it
>>may be necessary to adjust segmentation also in such cases.
>>As a side effect of this, if we agree that we always want to "allow"
>>segmentation of the content I see no need for an explicit
>>segmented="true/false" attribute.
>>
>>
>>2) I can think of a couple of situations where a "double conversion" as
>>you are suggesting would cause problems:
>>
>>a) It applies new "hard" boundaries to the segments. Thus segmentation
>>cannot be changed later, e.g. during interactive translation. Sometimes it
>>is necessary or desirable to change the default segmentation to
>>accommodate translation needs while working on the document. Examples
>>include:
>>- the need to adjust segmentation that has been incorrectly applied (e.g.
>>an abbreviation in the middle of a sentence that has been wrongly
>>interpreted by the segmentation tool as the end of that sentence).
>>- the occasional need to translate two or more source sentences into one
>>target language sentence for it to be a meaningful translation
>>
>>
>>b) During backward conversion of the "doubly converted" XLIFF file to its
>>original XLIFF format the segment boundaries are lost.
>>If changes are made to the content of the XLIFF file after it has been
>>converted back to its original XLIFF format it is no longer possible to
>>get those changes back into the "double converted" XLIFF document, e.g.
>>in order to update a translation memory with those changes. Once
>>converted back, the segment boundaries in both the source and target
>>segments are gone. (The source segmentation can perhaps be re-created,
>>but it is no longer possible to determine with certainty the correct
>>corresponding target segments.)
>>
>>Example: If the "working" XLIFF file (the segmented version which is used
>>to interact with the translation memory during translation) after
>>translation contains this:
>>
>><trans-unit id="1.1">
>>  <source xml:lang="en-US">Long sentence.</source>
>>  <target xml:lang="sv-SE" state="translated">En lång mening.</target>
>></trans-unit>
>><trans-unit id="1.2">
>>  <source xml:lang="en-US">Short sentence.</source>
>>  <target xml:lang="sv-SE" state="translated">Kort mening.</target>
>></trans-unit>
>>
>>Both of these trans-units belong to the same <trans-unit> in the original
>>XLIFF file, and when the XLIFF file is converted back to its original
>>XLIFF format it could look like this (depending on the content of the
>>skeleton file):
>>
>><trans-unit id="1">
>>  <source xml:lang="en-US">Long sentence. Short sentence.</source>
>>  <target xml:lang="sv-SE" state="translated">En lång mening. Kort
>>mening.</target>
>></trans-unit>
>>
>>Now someone decides at the last minute that the translation needs to be
>>changed - the long sentence is for some reason better translated as two
>>sentences. This change is approved and signed off. The XLIFF file is
>>changed into:
>>
>><trans-unit id="1">
>>  <source xml:lang="en-US">Long sentence. Short sentence.</source>
>>  <target xml:lang="sv-SE" state="final">Lång mening. Mer mening. Kort
>>mening.</target>
>></trans-unit>
>>
>>Unfortunately there is no easy way to update the translation memory with
>>these changes, since the original segment boundaries that were used
>>during translation were lost.
>>Tools can of course try to automatically "guess" the segment boundaries
>>in the source and target and somehow match them up, but this is not a
>>trivial task, as can be seen from this example. There is no way an
>>automatic tool can determine whether the middle sentence in the modified
>>target should be paired with the first or the last sentence (if either).
>>Thus there is no way to safely update the translation memory
>>automatically with these changes.
>>
>>
>>c) The converted XLIFF file loses its "identity", or its direct
>>connection with the underlying data format.
>>Tools that have been developed to specifically process a particular file
>>type wrapped in XLIFF cannot be used on the "converted" XLIFF file since:
>>- the original skeleton is no longer available/usable
>>- some of the content in the original XLIFF file (in particular tags
>>between sentences) has moved into the new skeleton.
>>- the new skeleton has been created with a tool and process unknown to
>>any other XLIFF tools.
>>
>>A typical example of a tool that would be useful to run during the
>>localisation process is a verification/validation tool to ascertain that
>>the translated content can be converted back to a valid original format.
>>Examples of validation tools:
>>- Tag verification, validation against the schema, DTD, or other rules
>>that the content must adhere to.
>>- Length verification to ensure that translated content does not exceed
>>length limitations (which may be specified explicitly in the XLIFF file).
>>Both of these tasks require dealing with the underlying native data that
>>the XLIFF file wraps in order to perform their jobs. For the reasons
>>stated above the "doubly converted" XLIFF files cannot be used for this.
>>
>>
>>d) An additional level of unnecessary complexity is introduced, since it
>>is necessary to do an additional transformation/conversion of the XLIFF
>>document before it can be processed by the filter that created it.
>>In something as complex as a typical localisation project this is not a
>>factor to be neglected. If an average project has 100 files translated
>>into 10 languages, that means an additional 1000 file conversions are
>>necessary to complete the project. If the workflow for this is not
>>entirely automated it could mean that someone may need to use a tool to
>>manually check each individual file to determine which state it is in
>>before the files can be delivered or further processed.
>>If other tools used in the localisation process use the same approach of
>>converting the XLIFF file to a new XLIFF format the complexity is
>>multiplied... All this can be avoided if we support the notion of
>>segments directly in XLIFF - then the very same XLIFF file can be used
>>in all stages of the process.
>>
>>
>>Looking forward to further comments and discussions on this topic!
>>
>>Best regards,
>>Magnus Martikainen
>>TRADOS Inc.
>>
>>-----Original Message-----
>>From: Andrzej Zydron [mailto:azydron@xml-intl.com] 
>>Sent: Tuesday, March 23, 2004 2:07 PM
>>To: xliff-seg@lists.oasis-open.org
>>Subject: [xliff-seg] Segmentation and filters
>>
>>Hi,
>>
>>First of all I would like to thank Magnus for the hard work he has put 
>>in so far and the detailed document that he has prepared. This has 
>>provided a clear starting point for further discussions.
>>
>>To kick off this thread I would like to state my views on the 
>>segmentation issue:
>>
>>1) Segmentation within XLIFF should not be mandated. It should be 
>>optional. There are implementations such as xml:tm where segmentation is 
>>done before extraction. It is also quite easy to envisage situations 
>>where XLIFF is the output of an existing translation workbench system 
>>that has already segmented and pre-matched data for sending out to a 
>>translator who will import it into an XLIFF aware editing environment.
>>
>>I can also see Magnus' point that quite often XLIFF will contain 
>>unsegmented data.
>>
>>One solution would be to provide an optional "segmented" attribute at 
>>the <file> element level which states that the data has already been 
>>segmented, with a default value of "false". If the data has been 
>>segmented then an xlink attribute pointing to the SRX URL could also be
>>provided.
>>
>>2) One way of handling segmentation within XLIFF is to create a 
>>secondary XLIFF document from the current XLIFF document that has a 
>>separate <trans-unit> element for each segment. This would effectively 
>>be a segmentation extraction of the original XLIFF document. This has 
>>the one significant advantage that no further extensions are required to 
>>the XLIFF standard. It does away with all the potential complexity of 
>>trying to nest <trans-unit> elements or add workable syntax to cope with 
>>multiple source and target segments within a <trans-unit>.
>>
>>Because XLIFF is a well defined XML format it is very easy to write an 
>>extraction + segmentation filter for it to provide an XLIFF file where 
>>the <trans-unit> elements are at the segment level, along with a 
>>skeleton file for merging back.
>>
>>After translation you can elect to store leveraged memory at both the 
>>segmented and unsegmented levels.
>>
>>Here is an example based on Magnus' data:
>>
>>Step 1: Original XLIFF file:
>>
>><body>
>>   <trans-unit id="1">
>>     <source xml:lang="en-US">The Document Title</source>
>>   </trans-unit>
>>   <trans-unit id="2">
>>     <source xml:lang="en-US">First sentence. <bpt 
>>id="1">[ITALIC:</bpt>This is an important sentence.<ept 
>>id="1">]</ept></source>
>>   </trans-unit>
>>   <trans-unit id="3">
>>     <source xml:lang="en-US">Ambiguous sentence. More <bpt 
>>id="1">[LINK-to-toc:</bpt>content<ept id="1">]</ept>.</source>
>>   </trans-unit>
>></body>
>>
>>Step 2: Introduce namespace segmentation into XLIFF file
>>
>><body xmlns:tm="http://www.xml-intl.com/dtd/tm.xsd">
>>   <trans-unit id="1">
>>     <source xml:lang="en-US"><tm:tu id="1.1">The Document 
>>Title</tm:tu></source>
>>   </trans-unit>
>>   <trans-unit id="2">
>>     <source xml:lang="en-US"><tm:tu id="2.1">First sentence.</tm:tu> 
>><bpt id="1">[ITALIC:</bpt><tm:tu id="2.2">This is an important 
>>sentence.</tm:tu><ept id="1">]</ept></source>
>>   </trans-unit>
>>   <trans-unit id="3">
>>     <source xml:lang="en-US"><tm:tu id="3.1">Ambiguous 
>>sentence.</tm:tu> <tm:tu id="3.2">More <bpt 
>>id="1">[LINK-to-toc:</bpt>content<ept id="1">]</ept>.</tm:tu></source>
>>   </trans-unit>
>></body>
>>
>>Step 3: Using a simple XSLT transformation, create a new segmented XLIFF
>>file:
>>
>><body segmented="true" srx="http://www.xml-intl.com/srx/en-US.srx">
>>   <trans-unit id="1.1">
>>     <source xml:lang="en-US">The Document Title</source>
>>   </trans-unit>
>>   <trans-unit id="2.1">
>>     <source xml:lang="en-US">First sentence.</source>
>>   </trans-unit>
>>   <trans-unit id="2.2">
>>     <source xml:lang="en-US">This is an important sentence.</source>
>>   </trans-unit>
>>   <trans-unit id="3.1">
>>     <source xml:lang="en-US">Ambiguous sentence.</source>
>>   </trans-unit>
>>   <trans-unit id="3.2">
>>     <source xml:lang="en-US">More <bpt 
>>id="1">[LINK-to-toc:</bpt>content<ept id="1">]</ept>.</source>
>>   </trans-unit>
>></body>
>>
>>And Skeleton file:
>>
>><body xmlns:tm="http://www.xml-intl.com/dtd/tm.xsd">
>>   <trans-unit id="1">
>>     <source xml:lang="en-US"><tm:tu id="1.1"><ext 
>>id="1.1"/></tm:tu></source>
>>   </trans-unit>
>>   <trans-unit id="2">
>>     <source xml:lang="en-US"><tm:tu id="2.1"><ext id="2.1"/></tm:tu> 
>><bpt id="1">[ITALIC:</bpt><tm:tu id="2.2"><ext id="2.2"/></tm:tu><ept 
>>id="1">]</ept></source>
>>   </trans-unit>
>>   <trans-unit id="3">
>>     <source xml:lang="en-US"><tm:tu id="3.1"><ext id="3.1"/></tm:tu> 
>><tm:tu id="3.2"><ext id="3.2"/></tm:tu></source>
>>   </trans-unit>
>></body>
>>
>>Step 4: Put the segmented XLIFF file through whatever matching process you 
>>want to, to produce:
>>
>><body segmented="true" srx="http://www.xml-intl.com/srx/en-US.srx">
>>   <trans-unit id="1.1">
>>     <source xml:lang="en-US">The Document Title</source>
>>     <target xml:lang="sv-SE" state="translated" 
>>state-qualifier="leveraged-tm">Dokumentrubriken</target>
>>   </trans-unit>
>>   <trans-unit id="2.1">
>>     <source xml:lang="en-US">First sentence.</source>
>>     <target xml:lang="sv-SE" state="translated" 
>>state-qualifier="leveraged-tm">Första meningen.</target>
>>   </trans-unit>
>>   <trans-unit id="2.2">
>>     <source xml:lang="en-US">This is an important sentence.</source>
>>       <alt-trans origin="translation memory" match-quality="80%">
>>         <source xml:lang="en-US">This is an extremely important 
>>sentence.</source>
>>         <target xml:lang="sv-SE">En mycket viktig mening.</target>
>>       </alt-trans>
>>   </trans-unit>
>>   <trans-unit id="3.1">
>>     <source xml:lang="en-US">Ambiguous sentence.</source>
>>     <target xml:lang="sv-SE" state="needs-review-translation">Omstridd 
>>mening.</target>
>>       <note annotates="target" from="Swedish Translator">This 
>>translation may not be appropriate. Please evaluate it carefully!</note>
>>   </trans-unit>
>>   <trans-unit id="3.2">
>>     <source xml:lang="en-US">More <bpt 
>>id="1">[LINK-to-toc:</bpt>content<ept id="1">]</ept>.</source>
>>     <target xml:lang="sv-SE" state="translated">Ytterligare <bpt 
>>id="1">[LINK-to-toc:</bpt>innehåll<ept id="1">]</ept>.</target>
>>   </trans-unit>
>></body>
>>
>>
>>Step 5: Using nothing more than XSLT, merge the translated document 
>>back, then strip out the segmented namespace elements using another 
>>simple XSLT transformation and you arrive at a translated XLIFF file 
>>that is equal to the original source language unsegmented file.
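
For plain-text segments that final stripping pass is trivial in any XML
toolkit, not only XSLT. A minimal Python sketch (my own illustration,
assuming no inline <bpt>/<ept> markup inside the segments):

```python
import xml.etree.ElementTree as ET

# A <source> element carrying the tm:tu segment markers from Step 2.
marked = ('<source xmlns:tm="http://www.xml-intl.com/dtd/tm.xsd" '
          'xml:lang="en-US"><tm:tu id="1.1">Long sentence.</tm:tu> '
          '<tm:tu id="1.2">Short sentence.</tm:tu></source>')

element = ET.fromstring(marked)
# Dropping the tm:tu wrappers amounts to concatenating all text content.
stripped = "".join(element.itertext())
print(stripped)  # "Long sentence. Short sentence."
```

With inline tags present a real implementation would splice the tm:tu
children back into the parent rather than flatten to text, which is exactly
what the XSLT pass described above does.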
>>
>>This approach has the benefit of requiring minimal or possibly no change 
>>to the existing excellent XLIFF specification.
>>
>>Hope this helps kick off the thread.
>>
>>Regards,
>>
>>AZ
>>
> 
> 

-- 


email - azydron@xml-intl.com
smail - Mr. A.Zydron
         24 Maybrook Gardens,
         High Wycombe,
         Bucks HP13 6PJ
Mobile +(44) 7966 477181
FAX    +(44) 870 831 8868
www - http://www.xml-intl.com

This message contains confidential information and is intended only
for the individual named.  If you are not the named addressee you
may not disseminate, distribute or copy this e-mail.  Please
notify the sender immediately by e-mail if you have received this
e-mail by mistake and delete this e-mail from your system.
E-mail transmission cannot be guaranteed to be secure or error-free
as information could be intercepted, corrupted, lost, destroyed,
arrive late or incomplete, or contain viruses.  The sender therefore
does not accept liability for any errors or omissions in the contents
of this message which arise as a result of e-mail transmission.  If
verification is required please request a hard-copy version. Unless
explicitly stated otherwise this message is provided for informational
purposes only and should not be construed as a solicitation or offer.





