xliff-seg message



Subject: Re: [xliff-seg] Segmentation and filters


Hi Magnus,

Firstly, please accept my apologies regarding tomorrow's meeting. I will not be
able to attend, as I have had to reschedule my time to make the Trans Web
Services face-to-face. Doubtless, as you say, we should be able to get together
to discuss some of the issues in Dublin.

Regarding your reply, could you explain how the target content would be updated
without the knowledge of the translation supplier? The customer updates the
source, then sends the file for translation. The recipient of the XLIFF file
does the translation. How else does the target get updated?

Let's discuss in Dublin,

Best Regards,

AZ

Magnus Martikainen wrote:

> Hi Andrzej,
> 
> I'm afraid you must have misunderstood the issue I am talking about. In your
> example below you describe the case where <source> content has been updated
> by the content owner, while the scenario I am talking about is where the
> <target> content has been updated (the <source> stays the same).
> 
> Could I please ask you to study my original example in more detail and show
> me how you would propose to handle <target> only changes?
> 
> Best regards,
> Magnus
> 
> -----Original Message-----
> From: Andrzej Zydron [mailto:azydron@xml-intl.com] 
> Sent: Sunday, April 25, 2004 1:48 PM
> To: Magnus Martikainen
> Cc: xliff-seg@lists.oasis-open.org
> Subject: Re: [xliff-seg] Segmentation and filters
> 
> Hi Magnus,
> 
> The system relies on the translation supplier maintaining the source and
> target segmented namespace versions of the document. The process introduces
> the segmented namespace into the updated file and then resolves the ids
> between the original source and the updated source. The segments that are
> unchanged are in effect matched, and the target text can be automatically
> inserted. Here is a simplified version of the process:
> 
> 1) Original Source XLIFF file:
> 
> <trans-unit id="1">
>     <source xml:lang="en-US">Long sentence. Short sentence.</source>
>     <target xml:lang="sv-SE">Long sentence. Short sentence.</target>
> </trans-unit>
> 
> 2) Introduce segmentation namespace
> 
> <trans-unit id="1">
>     <source xml:lang="en-US">Long sentence. Short sentence.</source>
>     <target xml:lang="sv-SE"><tm:tu id="1.1">Long sentence.</tm:tu><tm:tu 
> id="1.2">Short sentence.</tm:tu></target>
> </trans-unit>
> 
> 3) Second level XLIFF file:
> 
> <trans-unit id="1.1">
>     <source xml:lang="en-US">Long sentence.</source>
>     <target xml:lang="sv-SE">Long sentence.</target>
> </trans-unit>
> <trans-unit id="1.2">
>     <source xml:lang="en-US">Short sentence.</source>
>     <target xml:lang="sv-SE">Short sentence.</target>
> </trans-unit>
> 
> 4) Translation of second level XLIFF file:
> 
> <trans-unit id="1.1">
>     <source xml:lang="en-US">Long sentence.</source>
>     <target xml:lang="sv-SE">Lång mening.</target>
> </trans-unit>
> <trans-unit id="1.2">
>     <source xml:lang="en-US">Short sentence.</source>
>     <target xml:lang="sv-SE">Mer mening. Kort mening.</target>
> </trans-unit>
> 
> 5) Merging of translated text:
> 
> <trans-unit id="1">
>     <source xml:lang="en-US">Long sentence. Short sentence.</source>
>     <target xml:lang="sv-SE"><tm:tu id="1.1">Lång mening.</tm:tu><tm:tu 
> id="1.2">Mer mening. Kort mening.</tm:tu></target>
> </trans-unit>
> 
> 6) Stripping out of the namespace for return to supplier:
> 
> <trans-unit id="1">
>     <source xml:lang="en-US">Long sentence. Short sentence.</source>
>     <target xml:lang="sv-SE">Lång mening. Mer mening. Kort mening.</target>
> </trans-unit>
> 
> 7) Updated file from customer:
> 
> <trans-unit id="1">
>     <source xml:lang="en-US">Long sentence. Short sentence. Another 
> sentence.</source>
>     <target xml:lang="sv-SE">Long sentence. Short sentence. Another 
> sentence.</target>
> </trans-unit>
> 
> 8) Introduce segmentation namespace
> 
> <trans-unit id="1">
>     <source xml:lang="en-US">Long sentence. Short sentence. Another 
> sentence.</source>
>     <target xml:lang="sv-SE"><tm:tu id="1.1">Long sentence.</tm:tu><tm:tu
> id="1.2">Short sentence.</tm:tu><tm:tu id="1.3">Another
> sentence.</tm:tu></target>
> </trans-unit>
> 
> 9) Run DOM differencing on the tm segmented namespace to resolve the new-to-old
> id mapping. In this case the ids map exactly. Comparison is on the original
> pre-translated versions of the segmented namespace files, e.g. this one:
> 
> <trans-unit id="1">
>     <source xml:lang="en-US">Long sentence. Short sentence.</source>
>     <target xml:lang="sv-SE"><tm:tu id="1.1">Long sentence.</tm:tu><tm:tu 
> id="1.2">Short sentence.</tm:tu></target>
> </trans-unit>
> 
> 10) Carry over translation and create new second level XLIFF file:
> 
> <trans-unit id="1.1">
>     <source xml:lang="en-US">Long sentence.</source>
>     <target xml:lang="sv-SE" state-qualifier="exact-match">Lång
> mening.</target>
> </trans-unit>
> <trans-unit id="1.2">
>     <source xml:lang="en-US">Short sentence.</source>
>     <target xml:lang="sv-SE" state-qualifier="exact-match">Mer mening. Kort 
> mening.</target>
> </trans-unit>
> <trans-unit id="1.3">
>     <source xml:lang="en-US">Another sentence.</source>
>     <target xml:lang="sv-SE">Another sentence.</target>
> </trans-unit>
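
A minimal Python sketch of the carry-over in steps 9) and 10) above. For
brevity it matches segments on unchanged <source> text rather than performing
the DOM differencing AZ describes, and it assumes the flat, plain-text
trans-unit structure of these simplified examples; the function names are
illustrative only, not part of xml:tm.

import xml.etree.ElementTree as ET

def build_memory(previous_second_level_xliff):
    """Map source segment text -> target segment text from the previously
    translated second-level XLIFF file (step 4 above)."""
    memory = {}
    for tu in ET.parse(previous_second_level_xliff).iter('trans-unit'):
        source = tu.find('source')
        target = tu.find('target')
        if source is not None and target is not None and target.text:
            memory[source.text] = target.text
    return memory

def carry_over(new_second_level_xliff, memory, out_path):
    """Fill in targets for unchanged segments and mark them as exact matches,
    as in step 10 above; new segments are left for translation."""
    tree = ET.parse(new_second_level_xliff)
    for tu in tree.iter('trans-unit'):
        source = tu.find('source')
        target = tu.find('target')
        if source is None or target is None:
            continue
        if source.text in memory:
            target.text = memory[source.text]
            target.set('state-qualifier', 'exact-match')
    tree.write(out_path, encoding='utf-8')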
> 
> I am going to be in Dublin on Thursday and Friday for the Trans Web Services
> TC face-to-face. I remember Peter mentioning that you will be there too. If
> you will be at the Bowne offices I can go through this with you in great
> detail.
> 
> Best Regards,
> 
> AZ
> 
> 
> Magnus Martikainen wrote:
> 
>>Hi Andrzej,
>>
>>Could you please show how the problem in my example would be handled with
>>XLIFF only?
>>
>>Cheers,
>>Magnus
>>
>>-----Original Message-----
>>From: Andrzej Zydron [mailto:azydron@xml-intl.com] 
>>Sent: Sunday, April 18, 2004 10:35 AM
>>To: Magnus Martikainen
>>Subject: Re: [xliff-seg] Segmentation and filters
>>
>>Hi Magnus,
>>
>>Apologies for the time it has taken me to get back to you but again other 
>>priorities have intervened.
>>
>>In order to see the answer you need to think in XML and not in traditional
>>translation memory terms - it is an easy trap to fall into. XML changes the
>>rules completely and provides us with an elegant and reliable grammar to
>>work with.
>>
>>To handle the lookup effectively you need to hold the original version of the
>>source data in a segmented namespace form. Do not make an issue of the
>>segmented namespace - it is only a problem if you really want it to be a
>>problem. If you are used to working with XML it is no problem at all. You then
>>introduce the segmented namespace into the updated source version. You then do
>>an XML DOM difference on the namespace versions of the two files at the source
>>level. You can then resolve the ID references and so obtain the mapping from
>>the original to the new document, including all matching.
>>
>>This takes a completely different approach than you may have been used to.
>>For a full explanation of the mechanics involved please refer to the article
>>about xml:tm on xml.com:
>>
>>http://www.xml.com/pub/a/2004/01/07/xmltm.html
>>
>>The URL for the xml:tm specification, which provides some additional detail, is:
>>
>>http://www.xml-intl.com/docs/specification/xml-tm.html
>>
>>Best Regards,
>>
>>AZ
>>
>>
>>Magnus Martikainen wrote:
>>
>>
>>>Hi Andrzej,
>>>
>>>I just noticed a cut-and-paste problem in my example. The correct version of
>>>Z alternative 2 should read:
>>>-----------------------------
>>>(Z alternative 2) The new sentence in <target> belongs to the second
>>>translation unit:
>>>
>>><trans-unit id="1.1">
>>>  <source xml:lang="en-US">Long sentence.</source>
>>>  <target xml:lang="sv-SE" state="translated">Lång mening.</target>
>>></trans-unit>
>>><trans-unit id="1.2">
>>>  <source xml:lang="en-US">Short sentence.</source>
>>>  <target xml:lang="sv-SE" state="translated">Mer
>>>mening. Kort mening.</target>
>>></trans-unit>
>>>-----------------------------
>>>
>>>Sorry for the confusion.
>>>
>>>Magnus
>>>
>>>-----Original Message-----
>>>From: Magnus Martikainen [mailto:magnus@trados.com] 
>>>Sent: Monday, April 12, 2004 5:06 PM
>>>To: Andrzej Zydron
>>>Cc: xliff-seg@lists.oasis-open.org
>>>Subject: RE: [xliff-seg] Segmentation and filters
>>>
>>>Hi Andrzej,
>>>
>>>Thanks for your reply. Perhaps we are starting to reach a better
>>>understanding of segmentation now. Let me explain:
>>>
>>>Regarding your request about a "merged" attribute for translation units I am
>>>afraid that in the current version of XLIFF this will introduce difficulties.
>>>The problem stems from the fact that the division of the translatable
>>>content into <trans-unit>s and skeleton is a process that in XLIFF is left
>>>entirely up to the filter that produces the XLIFF file. 
>>>No tool that processes XLIFF files (other than the filter) can make any
>>>assumptions about the relations between two subsequent <trans-unit> elements
>>>in the XLIFF file.
>>>
>>>Thus in your example there is in fact no way any XLIFF editing tool can
>>>determine that <trans-unit id="1.1"> and <trans-unit id="1.2"> can be merged
>>>and translated as one piece of text. They could be totally unrelated. E.g.
>>>the first <trans-unit> could be a document heading and the second one could
>>>be the content of a table cell, or a call-out for an image that appears
>>>somewhere under the heading.
>>>
>>>Since the relation between <trans-unit> elements is undefined by XLIFF and
>>>hidden by the filter (in the skeleton) we can never permit two <trans-unit>
>>>elements to be translated as one, since they may in fact not be related at
>>>all.
>>>That is why I refer to the <trans-unit> as a "hard" segmentation boundary.
>>>They are "set in stone" by the filter that produces the XLIFF, due to the
>>>fact that the use of the skeleton mechanism is undefined.
>>>
>>>For it to be possible to merge two <trans-unit>s we would need to introduce
>>>a mechanism to indicate that the boundaries between certain subsequent
>>><trans-unit> elements can be treated differently (i.e. as "softer"
>>>boundaries). This is to a large extent what segmentation support in XLIFF is
>>>all about...
>>>In fact an alternate representation of segmentation in XLIFF than the one I
>>>used in my document could be based on such a mechanism. (That is a different
>>>discussion - let's leave it for now, until we have come to a consensus about
>>>the need for segmentation and the different scenarios and use cases for it.)
>>>
>>>I hope this also clarifies a bit more why I distinguish between "hard" and
>>>"soft" segmentation boundaries, as you mention in B) below.
>>>
>>>
>>>A) (i) To clarify that we are indeed talking about the same issue here,
>>>could you perhaps explain the mechanism you use to automatically transfer
>>><target> element specific changes in one version of XLIFF (X) into a
>>>differently segmented, translated XLIFF file (Y) to produce an updated
>>>version of that segmented XLIFF file (Z) as in the example I described:
>>>
>>>(X) Updated XLIFF file in original format (this is what the localisation
>>>agency received as an update from the content owner in my example):
>>>
>>><trans-unit id="1">
>>>  <source xml:lang="en-US">Long sentence. Short sentence.</source>
>>>  <target xml:lang="sv-SE" state="final">Lång mening. Mer mening. Kort
>>>mening.</target>
>>></trans-unit>
>>>
>>>
>>>(Y) Differently segmented XLIFF, original version, in working format (the
>>>translation agency's working file with segmentation corresponding to the
>>>translation memory, as it was before it was converted to the original format
>>>and delivered to the content owner):
>>>
>>><trans-unit id="1.1">
>>>  <source xml:lang="en-US">Long sentence.</source>
>>>  <target xml:lang="sv-SE" state="translated">Lång mening.</target>
>>></trans-unit>
>>><trans-unit id="1.2">
>>>  <source xml:lang="en-US">Short sentence.</source>
>>>  <target xml:lang="sv-SE" state="translated">Kort mening.</target>
>>></trans-unit>
>>>
>>>
>>>I can see no way that this can be correctly and safely handled in an
>>>automated fashion. The file (Z) could be either:
>>>
>>>(Z alternative 1) The new sentence in <target> belongs to the first
>>>translation unit:
>>>
>>><trans-unit id="1.1">
>>>  <source xml:lang="en-US">Long sentence.</source>
>>>  <target xml:lang="sv-SE" state="translated">Lång mening. Mer
>>>mening.</target>
>>></trans-unit>
>>><trans-unit id="1.2">
>>>  <source xml:lang="en-US">Short sentence.</source>
>>>  <target xml:lang="sv-SE" state="translated">Kort mening.</target>
>>></trans-unit>
>>>
>>>
>>>Or:
>>>
>>>(Z alternative 2) The new sentence in <target> belongs to the second
>>>translation unit:
>>>
>>><trans-unit id="1.1">
>>>  <source xml:lang="en-US">Long sentence.</source>
>>>  <target xml:lang="sv-SE" state="translated">Lång mening. </target>
>>></trans-unit>
>>><trans-unit id="1.2">
>>>  <source xml:lang="en-US">Short sentence.</source>
>>>  <target xml:lang="sv-SE" state="translated">Kort mening. Mer
>>>mening.</target>
>>></trans-unit>
>>>
>>>Or even something completely different. As far as I can see, neither a human
>>>nor a computer would be able to determine with absolute certainty how this
>>>should be handled.
>>>
>>>
>>>A) (ii) If you convert the file to a format that is supported by the
>>>validation tool, how would you easily handle the situation where the
>>>validation tool also alters the file (e.g. it may have an "auto-fix" feature
>>>for certain common problems)?
>>>
>>>
>>>Looking forward to your comments!
>>>
>>>Best regards,
>>>Magnus
>>>
>>>-----Original Message-----
>>>From: Andrzej Zydron [mailto:azydron@xml-intl.com] 
>>>Sent: Sunday, April 11, 2004 2:13 PM
>>>Cc: xliff-seg@lists.oasis-open.org
>>>Subject: Re: [xliff-seg] Segmentation and filters
>>>
>>>Hi Magnus,
>>>
>>>Many thanks for your reply.
>>>
>>>The discussions to date have been very useful and have allowed me to explore
>>>in my own mind some of the issues involved more thoroughly.
>>>
>>>As a consequence I have one firm request of the XLIFF TC, and that is for
>>>clarification regarding a mechanism for signifying that a translator requires
>>>that the translation of a <trans-unit> is "merged" with one or more preceding
>>><trans-unit> elements.
>>>
>>>In the following example I have extended the "translate" attribute merely to
>>>show the type of effect required:
>>>
>>><trans-unit id="1.1">
>>><source>Badly worded source text first sentence</source>
>>><target>Zle wyslowione zdanie pierwsze oraz drugie jako jedno
>>>zdanie</target>
>>></trans-unit>
>>><trans-unit id="1.2" translate="merged">
>>><source>Second part of badly worded sentence</source>
>>><target/>
>>></trans-unit>
>>>
>>>
>>>
>>>A) Regarding Magnus' comments on the use of multiple converted XLIFF files:
>>>
>>>i. I am surprised by your statement that "there is no way to automatically
>>>transfer the changes from the XLIFF file received from the content owner into
>>>the previously translated segmented XLIFF file". I have been doing just this
>>>on a daily basis for the past 4 years.
>>>
>>>ii. I must also take issue with the statement "the use of
>>>specialized tools, e.g. to validate or adapt the underlying data in the
>>>original XLIFF format, cannot be used directly on the segmented version of
>>>the file". It is only a problem if you want it to be a problem. Please
>>>remember 
>>>that XML is extremely flexible. If the presence of a namespace precludes
>>>certain 
>>>activities then you just remove it in the version of the file that is used
>>>for 
>>>validation. Alternatively the XLIFF specification could allow for the
>>>specific 
>>>presence of the required namespace.
>>>
>>>
>>>B) Regarding this point:
>>>
>>>I would like to posit that there is no such thing as "soft" or "hard" 
>>>segmentation. This artificial taxonomy merely serves to complicate the 
>>>discussion. There is just segmentation. You may wish to disregard the
>>>original segmentation or preserve it, but in the end it is all segmentation
>>>and there is nothing "soft" or "hard" about it.
>>>
>>>Best Regards,
>>>
>>>AZ
>>>
>>>
>>>Regarding your answers
>>>
>>>Magnus Martikainen wrote:
>>>
>>>
>>>
>>>
>>>>Hi Andrzej,
>>>>
>>>>Thank you for your email - I really appreciate you taking time on this.
>>>>
>>>>A) Regarding the use of multiple converted XLIFF files:
>>>>
>>>>I am not sure I fully understand what you are saying, so please correct me
>>>>if I'm wrong.
>>>>
>>>>I assume that what you are suggesting is that the segmented (i.e. double
>>>>XLIFF converted) file should be "the" XLIFF file from this point on, and
>>>>that it is never again necessary to work on the original XLIFF format?
>>>>
>>>>I can see a couple of important issues with that. For example consider this
>>>>scenario:
>>>>- The original XLIFF file is produced by the content owner and handed off to
>>>>a localisation agency for translation.
>>>>- The localisation agency segments the XLIFF file in order to achieve
>>>>maximum reuse from the translation memory.
>>>>- The segmented XLIFF file is processed and translated, and is ready to be
>>>>handed off to the content owner.
>>>>
>>>>The content owner would clearly expect to receive an XLIFF file of the same
>>>>type that they handed off to translation. If they receive the segmented
>>>>XLIFF file they may not even have tools to be able to get their content back
>>>>into the system it originates from.
>>>>
>>>>Thus, in order to deliver the XLIFF file to the content owner the
>>>>localisation agency must convert it back to its original format.
>>>>
>>>>We now have a situation where for the content owner "the" XLIFF file is not
>>>>the segmented XLIFF file, but rather the file they received back from the
>>>>localisation agency, while for the localisation agency the segmented XLIFF
>>>>file is still "the" XLIFF file (since it corresponds to the segmentation
>>>>used in the translation memory).
>>>>
>>>>Assume that the content owner needs to make changes to the translated XLIFF
>>>>file for one reason or another. It is clearly desirable for the
>>>>localisation agency to update their linguistic assets with these changes so
>>>>that the same changes never need to be made again. For this reason the
>>>>content owner sends the updated XLIFF file to the localisation agency.
>>>>
>>>>In this situation there is no way to automatically transfer the changes from
>>>>the XLIFF file received from the content owner into the previously
>>>>translated segmented XLIFF file (assuming the localisation agency has kept
>>>>it). This fact is independent of whether the data is in XML or any other
>>>>format. The problem is that linguistic knowledge is required to safely and
>>>>correctly identify source and target translation pairs from two bodies of
>>>>text. This we can never achieve with XML transformations...
>>>>
>>>>
>>>>Another important issue that I previously pointed out is that specialised
>>>>tools, e.g. to validate or adapt the underlying data in the original XLIFF
>>>>format, cannot be used directly on the segmented version of the file. If the
>>>>content owner provides such a tool to the localisation agency it may not be
>>>>easy for them to use it.
>>>>
>>>>
>>>>All of these problems go away if we can keep the data in one XLIFF format
>>>>throughout the localisation process. This reduces complexity immensely for
>>>>the whole process.
>>>>
>>>>
>>>>B) Regarding your additional comments:
>>>>
>>>>1) I fully and completely agree with you that there will always be cases
>>>>where segmentation must be adjusted or adapted afterwards.
>>>>In fact that is the major reason that I suggest treating the segment
>>>>boundaries as "soft" boundaries inside a <trans-unit>, as opposed to the
>>>>"hard" boundaries of the <trans-unit>.
>>>>As you correctly observe, the segment boundaries can be reconfigured by any
>>>>tool that processes the files. That is the whole point, and that is why I
>>>>refer to them as "soft" boundaries.
>>>>The difference between changing segmentation for <trans-unit> and "soft"
>>>>segments is that the latter does not require the creation of a new XLIFF
>>>>format. Any tools for processing the original XLIFF format can still be
>>>>used, even though the linguistic content has been segmented.
>>>>
>>>>2) The document introduces the <segment> element as one suggestion on how
>>>>the segmentation mechanism could be implemented in XLIFF - simply because it
>>>>would be nearly impossible to give any examples if I did not choose one way
>>>>to express the segmentation. However I hope I also made it clear that this
>>>>is just one of the possibilities we have for implementation; there are many
>>>>other options which we should carefully consider and discuss once we have
>>>>all agreed that there is an actual need for it.
>>>>
>>>>My suggestion would be to focus the discussion on the "why" until we reach
>>>>consensus and then look closer into the "how".
>>>>
>>>>Thanks again for taking your time on this.
>>>>
>>>>Cheers,
>>>>Magnus
>>>>
>>>>
>>>>-----Original Message-----
>>>>From: Andrzej Zydron [mailto:azydron@xml-intl.com] 
>>>>Sent: Thursday, April 01, 2004 12:32 PM
>>>>To: xliff-seg@lists.oasis-open.org
>>>>Subject: Re: [xliff-seg] Segmentation and filters
>>>>
>>>>Hi Magnus,
>>>>
>>>>Thank you for your reply. Sorry it has taken me a few days to reply, but I
>>>>have been very busy and the issues raised are quite complex.
>>>>
>>>>Having spent a long time analyzing the contents of your email I still do not
>>>>quite understand your statement "- It is NOT possible to automatically
>>>>identify the translations for each of the source language segments in a way
>>>>that is guaranteed to be correct.": We have the alignment in detail in the
>>>>segmented XLIFF file. We also have it in the merged segmented namespace
>>>>version of the original XLIFF file.
>>>>
>>>>I think (correct me if I am wrong) that you are disregarding the segmented
>>>>XLIFF file completely from the equation as concerns building leveraged
>>>>memory. This would be wrong - the segmented XLIFF file is in fact "the"
>>>>XLIFF file. The original version is merely an XML document that we have
>>>>extracted from.
>>>>
>>>>What you must not lose sight of is that an XLIFF file is an XML file and
>>>>can be treated just as any XML file can.
>>>>
>>>>I have also reread your original HTML document in detail and have the
>>>>following observations:
>>>>
>>>>1) I fail to see any real difference between the so-called "hard boundaries"
>>>>and what is being proposed. I think there is a danger here of confusing a
>>>>standard with a methodology. This confusion exists at two levels:
>>>>
>>>>i. At the standard level any element that is permissible at a given point in
>>>>the DOM can be used. So if you allow <segment> as a child of <trans-unit>
>>>>there is no control over who will use it, or how.
>>>>ii. At the application level the document seems to imply that your way of
>>>>doing segmentation is always going to be without any need for correction.
>>>>This is not stated in so many words, which is what makes matters more
>>>>confusing, but it is what I have inferred from the fact that no corrective
>>>>mechanism for segmentation errors has been provided. Segmentation errors
>>>>(just like death and taxes) are one of the few certainties in life because:
>>>>
>>>>a) No segmentation algorithm, however well conceived, can account for all
>>>>eventualities. There is always a possibility of a new set of circumstances
>>>>which will result in incorrect segmentation. It is in the nature of the
>>>>infinite possible combinations of words/acronyms/abbreviations etc. in
>>>>translatable text.
>>>>b) No segmentation algorithm can account for badly authored source text that
>>>>defies simple segment-for-segment translation.
>>>>c) There is no control over how the <segment> element will be used. You
>>>>cannot mandate any single algorithm, which in any event will by its very
>>>>nature never be totally correct.
>>>>d) Once <segment> elements are introduced they will always become "hard
>>>>boundaries", because the segmented XLIFF file may be sent to another
>>>>supplier. One cannot assume that this will always occur within one system.
>>>>
>>>>So at the very least the <segment> element will require a corrective
>>>>mechanism for merging <segment> elements.
>>>>
>>>>The problem with the introduction of the <segment> element is that it
>>>>imposes a compatibility problem for companies who have already written
>>>>software to handle XLIFF files. The relationship of the <trans-unit> to
>>>><source> and <target> elements is an important one which many companies
>>>>have written into their software. With the advent of <segment> elements
>>>>this relationship becomes much more complex.
>>>>
>>>>For me personally it is not such an issue as I would just do a secondary
>>>>extraction anyway.
>>>>
>>>>
>>>>Best Regards,
>>>>
>>>>AZ
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>>Hi Andrzej,
>>>>>
>>>>>Thank you for your reply. I'm afraid my explanation of the issue with
>>>>>identifying segment boundaries after backward conversion of a "double
>>>>>converted" XLIFF file may not have been clear enough. I believe I confused
>>>>>matters by introducing unnecessary complexity where a single source language
>>>>>sentence is translated as more than one sentence.
>>>>>
>>>>>My intention was to illustrate the fact that once the "double converted"
>>>>>XLIFF file has been converted back to its initial XLIFF format there is no
>>>>>way to safely bring it back to its state as a "double converted" XLIFF file
>>>>>again. This may be necessary e.g. in order to update a translation memory
>>>>>with changes made after the backward conversion.
>>>>>
>>>>>Let me simplify my original example to demonstrate the issue.
>>>>>
>>>>>Assume that the translated "double converted" XLIFF file looks as it did in
>>>>>the original example:
>>>>>
>>>>><trans-unit id="1.1">
>>>>><source xml:lang="en-US">Long sentence.</source>
>>>>><target xml:lang="sv-SE" state="translated">En lång mening.</target>
>>>>></trans-unit>
>>>>><trans-unit id="1.2">
>>>>><source xml:lang="en-US">Short sentence.</source>
>>>>><target xml:lang="sv-SE" state="translated">Kort mening.</target>
>>>>></trans-unit>
>>>>>
>>>>>The reason we segmented the file was to achieve the best possible recycling
>>>>>of previous translations from a translation memory, so each trans-unit
>>>>>now perfectly matches the segmentation used in the TM.
>>>>>When translation is finished the translation memory will contain one
>>>>>translation unit for each of these segments. If exported to TMX it could
>>>>>look like this:
>>>>>
>>>>><tmx ...>
>>>>>...
>>>>><body>
>>>>><tu id="1">
>>>>> <tuv lang="EN-US">
>>>>>   <seg>Long sentence.</seg>
>>>>> </tuv>
>>>>> <tuv lang="SV-SE">
>>>>>   <seg>En lång mening.</seg>
>>>>> </tuv>
>>>>></tu>
>>>>>
>>>>><tu id="2">
>>>>> <tuv lang="EN-US">
>>>>>   <seg>Short sentence.</seg>
>>>>> </tuv>
>>>>> <tuv lang="SV-SE">
>>>>>   <seg>Kort mening.</seg>
>>>>> </tuv>
>>>>></tu>
>>>>></body>
>>>>></tmx>
>>>>>
>>>>>The translations can be changed as part of the editing and proof-reading
>>>>>process, and it is obviously desirable to update the translation memory with
>>>>>any such changes. This is easy as long as the file remains in this "double
>>>>>converted" format, since each <trans-unit> corresponds to a single
>>>>>translation unit in the translation memory. A tool can simply iterate over
>>>>>each segment in the XLIFF file, find it in the translation memory and update
>>>>>the corresponding translation.
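
For illustration, a minimal Python sketch of the update loop described above,
assuming a dict-based translation memory keyed by source segment text and the
simplified trans-unit structure used in these examples (not any particular TM
product's API):

import xml.etree.ElementTree as ET

def update_tm_from_segmented_xliff(segmented_xliff_path, tm):
    """tm is a dict mapping source segment text -> target segment text.
    Each trans-unit in the segmented file corresponds to exactly one TM
    entry, so the update is a straight iteration."""
    for tu in ET.parse(segmented_xliff_path).iter('trans-unit'):
        source = tu.find('source')
        target = tu.find('target')
        if source is None or target is None:
            continue
        # Overwrite (or add) the entry with the possibly edited translation.
        tm[source.text] = target.text
    return tm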
>>>>>
>>>>>At some point the "double converted" XLIFF file must be converted back to
>>>>>its original XLIFF format. In this example it will look like this:
>>>>>
>>>>><trans-unit id="1">
>>>>><source xml:lang="en-US">Long sentence. Short sentence.</source>
>>>>><target xml:lang="sv-SE" state="translated">En lång mening. Kort
>>>>>mening.</target>
>>>>></trans-unit>
>>>>>
>>>>>Assume that changes to the translation are needed, and the file is updated
>>>>>to look like this:
>>>>>
>>>>><trans-unit id="1">
>>>>><source xml:lang="en-US">Long sentence. Short sentence.</source>
>>>>><target xml:lang="sv-SE" state="translated">Lång mening. Kort
>>>>>mening.</target>
>>>>></trans-unit>
>>>>>
>>>>>Obviously it is still desirable to also update the translation memory with
>>>>>these changes, to avoid having to correct the same translation again in the
>>>>>future. This is where we run into problems.
>>>>>- In order to update the translation memory we need to convert the source
>>>>>and target of the <trans-unit> into text segmented according to the rules
>>>>>used for the translation memory.
>>>>>
>>>>>Unfortunately this is no longer possible.
>>>>>- It is possible to reconstruct the segmented source content, by simply
>>>>>applying the same algorithm that was used to create the original version of
>>>>>the "double converted" XLIFF file.
>>>>>- It is NOT possible to automatically identify the translations for each of
>>>>>the source language segments in a way that is guaranteed to be correct. This
>>>>>requires an alignment process, which in order to succeed well requires a
>>>>>linguistic understanding of both the source and the target languages. There
>>>>>may be more than one way to divide the target content into translations
>>>>>matching the source segments, e.g. if a sentence with new information has
>>>>>been introduced in the target. In such a case it would even be impossible
>>>>>for a human to ensure that the source and target content is correctly
>>>>>matched up in segments that correspond to the translation memory.
>>>>>
>>>>>In the example above it looks easy to identify the corresponding target
>>>>>language segments as the individual sentences:
>>>>>
>>>>><trans-unit id="1.1">
>>>>><source xml:lang="en-US">Long sentence.</source>
>>>>><target xml:lang="sv-SE" state="translated">Lång mening.</target>
>>>>></trans-unit>
>>>>><trans-unit id="1.2">
>>>>><source xml:lang="en-US">Short sentence.</source>
>>>>><target xml:lang="sv-SE" state="translated">Kort mening.</target>
>>>>></trans-unit>
>>>>>
>>>>>However we must not forget that this involves a process of "guessing" which
>>>>>parts of the source and target belong together, as explained above. It
>>>>>was to illustrate the problem with the "guessing" that I in my original
>>>>>example introduced a change where the long sentence had been translated into
>>>>>two target language sentences:
>>>>>
>>>>><trans-unit id="1">
>>>>><source xml:lang="en-US">Long sentence. Short sentence.</source>
>>>>><target xml:lang="sv-SE" state="final">Lång mening. Mer mening. Kort
>>>>>mening.</target>
>>>>></trans-unit>
>>>>>
>>>>>Here a tool that relies on matching up sentences between the source and the
>>>>>target would have several options. From the tool's perspective the correct
>>>>>segmentation could be either:
>>>>>
>>>>><trans-unit id="1.1">
>>>>><source xml:lang="en-US">Long sentence.</source>
>>>>><target xml:lang="sv-SE" state="translated">Lång mening. Mer
>>>>>mening.</target>
>>>>></trans-unit>
>>>>><trans-unit id="1.2">
>>>>><source xml:lang="en-US">Short sentence.</source>
>>>>><target xml:lang="sv-SE" state="translated">Kort mening.</target>
>>>>></trans-unit>
>>>>>
>>>>>Or:
>>>>>
>>>>><trans-unit id="1.1">
>>>>><source xml:lang="en-US">Long sentence.</source>
>>>>><target xml:lang="sv-SE" state="translated">Lång mening. </target>
>>>>></trans-unit>
>>>>><trans-unit id="1.2">
>>>>><source xml:lang="en-US">Short sentence.</source>
>>>>><target xml:lang="sv-SE" state="translated">Kort mening. Mer
>>>>>mening.</target>
>>>>></trans-unit>
>>>>>
>>>>>Without understanding the source and target languages there is no way the
>>>>>tool can know which of these options is correct.
>>>>>
>>>>>I hope this clarifies matters a bit.
>>>>>
>>>>>Best regards,
>>>>>Magnus
>>>>>
>>>>>-----Original Message-----
>>>>>From: Andrzej Zydron [mailto:azydron@xml-intl.com] 
>>>>>Sent: Saturday, March 27, 2004 11:07 AM
>>>>>To: xliff-seg@lists.oasis-open.org
>>>>>Subject: Re: [xliff-seg] Segmentation and filters
>>>>>
>>>>>Hi Magnus,
>>>>>
>>>>>Thank you for your reply. It looks as if we are in total agreement on
>>>>>point 1).
>>>>>
>>>>>As to point 2), thank you for pointing out the potential problems involved
>>>>>with segmentation which can arise where:
>>>>>
>>>>>a) The initial segmentation was incorrect.
>>>>>b) The translation requires that more than one segment is rendered as a
>>>>>single segment in the target language.
>>>>>
>>>>>It would be very positive if we could agree that these two occurrences are
>>>>>definitive regarding this type of problem. In my experience they are.
>>>>>
>>>>>From the semantic point of view a common term such as "merged segments"
>>>>>would allow us to put these into a single common category.
>>>>>
>>>>>There is another category which I have come across, where there is no
>>>>>equivalent translation possible for a given segment because, for example,
>>>>>the segment relates to a particular address of a subsidiary that just does
>>>>>not exist. In xml:tm these are termed "no-equivalent translation available"
>>>>>or "no-equiv" (for short).
>>>>>
>>>>>Neither instance need pose any problem during the merging process. The
>>>>>target element for the first trans-unit contains the translation, while the
>>>>>following trans-units are flagged as being merged. Using your example the
>>>>>effect is as follows:
>>>>>
>>>>><trans-unit id="1.1">
>>>>><source xml:lang="en-US">Long sentence.</source>
>>>>><target xml:lang="sv-SE" state="translated">Lång mening. Mer mening. Kort
>>>>>mening.</target>
>>>>></trans-unit>
>>>>><trans-unit id="1.2">
>>>>><source xml:lang="en-US">Short sentence.</source>
>>>>><target xml:lang="sv-SE" state="merged"/>
>>>>></trans-unit>
>>>>>
>>>>>The XLIFF target "state" attribute would require the addition of the
>>>>>"merged" value, or another attribute other than state can be used. You can
>>>>>now use this intelligence to load your leveraged memory, which will know
>>>>>that if "Long sentence." is followed by "Short sentence." the equivalent
>>>>>translated text is "Lång mening. Mer mening. Kort mening." and requires
>>>>>merging. There is no restriction on the number of merged segments. A merged
>>>>>segment always relates to the preceding non-merged segment.
>>>>>
>>>>>When the segmented XLIFF is merged back into the original XLIFF file the
>>>>>effect is as follows:
>>>>>
>>>>><trans-unit id="1">
>>>>><source xml:lang="en-US"><tm:tu id="1.1">Long sentence.</tm:tu>
>>>>>                         <tm:tu id="1.2">Short sentence.</tm:tu>
>>>>></source>
>>>>><target xml:lang="sv-SE"><tm:tu id="1.1">Lång mening. Mer mening. Kort
>>>>>mening.</tm:tu>
>>>>>                         <tm:tu id="1.2" flag="merged"/>
>>>>></target>
>>>>></trans-unit>
>>>>>
>>>>>and the stripped out namespace version will look like this:
>>>>>
>>>>><trans-unit id="1">
>>>>><source xml:lang="en-US">Long sentence. Short sentence.</source>
>>>>><target xml:lang="sv-SE">Lång mening. Mer mening. Kort mening.</target>
>>>>></trans-unit>
>>>>>
>>>>>Which you can also load into memory.
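
A minimal Python sketch of loading a leveraged memory that understands the
proposed "merged" flag: a run of merged trans-units is folded into the
preceding translated unit, so the memory key becomes the concatenated source
text. The state value "merged" follows the example above; the dict-based
memory and the function name are illustrative assumptions, not an existing API.

import xml.etree.ElementTree as ET

def load_leveraged_memory(segmented_xliff_path):
    """Return a dict mapping (possibly concatenated) source text -> target."""
    memory = {}
    pending_source = None   # concatenated source of the current group
    pending_target = None   # translation covering the whole group
    for tu in ET.parse(segmented_xliff_path).iter('trans-unit'):
        src_el = tu.find('source')
        if src_el is None:
            continue
        source = (src_el.text or '').strip()
        target = tu.find('target')
        if target is not None and target.get('state') == 'merged':
            # Merged segment: extend the source of the preceding group.
            if pending_source is not None:
                pending_source += ' ' + source
            continue
        # A new group starts; flush the previous one first.
        if pending_source is not None:
            memory[pending_source] = pending_target
        pending_source = source
        pending_target = target.text if target is not None else None
    if pending_source is not None:
        memory[pending_source] = pending_target
    return memory

With the example above this yields one entry mapping "Long sentence. Short
sentence." to "Lång mening. Mer mening. Kort mening.", which is exactly the
merge hint AZ describes.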
>>>>>
>>>>>This approach does not restrict or hinder the translation or loading of
>>>>>leveraged memories in any way. In fact it can supply the translation memory
>>>>>software with additional hints that can automatically compensate for
>>>>>incorrect segmentation should it occur again.
>>>>>segmentation should it occur again.
>>>>>
>>>>>In a similar vein we can also handle non-equivalent segments as in:
>>>>>
>>>>><trans-unit id="1">
>>>>><source xml:lang="en-US">The address of our Florida branch is 1 Manhattan
>>>>>drive, Orlando FLA123</source>
>>>>><target xml:lang="sv-SE" state="no-equiv"/>
>>>>></trans-unit>
>>>>>
>>>>>Although I suspect that there may well be another mechanism already built
>>>>>into XLIFF to handle this.
>>>>>
>>>>>XML provides such a rich vocabulary and syntax that it is very easy to
>>>>>overcome any segmentation issues in XLIFF.
>>>>>
>>>>>Best regards,
>>>>>
>>>>>AZ
>>>>>
>>>>>
>>>>>
>>>>>Magnus Martikainen wrote:
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>>Hi all,
>>>>>>
>>>>>>Thanks very much Andrzej for your clear and structured arguments and
>>>>>>examples, this is very useful for further discussions on this topic.
>>>>>>
>>>>>>Here are my comments to this thread:
>>>>>>
>>>>>>1) I agree that segmentation should not be mandatory. 
>>>>>>However I am also of the opinion that segmentation should always be
>>>>>>allowed, whether the original content was supposedly segmented during
>>>>>>extraction or not. The reason is that the detection of the best possible
>>>>>>segment boundaries may still need to be adapted to best fit the tools and
>>>>>>the translation memories used during the localisation, which could use
>>>>>>slightly different segmentation. (A common example would be the handling of
>>>>>>tags at sentence boundaries etc.) Since the goal for the user must be to
>>>>>>achieve maximum leverage from their translation memory resources it may be
>>>>>>necessary to adjust segmentation also in such cases.
>>>>>>As a side effect of this, if we agree that we always want to "allow"
>>>>>>segmentation of the content I see no need for an explicit
>>>>>>segmented="true/false" attribute.
>>>>>>
>>>>>>
>>>>>>2) I can think of a couple of situations where a "double conversion" as you
>>>>>>are suggesting would cause problems:
>>>>>>
>>>>>>a) It applies new "hard" boundaries to the segments. Thus segmentation
>>>>>>cannot be changed later, e.g. during interactive translation. Sometimes it
>>>>>>is necessary or desirable to change the default segmentation to accommodate
>>>>>>translation needs while working on the document. Examples include:
>>>>>>- the need to adjust segmentation that has been incorrectly applied (e.g.
>>>>>>an abbreviation in the middle of a sentence that has been wrongly
>>>>>>interpreted by the segmentation tool as the end of that sentence).
>>>>>>- the occasional need to translate two or more source sentences into one
>>>>>>target language sentence for it to be a meaningful translation
>>>>>>
>>>>>>
>>>>>>b) During backward conversion of the "doubly converted" XLIFF file to its
>>>>>>original XLIFF format the segment boundaries are lost.
>>>>>>If changes are made to the content of the XLIFF file after it has been
>>>>>>converted back to its original XLIFF format it is no longer possible to get
>>>>>>those changes back into the "double converted" XLIFF document, e.g. in order
>>>>>>to update a translation memory with those changes. Once converted back, the
>>>>>>segment boundaries in both the source and target segments are gone. (The
>>>>>>source segmentation can perhaps be re-created, but it is no longer possible
>>>>>>to determine with certainty the correct corresponding target segments.)
>>>>>>
>>>>>>Example: If the "working" XLIFF file (the segmented version which is used
>>>>>>to interact with the translation memory during translation) after
>>>>>>translation contains this:
>>>>>>
>>>>>><trans-unit id="1.1">
>>>>>><source xml:lang="en-US">Long sentence.</source>
>>>>>><target xml:lang="sv-SE" state="translated">En lång mening.</target>
>>>>>></trans-unit>
>>>>>><trans-unit id="1.2">
>>>>>><source xml:lang="en-US">Short sentence.</source>
>>>>>><target xml:lang="sv-SE" state="translated">Kort mening.</target>
>>>>>></trans-unit>
>>>>>>
>>>>>>Both of these trans-units belong to the same <trans-unit> in the original
>>>>>>XLIFF file, and when the XLIFF file is converted back to its original XLIFF
>>>>>>format it could look like this (depending on the content of the skeleton
>>>>>>file):
>>>>>>
>>>>>><trans-unit id="1">
>>>>>><source xml:lang="en-US">Long sentence. Short sentence.</source>
>>>>>><target xml:lang="sv-SE" state="translated">En lång mening. Kort
>>>>>>mening.</target>
>>>>>></trans-unit>
>>>>>>
>>>>>>Now someone decides at the last minute that the translation needs to be
>>>>>>changed - the long sentence is for some reason better translated as two
>>>>>>sentences. This change is approved and signed off. The XLIFF file is changed
>>>>>>into:
>>>>>>
>>>>>><trans-unit id="1">
>>>>>><source xml:lang="en-US">Long sentence. Short sentence.</source>
>>>>>><target xml:lang="sv-SE" state="final">Lång mening. Mer mening. Kort
>>>>>>mening.</target>
>>>>>></trans-unit>
>>>>>>
>>>>>>Unfortunately there is no easy way to update the translation memory
>>>>>>with these changes since the original segment boundaries that were used
>>>>>>during translation were lost.
>>>>>>Tools can of course try to automatically "guess" the segment boundaries in
>>>>>>the source and target and somehow match them up, but this is not a trivial
>>>>>>task as can be seen from this example. There is no way an automatic tool can
>>>>>>determine if the middle sentence in the modified target should be paired
>>>>>>with the first or the last sentence (if either). Thus there is no way to
>>>>>>safely update the translation memory automatically with these changes.
>>>>>>
>>>>>>
>>>>>>c) The converted XLIFF file loses its "identity" or its direct connection
>>>>>>with the underlying data format.
>>>>>>Tools that have been developed to specifically process a particular file
>>>>>>type wrapped in XLIFF cannot be used on the "converted" XLIFF file since:
>>>>>>- the original skeleton is no longer available/usable
>>>>>>- some of the content in the original XLIFF file (in particular tags between
>>>>>>sentences) has moved into the new skeleton.
>>>>>>- the new skeleton has been created with a tool and process unknown to any
>>>>>>other XLIFF tools.
>>>>>>
>>>>>>A typical example of a tool that would be useful to be able to run during
>>>>>>the localisation process is a verification/validation tool to ascertain that
>>>>>>the translated content can be converted back to a valid original format.
> 
>>>>>>Examples of validation tools:
>>>>>>- Tag verification, validation against the schema, DTD, or other rules that
>>>>>>the content must adhere to.
>>>>>>- Length verification to ensure that translated content does not exceed
>>>>>>length limitations (which may be specified explicitly in the XLIFF file).
>>>>>>
>>>>>>Both of these tasks require dealing with the underlying native data that the
>>>>>>XLIFF file wraps in order to perform their jobs. Due to the reasons stated
>>>>>>above the "doubly converted" XLIFF files cannot be used for this.
>>>>>>
>>>>>>
>>>>>>d) An additional level of unnecessary complexity is introduced, since it is
>>>>>>necessary to do an additional transformation/conversion of the XLIFF
>>>>>>document before it can be processed by the filter that created it.
>>>>>>In something as complex as a typical localisation project this is not a
>>>>>>factor to be neglected. If an average project has 100 files translated into
>>>>>>10 languages that means an additional 1000 file conversions are necessary to
>>>>>>complete the project. If the workflow for this is not entirely automated it
>>>>>>could mean that someone may need to use a tool to manually check each
>>>>>>individual file to determine which state it is in before the files can be
>>>>>>delivered or further processed.
>>>>>>If other tools used in the localisation process use the same approach of
>>>>>>converting the XLIFF file to a new XLIFF format the complexity is
>>>>>>multiplied... All this can be avoided if we support the notion of segments
>>>>>>directly in XLIFF - then the very same XLIFF file can be used in all stages
>>>>>>of the process.
>>>>>>
>>>>>>
>>>>>>Looking forward to further comments and discussions on this topic!
>>>>>>
>>>>>>Best regards,
>>>>>>Magnus Martikainen
>>>>>>TRADOS Inc.
>>>>>>
>>>>>>-----Original Message-----
>>>>>>From: Andrzej Zydron [mailto:azydron@xml-intl.com] 
>>>>>>Sent: Tuesday, March 23, 2004 2:07 PM
>>>>>>To: xliff-seg@lists.oasis-open.org
>>>>>>Subject: [xliff-seg] Segmentation and filters
>>>>>>
>>>>>>Hi,
>>>>>>
>>>>>>First of all I would like to thank Magnus for the hard work he has put 
>>>>>>in so far and the detailed document that he has prepared. This has 
>>>>>>provided a clear starting point for further discussions.
>>>>>>
>>>>>>To kick off this thread I would like to state my views on the 
>>>>>>segmentation issue:
>>>>>>
>>>>>>1) Segmentation within XLIFF should not be mandated. It should be
>>>>>>optional. There are implementations such as xml:tm where segmentation is
>>>>>>done before extraction. It is also quite easy to envisage situations
>>>>>>where XLIFF is the output of an existing translation workbench system
>>>>>>that has already segmented and pre-matched data for sending out to a
>>>>>>translator who will import it into an XLIFF-aware editing environment.
>>>>>>
>>>>>>I can also see Magnus' point that quite often XLIFF will contain 
>>>>>>unsegmented data.
>>>>>>
>>>>>>One solution would be to provide an optional "segmented" attribute at
>>>>>>the <file> element level which states that the data has already been
>>>>>>segmented, with a default value of "false". If the data has been
>>>>>>segmented then an xlink attribute pointing to the SRX URL could also be
>>>>>>provided.
> 
>>>>>>2) One way of handling segmentation within XLIFF is to create a
>>>>>>secondary XLIFF document from the current XLIFF document that has a
>>>>>>separate <trans-unit> element for each segment. This would effectively
>>>>>>be a segmentation extraction of the original XLIFF document. This has
>>>>>>the one significant advantage that no further extensions are required to
>>>>>>the XLIFF standard. It does away with all the potential complexity of
>>>>>>trying to nest <trans-unit> elements or add workable syntax to cope with
>>>>>>multiple source and target segments within a <trans-unit>.
>>>>>>
>>>>>>Because XLIFF is a well defined XML format it is very easy to write an 
>>>>>>extraction + segmentation filter for it to provide an XLIFF file where 
>>>>>>the <trans-unit> elements are at the segment level, along with a 
>>>>>>skeleton file for merging back.
>>>>>>
>>>>>>After translation you can elect to store leveraged memory at both the
>>>>>>segmented and unsegmented levels.
>>>>>>
>>>>>>Here is an example based on Magnus' data:
>>>>>>
>>>>>>Step 1: Original XLIFF file:
>>>>>>
>>>>>><body>
>>>>>><trans-unit id="1">
>>>>>> <source xml:lang="en-US">The Document Title</source>
>>>>>></trans-unit>
>>>>>><trans-unit id="2">
>>>>>> <source xml:lang="en-US">First sentence. <bpt 
>>>>>>id="1">[ITALIC:</bpt>This is an important sentence.<ept 
>>>>>>id="1">]</ept></source>
>>>>>></trans-unit>
>>>>>><trans-unit id="3">
>>>>>> <source xml:lang="en-US">Ambiguous sentence. More <bpt 
>>>>>>id="1">[LINK-to-toc:</bpt>content<ept id="1">]</ept>.</source>
>>>>>></trans-unit>
>>>>>></body>
>>>>>>
>>>>>>Step 2: Introduce namespace segmentation into XLIFF file
>>>>>>
>>>>>><body xmlns:tm="http://www.xml-intl.com/dtd/tm.xsd">
>>>>>><trans-unit id="1">
>>>>>> <source xml:lang="en-US"><tm:tu id="1.1">The Document 
>>>>>>Title</tm:tu></source>
>>>>>></trans-unit>
>>>>>><trans-unit id="2">
>>>>>> <source xml:lang="en-US"><tm:tu id="2.1">First sentence.</tm:tu> 
>>>>>><bpt id="1">[ITALIC:</bpt><tm:tu id="2.2">This is an important 
>>>>>>sentence.</tm:tu><ept id="1">]</ept></source>
>>>>>></trans-unit>
>>>>>><trans-unit id="3">
>>>>>> <source xml:lang="en-US"><tm:tu id="3.1">Ambiguous 
>>>>>>sentence.</tm:tu> <tm:tu id="3.2">More <bpt 
>>>>>>id="1">[LINK-to-toc:</bpt>content<ept id="1">]</ept>.</tm:tu></source>
>>>>>></trans-unit>
>>>>>></body>
>>>>>>
>>>>>>Step 3: Using a simple XSLT transformation create a new segmented XLIFF file:
>>>>>>
>>>>>><body segmented="true" srx="http://www.xml-intl.com/srx/en-US.srx">
>>>>>><trans-unit id="1.1">
>>>>>> <source xml:lang="en-US">The Document Title</source>
>>>>>></trans-unit>
>>>>>><trans-unit id="2.1">
>>>>>> <source xml:lang="en-US">First sentence.</source>
>>>>>></trans-unit>
>>>>>><trans-unit id="2.2">
>>>>>> <source xml:lang="en-US">This is an important sentence.</source>
>>>>>></trans-unit>
>>>>>><trans-unit id="3.1">
>>>>>> <source xml:lang="en-US">Ambiguous sentence.</source>
>>>>>></trans-unit>
>>>>>><trans-unit id="3.1">
>>>>>> <source xml:lang="en-US">More <bpt 
>>>>>>id="1">[LINK-to-toc:</bpt>content<ept id="1">]</ept>.</source>
>>>>>></trans-unit>
>>>>>></body>
>>>>>>
>>>>>>And Skeleton file:
>>>>>>
>>>>>><body xmlns:tm="http://www.xml-intl.com/dtd/tm.xsd">
>>>>>><trans-unit id="1">
>>>>>> <source xml:lang="en-US"><tm:tu id="1.1"><ext 
>>>>>>id="1.1"/></tm:tu></source>
>>>>>></trans-unit>
>>>>>><trans-unit id="2">
>>>>>> <source xml:lang="en-US"><tm:tu id="2.1"><ext id="2.1"/></tm:tu> 
>>>>>><bpt id="1">[ITALIC:</bpt><tm:tu id="2.2"><ext id="2.2"/></tm:tu><ept 
>>>>>>id="1">]</ept></source>
>>>>>></trans-unit>
>>>>>><trans-unit id="3">
>>>>>> <source xml:lang="en-US"><tm:tu id="3.1"><ext id="3.1"/></tm:tu> 
>>>>>><tm:tu id="3.2"><ext id="3.2"/></tm:tu></source>
>>>>>></trans-unit>
>>>>>></body>
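
A minimal Python sketch of the per-segment extraction performed in this step,
shown in place of the XSLT transformation mentioned above: it walks the tm:tu
elements of the namespace-annotated file from Step 2 and yields one (id,
content) pair per segment. Skeleton generation (replacing each tm:tu with an
<ext> placeholder) is omitted, and the names are illustrative only.

import xml.etree.ElementTree as ET

TM_NS = 'http://www.xml-intl.com/dtd/tm.xsd'

def inner_xml(elem):
    """Serialise an element's mixed content: its leading text plus each
    child element (ET.tostring also includes a child's tail text)."""
    parts = [elem.text or '']
    for child in elem:
        parts.append(ET.tostring(child, encoding='unicode'))
    return ''.join(parts)

def extract_segments(annotated_xliff_path):
    """Yield (segment id, segment content) for every tm:tu in the
    namespace-annotated XLIFF from Step 2."""
    tree = ET.parse(annotated_xliff_path)
    for tu in tree.iter('{%s}tu' % TM_NS):
        yield tu.get('id'), inner_xml(tu)

# Each pair would then be written out as a trans-unit of the segmented file,
# e.g. id "2.1" with source "First sentence."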
>>>>>>
>>>>>>Step 4: Put the segmented XLIFF file through whatever matching process you
>>>>>>want, to produce:
>>>>>>
>>>>>><body segmented="true" srx="http://www.xml-intl.com/srx/en-US.srx">
>>>>>><trans-unit id="1.1">
>>>>>> <source xml:lang="en-US">The Document Title</source>
>>>>>> <target xml:lang="sv-SE" state="translated" 
>>>>>>state-qualifier="leveraged-tm">Dokumentrubriken</target>
>>>>>></trans-unit>
>>>>>><trans-unit id="2.1">
>>>>>> <source xml:lang="en-US">First sentence.</source>
>>>>>> <target xml:lang="sv-SE" state="translated" 
>>>>>>state-qualifier="leveraged-tm">Första meningen.</target>
>>>>>></trans-unit>
>>>>>><trans-unit id="2.2">
>>>>>> <source xml:lang="en-US">This is an important sentence.</source>
>>>>>>   <alt-trans origin="translation memory" match-quality="80%">
>>>>>>     <source xml:lang="en-US">This is an extremely important 
>>>>>>sentence.</source>
>>>>>>     <target xml:lang="sv-SE">En mycket viktig mening.</target>
>>>>>>   </alt-trans>
>>>>>></trans-unit>
>>>>>><trans-unit id="3.1">
>>>>>> <source xml:lang="en-US">Ambiguous sentence.</source>
>>>>>> <target xml:lang="sv-SE" state="needs-review-translation">Omstridd 
>>>>>>mening.</target>
>>>>>>   <note annotates="target" from="Swedish Translator">This 
>>>>>>translation may not be appropriate. Please evaluate it carefully!</note>
> 
>>>>>></trans-unit>
>>>>>><trans-unit id="3.1">
>>>>>> <source xml:lang="en-US">More <bpt 
>>>>>>id="1">[LINK-to-toc:</bpt>content<ept id="1">]</ept>.</source>
>>>>>> <target xml:lang="sv-SE" state="translated">Ytterligare <bpt 
>>>>>>id="1">[LINK-to-toc:</bpt>innehåll<ept id="1">]</ept>.</target>
>>>>>></trans-unit>
>>>>>></body>
>>>>>>
>>>>>>
>>>>>>Step 5: Using nothing more than XSLT, merge the translated document
>>>>>>back, then strip out the segmented namespace elements using another
>>>>>>simple XSLT transformation, and you arrive at a translated XLIFF file
>>>>>>that is equal to the original source-language unsegmented file.
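
A minimal sketch of the strip-out part of this step, in Python rather than the
XSLT referred to above: the tm:tu wrappers and the xmlns:tm declaration are
removed from the serialised file. A plain regex is sufficient here only
because tm:tu carries nothing but simple attributes; this is an illustration,
not the actual transformation.

import re

def strip_tm_namespace(xliff_text):
    """Remove tm:tu wrappers and the xmlns:tm declaration from the merged,
    translated XLIFF so it matches the original unsegmented layout."""
    xliff_text = re.sub(r'</?tm:tu\b[^>]*>', '', xliff_text)     # drop wrappers
    xliff_text = re.sub(r'\s+xmlns:tm="[^"]*"', '', xliff_text)  # drop declaration
    return xliff_text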
>>>>>>
>>>>>>This approach has the benefit of requiring minimal or possibly no change
>>>>>>to the existing excellent XLIFF specification.
>>>>>>
>>>>>>Hope this helps kick off the thread.
>>>>>>
>>>>>>Regards,
>>>>>>
>>>>>>AZ
>>>>>>
>>>>>
>>>>>
> 

-- 


email - azydron@xml-intl.com
smail - Mr. A.Zydron
         24 Maybrook Gardens,
         High Wycombe,
         Bucks HP13 6PJ
Mobile +(44) 7966 477181
FAX    +(44) 870 831 8868
www - http://www.xml-intl.com





