Re: More on Segmentation in XLIFF

Hi all,

TRADOS has now joined OASIS, and I have just applied for membership in this group.

I noticed Tony’s posting of my email conversation with him and Yves on the topic of segmentation. Unfortunately the posted email thread did not contain the most interesting parts of our exchange, so I take the liberty of enclosing that part in this email (see below).

Best regards,

Magnus Martikainen

TRADOS Inc.

--------------------------------------

From: Magnus Martikainen [mailto:magnus@trados.com]

Sent: Tue 1/27/2004 8:11 PM

To: Yves Savourel

Cc: Jochen Hummel; Tony Jewtushenko; John Reid

Hi Yves,

Thank you very much for your extensive response. It is now clear to me that I somewhat misinterpreted the purpose of XLIFF 1.1 in its role as an interchange file format for localisation. I did not realize that it does not yet attempt to address the concept of segments during localisation phases.

I am glad to hear that there is an interest in addressing this, and I believe that this would be a very important step towards providing a common standard file format that is also more useful for processing and interchange of content between different tools during the most labour-intensive localisation phases, the translation and edit phases.

I also definitely agree with you that supporting translation recycling on different granularity levels (such as paragraph, sentence, or even phrase) is of vital importance in particular for future CMS integration capabilities.

If you are interested, as a discussion starter on how segment structure could be introduced in a future version of XLIFF, here are a couple of very early thoughts from my side.

Given your sample trans-unit:

<trans-unit id='100'>

<source xml:lang='en'>First sentence. Second sentence.</source>

</trans-unit>

We could allow a segment structure to be introduced for the content of the trans-unit, like this:

<trans-unit id='100'>

<segment-group>

<segment>

<source xml:lang='en'>First sentence.</source>

</segment>

<segment>

<source xml:lang='en'>Second sentence.</source>

</segment>

</segment-group>

</trans-unit>

We can allow <alt-trans> on both the <trans-unit> and the <segment> level, like this:

<trans-unit id='100'>

<segment-group>

<segment>

<source xml:lang='en'>First sentence.</source>

<alt-trans>

...

</alt-trans>

</segment>

<segment>

<source xml:lang='en'>Second sentence.</source>

<alt-trans>

...

</alt-trans>

</segment>

</segment-group>

<alt-trans>

...

</alt-trans>

</trans-unit>

It would then be up to the user interface tools to choose how to present the alt-trans content to the user depending on which level of content it is associated with, and the user should have the option to translate the entire trans-unit as one piece, to translate individual segments as they appear, or even to change the segment structure while translating, e.g. to merge and split sentences as is sometimes needed for producing conceptually valid translations.

Some comments on this model:

* I believe it is important to explicitly recognise that during the phases of localisation (e.g. between preparation, translation, editing, review) there may be a need for a (segment) structure inside a trans-unit, as different parts of the trans-unit content can be in different states during these phases. The segment extension of the XLIFF standard would be targeted directly at addressing the need for tool interoperability during these phases. (Attributes available for the <segment> element, which I have not addressed yet, should also reflect this.)

* I believe the segment structure should be on the trans-unit level, and not inside the current source and target elements. Conceptually it makes more sense to me to define a segment as having a source and target, as they have a pretty strong coupling. In fact, this is very similar to the way a trans-unit today has a source and a target.

* Likewise it is important to make the distinction between the "hard" boundaries that the <trans-unit> represents and the "soft" boundaries that the <segment> represents. The "hard" structure cannot ever be changed - it is set "in stone" by the filter that produces the XLIFF, and must remain intact for the backward conversion to work. The "soft" structure on the other hand can be changed as desired by any tool without affecting validity or functionality.

* For backward compatibility reasons the segment structure inside the trans-unit is optional. It can also easily be removed when interacting with XLIFF tools not yet supporting this feature. Hopefully this would be a straightforward operation that could be accomplished e.g. with an XSLT transformation.

* An interesting question is whether <bpt> and <ept> elements should in this model be matched only within a <segment> or if they may remain matched within the scope of the <trans-unit> even if they span multiple segments. Conceptually, at least from a translation memory point-of-view, it would be valuable to have them matched within the segments. On the other hand that would potentially require changing some of them to <it> when introducing segmentation. (This would by the way be the same if a trans-unit were divided into multiple parts in XLIFF 1.1, as one of your workarounds suggested.) As a one-way operation this is ok, but things get complicated if segmentation is later changed, or the segments are removed.
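
As a rough illustration of the removal step mentioned in the list above, here is a minimal Python sketch (standing in for the XSLT transformation) that collapses the proposed segment structure back into a plain trans-unit. The <segment-group> and <segment> elements are only the hypothetical names used in this proposal, not part of any XLIFF version. The single-space joiner and the restriction to plain-text sources without inline codes such as <bpt>/<ept> are assumptions of the sketch.

```python
import xml.etree.ElementTree as ET

# Hypothetical segmented trans-unit, following the proposal in this email.
SEGMENTED = (
    "<trans-unit id='100'>"
    "<segment-group>"
    "<segment><source xml:lang='en'>First sentence.</source></segment>"
    "<segment><source xml:lang='en'>Second sentence.</source></segment>"
    "</segment-group>"
    "</trans-unit>"
)


def flatten_segments(trans_unit, joiner=" "):
    """Replace a <segment-group> child with one merged <source> element.

    Only plain-text sources are handled; inline codes and <target>
    elements are ignored in this sketch.
    """
    group = trans_unit.find("segment-group")
    if group is None:
        return trans_unit  # nothing to remove
    sources = group.findall("segment/source")
    merged = ET.Element("source")
    if sources:
        # Carry over attributes such as xml:lang from the first segment.
        merged.attrib.update(sources[0].attrib)
    # The joiner is an assumption: the proposal does not say how the
    # whitespace between segments would be preserved.
    merged.text = joiner.join(s.text or "" for s in sources)
    position = list(trans_unit).index(group)
    trans_unit.remove(group)
    trans_unit.insert(position, merged)
    return trans_unit


flat = flatten_segments(ET.fromstring(SEGMENTED))
print(ET.tostring(flat, encoding="unicode"))
```

A real implementation would also need a rule for re-pairing <bpt>/<ept> codes that were split across segments, which is the open question raised in the last bullet.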

I'd be very interested in hearing more of your thoughts on this topic.

Thank you again!

Magnus

-----Original Message-----

From: Yves Savourel [mailto:yves@opentag.com]

Sent: Tuesday, January 27, 2004 1:40 PM

To: Magnus Martikainen

Cc: Jochen Hummel; Tony Jewtushenko; John Reid

Subject: RE: Questions on XLIFF

Dear Magnus,

I don't think I've met you but I certainly know your name and I've heard

(good things) about you. So I'm thankful that you took the time to expose

your thoughts about XLIFF.

I've CCed two of the XLIFF TC members in this answer, thinking they may be

able to bring useful insights on this topic: Tony from Oracle (the TC Chair)

and John from Novell (developer using XLIFF).

I think there are different ways to approach the use of XLIFF documents in

translation tools, depending on how much of XLIFF's features are used. So

maybe a step-by-step approach could be considered.

--1- Simple Use

The first one is very simple and pretty much works already. For example, RWS

has been using TagEditor to translate XLIFF files for years now, and we

have--almost--no problems. The reason is that we currently use simple

content: some text inside <trans-unit>, no <alt-trans>. So the issue of

segmentation doesn't exist: TagEditor segments as it wants and we get back

clean files our filters can merge.

We do have to apply some workarounds to ensure that:

a) The translation goes into the <target> elements (so we make sure the

XLIFF document has a <target> element with the source text).

b) The <trans-unit> elements with the attribute translate='no' are protected

(so we add an additional <NTBT> tag inside any <target> with translate='no'

and use a DTD settings file where <NTBT> is protected).

It's not very pretty, but overall the process works.

--2- Segmenting

The real problems start when XLIFF contains <alt-trans> and/or when there is

an assumption from the producer of the XLIFF document that the translation tool

will look at the <trans-unit> elements as leverageable segments.

You wrote: "One of the main obstacles I see for XLIFF as a generic

interchange format for translatable content is how the format requires

segmentation to be applied at a filtering stage, without allowing it to be

changed later in the process."

I think you may be misled by the name of the element. An XLIFF <trans-unit>

does not pre-suppose any type of segmentation. So it is not quite correct to

say that the segmentation cannot be changed later: nothing prevents an XLIFF

document from being manipulated in any way, as long as it is returned to a form

that will be usable by its original filter. In other words, one could take

an extracted XLIFF document where the <trans-unit> elements contain

'paragraphs', run it through a utility that will use the segmentation engine

of a translation tool such as Trados' and "re-break" the <trans-unit>

element. This would give you the equivalent of a pre-segmented Trados RTF

file.

For example, an original extraction like this:

<trans-unit id='100'>

<source xml:lang='en'>First sentence. Second sentence.</source>

</trans-unit>

Could be transformed as something like this:

<group>

<trans-unit id='100-1'>

<source xml:lang='en'>First sentence. </source>

</trans-unit>

<trans-unit id='100-2'>

<source xml:lang='en'>Second sentence.</source>

</trans-unit>

</group>

Or you could also use the <mrk> inline element to do the same thing, like

this:

<trans-unit id='100'>

<source xml:lang='en'><mrk mtype='phrase'>First sentence. </mrk><mrk

mtype='phrase'>Second sentence.</mrk></source>

</trans-unit>
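
The re-breaking described above, and the merge back into a form the original filter can use, could be sketched roughly as follows in Python. The regular expression is only a naive stand-in for a real segmentation engine, and the 'id-n' numbering and the <group> wrapper simply mirror the example; none of this is prescribed by XLIFF 1.1. Only plain-text <source> content is handled.

```python
import re
import xml.etree.ElementTree as ET

ORIGINAL = (
    "<trans-unit id='100'>"
    "<source xml:lang='en'>First sentence. Second sentence.</source>"
    "</trans-unit>"
)


def naive_split(text):
    """Split after '.', '!' or '?' followed by whitespace.

    Trailing whitespace stays with the preceding piece, so joining the
    pieces reproduces the input exactly.
    """
    return re.findall(r".+?(?:[.!?]\s+|\Z)", text, flags=re.S)


def rebreak(trans_unit):
    """Turn one trans-unit into a <group> of sentence-level trans-units."""
    source = trans_unit.find("source")
    base_id = trans_unit.get("id")
    group = ET.Element("group")
    for n, piece in enumerate(naive_split(source.text or ""), start=1):
        unit = ET.SubElement(group, "trans-unit", id=f"{base_id}-{n}")
        part = ET.SubElement(unit, "source", dict(source.attrib))
        part.text = piece
    return group


def merge(group):
    """Rebuild the original trans-unit from a group produced by rebreak()."""
    units = group.findall("trans-unit")
    # Recover '100' from '100-1'; relies on the id convention used above.
    base_id = units[0].get("id").rsplit("-", 1)[0]
    merged = ET.Element("trans-unit", id=base_id)
    source = ET.SubElement(merged, "source", dict(units[0].find("source").attrib))
    source.text = "".join(u.find("source").text or "" for u in units)
    return merged


group = rebreak(ET.fromstring(ORIGINAL))
print(ET.tostring(group, encoding="unicode"))
print(ET.tostring(merge(group), encoding="unicode"))
```

Keeping the whitespace inside each piece is what makes the round trip lossless here; a tool that trims segments would need to record the separators somewhere else.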

The problem (currently) is that XLIFF does not address explicitly the topic

of segmentation at all. There are no guidelines or rules to tell the

extractor how to represent segments, or even what a segment is.

Will this be addressed in a future version of XLIFF? I sure hope so. There

are several possibilities:

- New elements or a different namespace inside the <source> and <target>

elements (currently non-XLIFF namespaces are not allowed there; you have to use

<mrk> for assigning specific information to runs of text).

- SRX could possibly be used at some level, although I'm not sure yet how this

would play into the picture.

- Some guidelines could also be set to use <group> and <trans-unit> in a

certain way to decompose a 'paragraph' into 'sentences', so any filter could

rebuild the original 'paragraph' and merge it back.

--3- The Core Issue

I think we could see the whole problem from a different angle. Some of XLIFF

requirements do not fit into a classic translation tool system because XLIFF

is not a source document, but more like the output of a content management

system.

In my view TM tools serve essentially two purposes: first, they avoid

retranslating something that was translated in a previous project. And

secondly, they allow an existing translation to be re-used when working on a new

text.

I think it's important to make a distinction between these two functions.

The first one is a fix for a problem that exists upstream in the process:

the fact that we are not able to know what has changed from one version of

the source document to the next, or at least not able to package it in a way

it's useable. But nowadays, this limitation is slowly disappearing with the

use of CMS. More and more the customer of translation knows what has changed

and does not need the TM tool to provide such a function. This said, TM tools

are still very useful because they still provide their second function:

reuse of existing text when translating new text.

So, maybe one way to approach XLIFF, is to see it just like a CMS:

<trans-unit> elements being holders of text objects (whatever their

granularity), <alt-trans> elements being existing translations of these text

objects.

Maybe the support of XLIFF could be done step by step:

First addressing the more general issues, which are not linked to CMS-type

problems and exist in other XML formats than XLIFF: a) support for taking

the source text from one place (<source>) and putting it in another

(<target>), and b) support for conditional translation (translate='no').

For example, how would I define DTD settings for this file? (a real-life

example):

<data type="text">images/cancel.gif</data>

</component>

<data type="text">Cancel</data>

</component>

</rsrc>

</dialogue>

Here only "Cancel" is to be localized. So the only efficient way to express what

is translatable would be by an XPath expression:

"//component[@type='caption']/data[@type='text']".
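
To make the XPath-based selection above concrete, here is a small Python sketch using the standard library's ElementTree, which supports this kind of attribute predicate. The sample document is a guess at the shape of the file: the element names follow the snippet, but type="image" on the first component is purely an assumption, since only the 'caption' type appears in the expression.

```python
import xml.etree.ElementTree as ET

# Assumed reconstruction of the resource file; the "image" type is hypothetical.
SAMPLE = """
<dialogue>
  <rsrc>
    <component type="image">
      <data type="text">images/cancel.gif</data>
    </component>
    <component type="caption">
      <data type="text">Cancel</data>
    </component>
  </rsrc>
</dialogue>
"""

# ElementTree paths are relative to an element, hence the leading '.'.
TRANSLATABLE = ".//component[@type='caption']/data[@type='text']"


def translatable_text(xml_text):
    """Return the text of every element selected as translatable."""
    root = ET.fromstring(xml_text)
    return [node.text for node in root.findall(TRANSLATABLE)]


print(translatable_text(SAMPLE))
```

Both <data> elements carry type="text", so a rule keyed on the element name and its own attributes alone cannot tell them apart; the parent's attribute is what distinguishes the caption from the image path.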

We run into such issues with many XML documents and have to create XSL

templates to work around TagEditor's limitations. So allowing the DTD

settings to be more flexible in that aspect would help not only in

supporting XLIFF, but more importantly in supporting many other XML formats.

Then, we could look at how <alt-trans> could be supported, and its effects

on how <trans-unit> elements should be arranged for such purpose.

So to go back to your original questions:

"Ideally a file filter should only need to distinguish translatable parts of

a file from non-translatable parts, and leave it at that. Segmentation

should be applied and managed (and possibly also changed) by other tools

later in the process, without affecting the ability of the file filter to

convert the segmented file back to native format."

"1) Is there a convenient and compatible way to support this in XLIFF 1.1?"

Answer: Currently XLIFF does not assume anything specific about

segmentation. So XLIFF filters can do this, and as far as the ones I know,

they actually do exactly that: just separate the text from the code.

"2) Are there already plans on extending the XLIFF standard in the future to

better support this?"

I certainly hope we will be able to come up with a mechanism to integrate

segmentation, a way of re-assembling segmented <trans-unit> elements, or

something of that order. And your collaboration would certainly be very

welcome.

To conclude, I'd like to underline that there are more and more cases now

where the granularity of the text to translate cannot be only driven by the

translator's workbench. With CMS we have to take into account the fact that

part of the traditional function of the TM tool can now be done at the

document authoring/management level, in some cases working with 'paragraph'

rather than sentence. I think that ultimately if we find a way to

reconcile those two concepts, the remaining problems of XLIFF integration

in translation tools will be solved.

I think Trados has more and more experience in working with CMS, so maybe

some of that knowledge could be used for XLIFF as well?

That's all I can think of for now.

Cheers,

-yves

________________________________

From: Magnus Martikainen [mailto:magnus@trados.com]

Sent: Monday, January 26, 2004 8:47 PM

To: Yves Savourel

Cc: Jochen Hummel

Subject: Questions on XLIFF

Hi Yves,

My apologies if you got another copy of this email - I accidentally hit the

wrong key while typing and it was sent before I had finished it. I tried to

recall it, but I may have been too late.

I don't think we have ever met in person, but I am well aware of your

extensive presence in the localisation industry. You may remember my name,

e.g. from the LISA ITS group. I am the Chief Software Architect at TRADOS.

Jochen Hummel suggested that I contact you directly with some questions

about XLIFF - I hope you don't mind?

I have been looking closely at the XLIFF 1.1 specification lately, amongst

other things in order to see how we can better support it as an interchange

format or even as a natively supported file format for TRADOS in the future.

One of the main obstacles I see for XLIFF as a generic interchange format

for translatable content is how the format requires segmentation to be

applied at a filtering stage, without allowing it to be changed later in the

process.

Let me explain:

Since all content in XLIFF must be stored inside translation units, a file

conversion tool that produces XLIFF output must decide where to introduce

the translation unit boundaries. While for some file formats there may be

natural breaks (e.g. in software files, which XLIFF seems to be concentrated

on) when dealing with larger volumes of running text (e.g. in documentation

and help files) the file conversion tool would have many options on how to

break the content into segments (e.g. based on tags, paragraphs, or

sentences).

Once the translation units have been introduced in the XLIFF file there is

no way to change the segmentation, as the process for assembling the XLIFF

body and skeleton is not at all governed by the standard, but is left

completely up to the file conversion tool. If an XLIFF translation unit were

changed into two translation units this would very likely break the

conversion of the translated XLIFF file back to the native file format.

Segmentation of text into sentences (this being the most common type of

segmentation used with translation memories) is a complex task that requires

sophisticated linguistically aware algorithms to produce good results.

Translation memory tools have over the years developed and fine-tuned these

algorithms for different source languages, and this is what has been used to

produce the translation memory content many companies have built up over

time. The only way to achieve maximum recycling against such translation

memories is to use the very same algorithms to identify sentences in the

content to be translated.

If the content to be translated resides in an XLIFF file, translation unit

boundaries have already been set by the file conversion tool, and cannot

easily be adapted to suit the translation memory. As it is unlikely that

the file conversion tool uses the exact same segmentation algorithm as the

translation memory this will lead to reduced translation memory recycling.

Even small differences in segmentation between the translation memory

content and the XLIFF files can lead to big costs.
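
The effect of such a mismatch on recycling can be illustrated with a toy Python sketch. The translation memory entries and the exact-match lookup below are invented for illustration only; real TM tools also use fuzzy matching, which softens the effect but does not remove it.

```python
# Toy translation memory keyed by sentence-level source segments.
# The entries are made-up sample data, not taken from any real project.
TM = {
    "Click OK.": "Klicken Sie auf OK.",
    "The file is saved.": "Die Datei wird gespeichert.",
}

# The same content as delivered by two hypothetical file conversion tools.
SENTENCE_LEVEL = ["Click OK. ", "The file is saved."]
PARAGRAPH_LEVEL = ["Click OK. The file is saved."]


def exact_match_rate(segments, tm):
    """Fraction of segments found verbatim in the TM (after trimming)."""
    if not segments:
        return 0.0
    hits = sum(1 for segment in segments if segment.strip() in tm)
    return hits / len(segments)


print(exact_match_rate(SENTENCE_LEVEL, TM))
print(exact_match_rate(PARAGRAPH_LEVEL, TM))
```

With sentence-level units every segment is found in the memory, while the paragraph-level unit finds nothing even though both of its sentences have already been translated.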

Further, as segmentation is generated by the file conversion tool, also

recycling between file formats, or even recycling within the same file

format when different file conversion tools have been used, can be seriously

affected.

As I see it the problem lies in that the notion of a translation unit is

enforced upon the content at a stage in the process long before it is known

what would be the most suitable segmentation for that content.

Ideally a file filter should only need to distinguish translatable parts of

a file from non-translatable parts, and leave it at that. Segmentation

should be applied and managed (and possibly also changed) by other tools

later in the process, without affecting the ability of the file filter to

convert the segmented file back to native format.

My questions:

1) Is there a convenient and compatible way to support this in XLIFF 1.1?

2) Are there already plans on extending the XLIFF standard in the future to

better support this?

Best regards,

Magnus Martikainen

Chief Software Architect

TRADOS Incorporated

1292 Hammerwood Ave.

Sunnyvale, CA 94089

Ph: +1-408-743 3564
