xliff message

Subject: RE: [xliff] Segmentation as core or not

From: David Walters <waltersd@us.ibm.com>
To: xliff@lists.oasis-open.org
Date: Thu, 3 Nov 2011 13:47:29 -0500

Here is a simple XML example.

<?xml version="1.0" encoding="utf-8"?>
<document>
  <title>This is my document title</title>
  <body id="one" short="Document's short description">
    <para num="first">This document describes how the user is to use product [product]. The first
step is to press the <bi>start</bi> button; there are no other actions.
    </para>
  </body>
</document>

Say this format is unique to this product, so no translation tool supports it. The developer has been told to provide only XLIFF files for globalization purposes. He does not know about terminology, word counting, segmentation, etc. He only knows which pieces of text are translatable. The XLIFF 1.2 file he would probably generate would be:

<?xml version="1.0" encoding="utf-8"?>
<xliff version="1.2" xml:lang="EN">
  <file source-language="EN" datatype="plaintext" original="file.xml">
    <header></header>
    <body>
      <trans-unit id="1">
        <source>This is my document title</source>
      </trans-unit>
      <trans-unit id="2">
        <source>Document's short description</source>
      </trans-unit>
      <trans-unit id="3">
        <source>This document describes how the user is to use product <x id="1"/>. The first
step is to press the <g id="2">start</g> button; there are no other actions.
        </source>
      </trans-unit>
    </body>
  </file>
</xliff>

This simple XLIFF file should not contain segmentation information because some tools may not care about segmentation. For example, a program for term extraction, spell checking, word counting, or grammar checking only cares about the readable text.
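
A minimal sketch of that point, assuming the unsegmented file shown above (without an XML namespace, as in the sample) and a whitespace-based notion of a "word". The function names readable_text and word_counts are hypothetical, and a real counting tool would apply its own rules. It shows that a word counter only needs the text of each <source>, not any segment boundaries:

```python
import xml.etree.ElementTree as ET

# The unsegmented XLIFF 1.2 sample from above (namespace omitted, as in the sample).
XLIFF = """<xliff version="1.2" xml:lang="EN">
  <file source-language="EN" datatype="plaintext" original="file.xml">
    <header></header>
    <body>
      <trans-unit id="1">
        <source>This is my document title</source>
      </trans-unit>
      <trans-unit id="2">
        <source>Document's short description</source>
      </trans-unit>
      <trans-unit id="3">
        <source>This document describes how the user is to use product <x id="1"/>. The first
step is to press the <g id="2">start</g> button; there are no other actions.
</source>
      </trans-unit>
    </body>
  </file>
</xliff>"""


def readable_text(source):
    """Return the text of a <source> element with inline markup removed.

    itertext() keeps the text inside <g> and contributes nothing for an
    empty placeholder such as <x/>; whitespace runs are collapsed.
    """
    return " ".join("".join(source.itertext()).split())


def word_counts(xliff_text):
    """Map each trans-unit id to a naive whitespace-based word count."""
    root = ET.fromstring(xliff_text)
    return {
        unit.get("id"): len(readable_text(unit.find("source")).split())
        for unit in root.iter("trans-unit")
    }


counts = word_counts(XLIFF)
print(counts)
```

Note that the placeholder <x id="1"/> leaves a lone "." behind, which this naive count treats as a word; how to treat inline codes is a policy choice for the counting tool, not something the XLIFF file has to decide.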

In my opinion, the XLIFF "core" elements should be the minimum set of elements which are required to extract the source text from the original file format in such a way that the source text can be replaced by the translated text in the original file, and the translated file will be usable by the product. This would include:

- Identify that the XML file contains XLIFF information.
- Identify the source file from which the text was extracted and its attributes, like file name, source language, original file format, etc.
- Identify each contiguous block of text based on the source file's formatting rules.
- Identify non-translatable inline items which are embedded in the text.
- Identify text formatting requirements, like text on a single line, reflowable, maximum length, etc.

Any XLIFF elements or attributes which are needed for a specific application's use would be placed in a "module".

David

Corporate Globalization Tool Development
EMail: waltersd@us.ibm.com
Phone: (507) 253-7278, T/L:553-7278, Fax: (507) 253-1721

CHKPII: http://w3-03.ibm.com/globalization/page/2011
TM file formats: http://w3-03.ibm.com/globalization/page/2083
TM markups: http://w3-03.ibm.com/globalization/page/2071

From:	"Rodolfo M. Raya" <rmraya@maxprograms.com>
To:	<xliff@lists.oasis-open.org>
Date:	11/02/2011 12:09 PM
Subject:	RE: [xliff] Segmentation as core or not
Sent by:	<xliff@lists.oasis-open.org>

Hi Helena,

There is some confusion in terminology. Changing the element name to <part> helps with visualization but doesn’t solve the issue at hand.

An XLIFF file is a container for text extracted for localization. If there isn’t text to localize, there is no XLIFF because there is nothing to Interchange (the “L” and “I” in XLIFF are failing).

In many cases, the text extracted for localization needs to be further partitioned to facilitate the translation process. There are cases in which translators prefer to translate paragraphs of text because it produces better translations. In other cases (probably the majority of cases), translators prefer to translate sentences because it facilitates TM matching and translation reuse. The process of splitting extracted text into sentences is known as “segmentation”.

The issue listed in the wiki related to segmentation deals with division of extracted text into “segments” and rearrangement of the segmented text when the boundaries detected by an automated process are not suitable according to the preferences of the translator.

Segmentation can be done during text extraction, when the XLIFF file is created, or in a second pass after the XLIFF has been created. Segmentation also happens at translation time when translators merge or split existing segments.
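
A rough sketch of the second-pass and translation-time cases, assuming a deliberately naive rule (break after ".", "!" or "?" followed by whitespace). The function names segment and merge are hypothetical, and real tools use far richer rules:

```python
import re


def segment(text):
    """Naive second-pass segmentation: break after ., ! or ? followed by whitespace."""
    pieces = re.split(r"(?<=[.!?])\s+", text.strip())
    return [piece for piece in pieces if piece]


def merge(segments, index):
    """Join segment `index` with the one after it, as a translator might do."""
    merged = segments[index] + " " + segments[index + 1]
    return segments[:index] + [merged] + segments[index + 2:]


extracted = "Sentence one. Sentence two. Sentence three."
segments = segment(extracted)
print(segments)
print(merge(segments, 0))
```

The extracted text is the same in both steps; only the boundaries move, which is why the container for the text and the record of how it was split can be treated as separate questions.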

An XLIFF file must have containers for the extracted text. Having those containers is not a “feature”, it is a necessity. Being able to split the text and store the “segments”, “parts” or “fragments” in the same XLIFF can be viewed as a feature that may be qualified as “core” or “module”.

The proposal currently in the wiki doesn’t make it easy to differentiate between text that has been “extracted” and text that has been “extracted and segmented”. If we had a clear distinction between just extracted and segmented, we would be able to tell if the segmentation process and its result belong to the “core” or “module” category.

When segmentation is done while the XLIFF file is being generated, each segment can be represented as a unit for translation. That was the original way of working with XLIFF 1.0 and 1.1. In XLIFF 1.2 the notion of representing segmentation in the XLIFF document was introduced.

Working with XLIFF 1.2 you can have a segmented file with each <trans-unit> containing one segment or you can have files that contain multiple segments in a <trans-unit> element, each of them enclosed in special markup designed with a combination of <seg-source> and <mrk> elements.
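
To illustrate the two XLIFF 1.2 shapes, here is a small sketch that reads both. It assumes namespace-free fragments for brevity (a real XLIFF 1.2 file declares the XLIFF namespace), and segments_of is a hypothetical helper name:

```python
import xml.etree.ElementTree as ET

# One <trans-unit> holding unsegmented text.
UNSEGMENTED = """<trans-unit id="1">
  <source>Sentence one. Sentence two.</source>
</trans-unit>"""

# The same unit with segmentation recorded via <seg-source> and <mrk mtype="seg">.
SEGMENTED = """<trans-unit id="1">
  <source>Sentence one. Sentence two.</source>
  <seg-source><mrk mtype="seg" mid="1">Sentence one.</mrk> <mrk mtype="seg" mid="2">Sentence two.</mrk></seg-source>
</trans-unit>"""


def segments_of(trans_unit_xml):
    """Return the segments of a trans-unit as a list of strings.

    Without <seg-source>, the whole <source> is treated as a single segment.
    """
    unit = ET.fromstring(trans_unit_xml)
    seg_source = unit.find("seg-source")
    if seg_source is None:
        return ["".join(unit.find("source").itertext())]
    return [
        "".join(mrk.itertext())
        for mrk in seg_source.iter("mrk")
        if mrk.get("mtype") == "seg"
    ]


print(segments_of(UNSEGMENTED))
print(segments_of(SEGMENTED))
```

A consumer has to handle both shapes, and the absence of <seg-source> alone does not say whether the text was left unsegmented on purpose or was already extracted one sentence per unit.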

The model for representing segmentation introduced in XLIFF 1.2 has several problems that must be fixed in XLIFF 2.0.

The proposal for using <unit>, <segment> and <ignorable> that we have in the current draft of the XLIFF schema allows representing segmentation. The problem with the schema is that it does not tell you if the text contained in the XLIFF file has been just extracted or extracted and segmented.

The work you did with Yves in the wiki helps in understanding the status of the extracted text. With the attributes, elements and processing expectations you designed it is possible to know if the text has been segmented, if further segmentation is allowed and what restrictions apply. It’s a very nice design.

The discussion is about the qualification of your work. Is it essential or is it optional? If essential, that’s a “core” feature and the elements and attributes used should be in the main XML Schema and documented as an integral part of XLIFF. If representing segmentation is an optional goal, then those elements and attributes should live in a separate optional XML Schema (a “module”) and be documented in an annex of the specification or in a separate guideline.

In my personal opinion, representing segmentation as was designed should be a required part of the XLIFF 2.0 standard. I would call it a “core” feature.

Regards,
Rodolfo
--
Rodolfo M. Raya rmraya@maxprograms.com
Maxprograms http://www.maxprograms.com

From: xliff@lists.oasis-open.org [mailto:xliff@lists.oasis-open.org] On Behalf Of Helena S Chapman
Sent: Wednesday, November 02, 2011 12:07 PM
To: Yves Savourel
Cc: xliff@lists.oasis-open.org
Subject: RE: [xliff] Segmentation as core or not

It almost reads as if what the localization industry is used to calling a "segment" is really a "partition". Basically something that has been cut and classified but could be further divided or broken off into finer fragments? Since I have only been involved in localization topics for the last 3-4 years, I am probably close to having un-tainted eyes.

To me, a segment in the localization world is something that usually has something to do with payment. That is, even if one is paying for a service by the word, the cost of each word can still be determined by the complexity of a segment (e.g. length).

From: Yves Savourel <ysavourel@enlaso.com>
To: Helena S Chapman/San Jose/IBM@IBMUS
Cc: <xliff@lists.oasis-open.org>
Date: 11/01/2011 11:02 PM
Subject: RE: [xliff] Segmentation as core or not

Hi Helena,

I guess theoretically it would be possible to have an entire chapter in one “part”. But the extraction tools would not likely do that. Even when there is no sentence-based segmentation the extractors do break down the content into much smaller parts; typically the equivalent of paragraphs for document-type files, or strings for UI-type files.

Actually quite a few tools, especially for software, don’t go beyond that type of segmentation. Look at many tools for PO files or Java properties files, for example: their entries are not often sentence-segmented. And they create TMX files where the entries are called “segments”.

Others may correct me, but I think calling those extracted parts “segments” is simply a relatively common practice.

Personally I think the important thing is to be very clear on what those “parts” are, regardless of what we end up calling the elements. That said, we should obviously pick a name that is not too confusing.
It seems “segment” has been used for a while to mean both the container of something un-segmented and segmented (see for example TMX’s <seg>), but maybe I’ve been too deep in TMX/XLIFF/etc. for too long to see the world with un-tainted eyes :)

Hope this helps,
-yves

From: Helena S Chapman [mailto:hchapman@us.ibm.com]
Sent: Tuesday, November 01, 2011 7:52 PM
To: Yves Savourel
Cc: xliff@lists.oasis-open.org
Subject: Re: [xliff] Segmentation as core or not

Yves, I want to make sure I understand your viewpoint. Based on what you suggested, is it possible to have an entire chapter or book as a single *part* when passing it around in an XLIFF file? If so, why call it a segment?

<unit id='1'>
<part>
<source>Sentence one. Sentence two. Sentence three. .... Sentence two thousand and forty five.</source>
</part>
</unit>

Best regards,

Helena Shih Chapman
Globalization Technologies and Architecture
+1-720-396-6323 or T/L 938-6323
Waltham, Massachusetts

From: Yves Savourel <ysavourel@enlaso.com>
To: <xliff@lists.oasis-open.org>
Date: 11/01/2011 04:56 PM
Subject: [xliff] Segmentation as core or not
Sent by: <xliff@lists.oasis-open.org>

Hi all,

To continue on the discussion whether the "segmentation" feature is core or not:

I think Dave has an obviously valid point when saying that segmentation is not necessarily done at the time of the extraction, and therefore we could have un-segmented XLIFF.

But to me a "segment" is not necessarily the result of a segmentation process; it can be a "block" extracted from the original format (as our definition states: http://wiki.oasis-open.org/xliff/OneContentModel#Definitions.2BAC8-Terminology).
So each un-segmented entry is, by nature, a segment that may simply contain several sentences.

Maybe things would be clearer if we think about the element <segment> as a "part" rather than a "segment"? The Segmentation representation addresses how to organize and manipulate such parts.

<unit id='1'>
<part>
<source>Sentence one. Sentence two.</source>
</part>
</unit>

<unit id='1'>
<part>
<source>Sentence one. </source>
</part>
<part>
<source> Sentence two.</source>
</part>
</unit>

Maybe, viewed from that angle, it's clearer that such an element needs to be part of the core?

Cheers,
-ys

