Subject: RE: [xliff] Segmentation as core or not

Hi Helena,


There is a confusion in terminology. Changing the element name to <part> helps in visualization but doesn’t solve the issue at hand.


An XLIFF file is a container for text extracted for localization. If there isn’t text to localize, there is no XLIFF because there is nothing to Interchange (the “L” and “I” in XLIFF are failing).


In many cases, the text extracted for localization needs to be further partitioned to facilitate the translation process. There are cases in which translators prefer to translate paragraphs of text because it produces better translations. In other cases (probably the majority of cases), translators prefer to translate sentences because it facilitates TM matching and translation reuse. The process of splitting extracted text into sentences is known as “segmentation”.


The issue listed in the wiki related to segmentation deals with division of extracted text into “segments” and rearrangement of the segmented text when the boundaries detected by an automated process are not suitable according to the preferences of the translator.


Segmentation can be done during text extraction, when the XLIFF file is created, or in a second pass after the XLIFF has been created. Segmentation also happens at translation time when translators merge or split existing segments.


An XLIFF file must have containers for the extracted text. Having those containers is not a “feature”, it is a necessity. Being able to split the text and store the “segments”, “parts” or “fragments” in the same XLIFF can be viewed as a feature that may be qualified as “core” or “module”.


The proposal currently in the wiki doesn’t make it easy to differentiate between text that has been “extracted” and text that has been “extracted and segmented”. If we had a clear distinction between text that is merely extracted and text that is also segmented, we would be able to tell whether the segmentation process and its result belong to the “core” or “module” category.


When segmentation is done while the XLIFF file is being generated, each segment can be represented as a unit for translation. That was the original way of working with XLIFF 1.0 and 1.1. In XLIFF 1.2 the notion of representing segmentation in the XLIFF document was introduced.


Working with XLIFF 1.2, you can have a segmented file in which each <trans-unit> contains one segment, or you can have files that contain multiple segments in a single <trans-unit> element, each segment enclosed in special markup built from a combination of <seg-source> and <mrk> elements.
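
A minimal sketch of the two XLIFF 1.2 forms, assuming a paragraph of two sentences. The ids and the sentence text are illustrative, and the fragment omits the enclosing <file> and <body> elements:

```xml
<!-- Form 1: segmentation done at extraction time, one segment per <trans-unit> -->
<trans-unit id="1">
  <source>Sentence one.</source>
</trans-unit>
<trans-unit id="2">
  <source>Sentence two.</source>
</trans-unit>

<!-- Form 2: one <trans-unit> holding the whole paragraph, with the
     segment boundaries recorded in <seg-source> using <mrk mtype="seg"> -->
<trans-unit id="3">
  <source>Sentence one. Sentence two.</source>
  <seg-source><mrk mtype="seg" mid="1">Sentence one.</mrk> <mrk mtype="seg" mid="2">Sentence two.</mrk></seg-source>
</trans-unit>
```

In the second form the unsegmented <source> is kept alongside the segmented copy, so a tool has to keep the two in sync.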


The model for representing segmentation introduced in XLIFF 1.2 has several problems that must be fixed in XLIFF 2.0.


The proposal for using <unit>, <segment> and <ignorable> that we have in the current draft of the XLIFF schema allows representing segmentation. The problem with the schema is that it does not tell you whether the text contained in the XLIFF file has only been extracted or has been extracted and segmented.
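
For illustration only, this is roughly how a two-sentence unit could look with those elements. The nesting of <source> inside <segment> and <ignorable>, and the use of <ignorable> for the space between sentences, are assumptions about the draft rather than text copied from it:

```xml
<unit id="1">
  <segment>
    <source>Sentence one.</source>
  </segment>
  <!-- Assumed usage: content between segments that needs no translation -->
  <ignorable>
    <source> </source>
  </ignorable>
  <segment>
    <source>Sentence two.</source>
  </segment>
</unit>
```

Nothing in such a fragment says whether a unit holding a single <segment> was deliberately left unsegmented or simply never went through a segmentation step, which is the gap in the schema.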


The work you did with Yves in the wiki helps in understanding the status of the extracted text. With the attributes, elements and processing expectations you designed it is possible to know if the text has been segmented, if further segmentation is allowed and what restrictions apply. It’s a very nice design.


The discussion is about the qualification of your work. Is it essential or is it optional? If essential, that’s a “core” feature, and the elements and attributes used should be in the main XML Schema and documented as an integral part of XLIFF. If representing segmentation is an optional goal, then those elements and attributes should live in a separate optional XML Schema (a “module”) and be documented in an annex of the specification or in a separate guideline.


In my personal opinion, representing segmentation as designed should be a required part of the XLIFF 2.0 standard. I would call it a “core” feature.




Rodolfo M. Raya       rmraya@maxprograms.com


From: xliff@lists.oasis-open.org [mailto:xliff@lists.oasis-open.org] On Behalf Of Helena S Chapman
Sent: Wednesday, November 02, 2011 12:07 PM
To: Yves Savourel
Cc: xliff@lists.oasis-open.org
Subject: RE: [xliff] Segmentation as core or not


It almost reads as if what the localization industry is used to calling a "segment" is really a "partition". Basically something that has been cut and classified but could be further divided or broken off into finer fragments? Since I have only been involved in localization topics for the last 3-4 years, I probably come close to having un-tainted eyes.

To me, a segment in the localization world is something that usually has to do with payment. That is, even if one is paying for a service by the word, the cost of each word can still be determined by the complexity of a segment (e.g. its length).

From:        Yves Savourel <ysavourel@enlaso.com>
To:        Helena S Chapman/San Jose/IBM@IBMUS
Cc:        <xliff@lists.oasis-open.org>
Date:        11/01/2011 11:02 PM
Subject:        RE: [xliff] Segmentation as core or not

Hi Helena,
I guess theoretically it would be possible to have an entire chapter in one “part”. But the extraction tools would not likely do that. Even when there is no sentence-based segmentation, the extractors do break down the content into much smaller parts; typically the equivalent of paragraphs for document-type files, or strings for UI-type files.
Actually quite a few tools, especially for software, don’t go beyond that type of segmentation. If you look at many tools for PO files, or Java properties files for example, their entries are not often sentence-segmented. And they create TMX files where the entries are called “segments”.
Others may correct me, but I think calling those extracted parts “segments” is simply a relatively common practice.
Personally I think the important thing is to be very clear on what those “parts” are, regardless of how we end up calling the elements. That said, we should obviously pick a name that is not too confusing.
It seems “segment” has been used for a while to mean both the container of something un-segmented and segmented (see for example TMX’s <seg>), but maybe I’ve been too deep in TMX/XLIFF/etc. for too long to see the world with un-tainted eyes :)
Hope this helps,
From: Helena S Chapman [mailto:hchapman@us.ibm.com]
Sent: Tuesday, November 01, 2011 7:52 PM
To: Yves Savourel
Subject: Re: [xliff] Segmentation as core or not

Yves, I want to make sure I understand your viewpoint. Based on what you suggested, is it possible for one to have an entire chapter or book as a single *part* when passing it around in an XLIFF file? If so, why call it a segment?

<unit id='1'>
 <segment>
  <source>Sentence one. Sentence two. Sentence three. .... Sentence two thousand and forty five.</source>
 </segment>
</unit>

Best regards,

Helena Shih Chapman
Globalization Technologies and Architecture
+1-720-396-6323 or T/L 938-6323
Waltham, Massachusetts

From:        Yves Savourel <ysavourel@enlaso.com>
Date:        11/01/2011 04:56 PM
Subject:        [xliff] Segmentation as core or not


Hi all,

To continue on the discussion whether the "segmentation" feature is core or not:

I think Dave has an obviously valid point when saying that segmentation is not necessarily done at the time of the extraction, and therefore we could have un-segmented XLIFF.

But to me a "segment" is not necessarily the result of a segmentation process; it can be a "block" extracted from the original format, as our definition states.
So each un-segmented entry is, by nature, a segment that simply contains potentially several sentences.

Maybe things would be clearer if we think about the element <segment> as a "part" rather than a "segment"? The Segmentation representation addresses how to organize and manipulate such parts.

<unit id='1'>
 <segment>
  <source>Sentence one. Sentence two.</source>
 </segment>
</unit>

<unit id='1'>
 <segment>
  <source>Sentence one. </source>
 </segment>
 <segment>
  <source> Sentence two.</source>
 </segment>
</unit>

Maybe, viewed from that angle, it's clearer that such an element needs to be part of the core?

