OASIS Mailing List Archives

xliff message


Subject: RE: [xliff] Segmentation as core or not

Hi Helena:
The minimum unit of text that you translate is called a "segment" in the translation industry.
A "segment" could be a complete novel, a chapter, a paragraph, a sentence or just a bunch of words.
When text is extracted from a file and placed in an XLIFF file, it is stored in elements that contain "segments". Large segments, small segments, we don't care.
Some tools create segments that must be further processed and split into smaller pieces at a later time. For example, some tools extract paragraphs, put them in the XLIFF file and then, in a second pass, split these paragraphs into sentences.
Other tools create segments of the size selected by the end user before the XLIFF file is created. My tools, for example, ask the user whether paragraph-level or sentence-level segmentation is desired, then extract the text, partition it at sentence level if the user wants that, and finally create the XLIFF file.
Whatever is stored in an XLIFF file when it is created is already segmented at some level. The original segmentation may or may not need to be adjusted.
So, as you said, there are two different processes to consider:
1) Extraction of translatable text and placement in the XLIFF file, forming the original segments
2) Optional refinement of original segments to fit the partitioning desired by the end user
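
As a rough illustration of the two tasks, here is a minimal Python sketch. It assumes the element names from the schema draft (<unit>, <segment>, <ignorable>, <source>) without namespaces; the helper names extract_unit and resegment and the naive punctuation-based sentence rule are hypothetical and only serve the example.

```python
import re
import xml.etree.ElementTree as ET


def extract_unit(unit_id, paragraph):
    """Task 1: store an extracted paragraph as one original segment."""
    unit = ET.Element("unit", {"id": unit_id})
    segment = ET.SubElement(unit, "segment")
    ET.SubElement(segment, "source").text = paragraph
    return unit


def resegment(unit):
    """Task 2: split each segment into sentences.

    Whitespace between sentences goes into <ignorable> so that no
    character of the original text is lost. The sentence rule used here
    (whitespace after '.', '!' or '?') is a deliberately naive assumption.
    """
    new_unit = ET.Element("unit", dict(unit.attrib))
    for child in unit:
        text = child.findtext("source", default="")
        if child.tag == "segment":
            # The capturing group keeps the separating whitespace in the result.
            parts = re.split(r"(?<=[.!?])(\s+)", text)
        else:
            parts = [text]
        for part in parts:
            if not part:
                continue
            is_ignorable = child.tag == "ignorable" or part.isspace()
            holder = ET.SubElement(new_unit, "ignorable" if is_ignorable else "segment")
            ET.SubElement(holder, "source").text = part
    return new_unit


if __name__ == "__main__":
    unit = extract_unit("1", "Sentence one. Sentence two.")
    print(ET.tostring(resegment(unit), encoding="unicode"))
```

The same containers are used before and after the second pass; only the number of <segment> and <ignorable> children changes.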

Task 1) needs to be supported by XLIFF core and the required structure must be placed in the main specification.
Task 2) can be done using the same elements and attributes used in task 1) but may need additional elements and attributes. Those additional elements and attributes could be considered a module.
So far I see many TC members confusing task 2), which actually could be called "re-segmentation", with the segmentation that naturally happens when an XLIFF file is created in task 1).
The minimum set of elements that we need for task 1) is already included in the schema draft. These elements are: <file>, <unit>, <source>, <target> and the inline elements to be added (we have a good set being defined).
To avoid the problems that exist today in XLIFF 1.2, I added 2 elements that I consider important: <segment> and <ignorable>.
Having <segment> and <ignorable> as defined in the schema allows us to change segmentation at any time, without affecting conversion of the XLIFF file back to the original format when the translation is completed. This is a big step toward compatibility.
If we put the layer that allows adjusting segmentation (<segment> and <ignorable> for the time being) in an optional module, we will be repeating the mistake made in XLIFF 1.2 when <seg-source> was introduced as an optional layer. It will be necessary to duplicate the source text in the translation unit in order to know what is translatable and what isn't. Current compatibility issues will persist.
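
To make the point concrete, the following Python sketch shows why segmentation can change without affecting the conversion back: joining the <source> text of all <segment> and <ignorable> children, in document order, yields the same unit text whichever way the unit is split. The two sample units and the function name unit_source_text are assumptions made for the example, and inline elements are left out for simplicity.

```python
import xml.etree.ElementTree as ET

# The same content stored as one paragraph-level segment...
COARSE = (
    '<unit id="1">'
    "<segment><source>Sentence one. Sentence two.</source></segment>"
    "</unit>"
)

# ...and re-segmented at sentence level, with the space kept in <ignorable>.
FINE = (
    '<unit id="1">'
    "<segment><source>Sentence one.</source></segment>"
    "<ignorable><source> </source></ignorable>"
    "<segment><source>Sentence two.</source></segment>"
    "</unit>"
)


def unit_source_text(unit_xml):
    """Rebuild the full source text of a unit from its children, in order."""
    unit = ET.fromstring(unit_xml)
    return "".join(
        child.findtext("source", default="")
        for child in unit
        if child.tag in ("segment", "ignorable")
    )


if __name__ == "__main__":
    print(unit_source_text(COARSE) == unit_source_text(FINE))
```

A tool that merges the file back only needs this ordered concatenation, so it does not have to know when or how the unit was segmented.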
Unfortunately, we have to stop here and define what we want to do before moving on. We cannot wait until the semantics of core and module have been defined.
I cannot continue working on the schema and specification if the TC doesn't know if the proposed element tree is OK or not.
I need to know if this basic structure is accepted by all members (or at least the majority) before moving on:
 - An XLIFF document contains one or more <file> elements, each containing one or more portions of translatable text stored in <unit> elements.
 - A <unit> contains one or more <segment> elements and zero or more <ignorable> elements.
 - Translatable text is stored in one <source> element contained in each <segment> element, and its corresponding translation is stored in a sibling <target> element.
 - Portions of text that should not be translated but must remain in the XLIFF file are stored in one <source> element contained in each <ignorable> element.
 - Optional translations for the <source> child of <ignorable> can be stored in sibling <target> elements.
The above list summarizes what we have so far in the schema draft. If it is not accepted, we need to restart our work on XLIFF 2.0 introducing a new design.
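
For reference, the structure listed above can be expressed as a small check. This is only a sketch in Python, not the schema itself: the function name check_structure is hypothetical, namespaces are ignored, the name of the root element is not checked, and <unit> elements are looked up at any depth inside <file> because the list does not say whether intermediate grouping is allowed.

```python
import xml.etree.ElementTree as ET


def check_structure(xliff_xml):
    """Return a list of violations of the proposed element tree (empty if none)."""
    root = ET.fromstring(xliff_xml)
    problems = []
    files = root.findall("file")
    if not files:
        problems.append("document has no <file>")
    for file_el in files:
        units = list(file_el.iter("unit"))
        if not units:
            problems.append("<file> has no <unit>")
        for unit in units:
            uid = unit.get("id", "?")
            # A unit needs at least one <segment>; <ignorable> is optional.
            if unit.find("segment") is None:
                problems.append(f"unit {uid}: no <segment>")
            for part in unit:
                if part.tag not in ("segment", "ignorable"):
                    continue
                # Exactly one <source>, at most one sibling <target>.
                if len(part.findall("source")) != 1:
                    problems.append(f"unit {uid}: <{part.tag}> needs exactly one <source>")
                if len(part.findall("target")) > 1:
                    problems.append(f"unit {uid}: <{part.tag}> has more than one <target>")
    return problems


GOOD = (
    "<xliff><file><unit id='1'>"
    "<segment><source>Hello.</source><target>Hola.</target></segment>"
    "<ignorable><source> </source></ignorable>"
    "</unit></file></xliff>"
)

BAD = (
    "<xliff><file><unit id='2'>"
    "<ignorable><source> </source></ignorable>"
    "</unit></file></xliff>"
)

if __name__ == "__main__":
    print(check_structure(GOOD))
    print(check_structure(BAD))
```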
Rodolfo M. Raya
Maxprograms http://www.maxprograms.com
-------- Original Message --------
Subject: RE: [xliff] Segmentation as core or not
From: Helena S Chapman <hchapman@us.ibm.com>
Date: Tue, November 08, 2011 1:12 pm
To: "Rodolfo M. Raya" <rmraya@maxprograms.com>
Cc: xliff@lists.oasis-open.org

Rodolfo, you brought up an interesting point: "To apply segmentation process to an already existing XLIFF file is an optional task. Recording that such task has been performed is the optional part. For the process to be possible, the text must already be in the XLIFF file and it has to be in some containers."

I believe we are talking about two very distinct process activities here: 1. partition content into parts (core) 2. refine the definition of #1 into segments (module)

I agree any existing XLIFF file will already include "parts" of content. How these parts were defined, and by what tools, is something the module can then define. For example, one might expect metadata about what the parts mean according to another standard or non-standard definition: word vs. sentence according to UAX#29, or paragraph vs. chapter based on Acme Translation Agency Inc.'s internal definition. The latter is what Steven is referring to as logging.

We definitely should rethink the taxonomy of what we call "segmentation" today. Note that I didn't use the word "terminology" to further pollute the conversation.

Best regards,

Helena Shih Chapman
Globalization Technologies and Architecture
+1-720-396-6323 or T/L 938-6323
Waltham, Massachusetts

From:        "Rodolfo M. Raya" <rmraya@maxprograms.com>
To:        <xliff@lists.oasis-open.org>
Date:        11/08/2011 04:37 AM
Subject:        RE: [xliff] Segmentation as core or not
Sent by:        <xliff@lists.oasis-open.org>


I think there is a huge confusion between the segmentation process and storing segments in XLIFF.

Text extracted for translation and stored in an XLIFF file needs to be stored in some elements that act as containers. If XLIFF doesn't have containers for holding localizable text, then the localizable text can't be exchanged and the "L" and "I" fail in the XLIFF acronym.

Extracted text can be segmented before the XLIFF file is created (my tools have been doing this for years) or after the XLIFF file has been created. A tool processing XLIFF files should not care about when segmentation was done. Moreover, the segmentation process is completely optional.

Applying a segmentation process to an already existing XLIFF file is an optional task. Recording that such a task has been performed is the optional part. For the process to be possible, the text must already be in the XLIFF file and it has to be in some containers.

Storing translatable text in XLIFF files is not optional. Elements for holding that text are required and elements for holding the translations of that text are also an integral part of XLIFF.

What we have so far in the XLIFF schema draft is a set of elements and attributes for holding translatable text and its translations.

In the schema we don't have information that indicates how and when the segmentation process occurred.

In the wiki we have a proposal for decorating the current schema draft with elements and attributes containing information about the segmentation process. The proposal in the wiki augments the scope of the basic elements already present in the schema draft by adding attributes and processing expectations to elements that must be present in any XLIFF file.

Although some attributes mentioned in the segmentation section in the wiki are not really necessary when an XLIFF file is created, the elements in which they appear are absolutely necessary. We can't document an element as part of the "core" schema and leave some of its attributes as optional in a separate "module".

Minimalism is a fancy trend. I like it very much and see it useful in some cases. We should not try to apply minimalism to the concept of XLIFF core; this would be a mistake as big as the mistake in XLIFF 1.2 that enabled custom extensions everywhere.

Balance is important.

Rodolfo M. Raya       rmraya@maxprograms.com

> -----Original Message-----
> From: xliff@lists.oasis-open.org [mailto:xliff@lists.oasis-open.org] On Behalf
> Of Yves Savourel
> Sent: Tuesday, November 08, 2011 3:24 AM
> To: xliff@lists.oasis-open.org
> Subject: RE: [xliff] Segmentation as core or not
> Hi Steven, all,
> > We discussed this a little bit in IBM today.
> > Our view would still be that segmentation does not need to be in core
> > for interchange.
> I think most (hopefully all) of us would agree that one important
> criterion for an optional module is that it does not prevent tools
> that implement only the core from working properly.
> So if the representation of sentence segmentation is optional, it should not
> prevent a tool XYZ, which understands only the core elements, from working.
> The question then is: how can tool XYZ work with a sentence-segmented
> file without knowing about <segment>?
> <unit id='1'>
>  <segment>
>   <source>Sentence one. </source>
>  </segment>
>  <segment>
>   <source>Sentence two.</source>
>  </segment>
> </unit>
> I don't think it can.
> The only way it could would be if a unit stored two copies of the same
> content: one not sentence-segmented, and the other reserved for the
> tools that implement the optional segmentation representation
> module.
> Needless to say, this would result in a slew of troubles: Where does tool ABC
> (which implements segmentation) put its translation? How can tool XYZ (which
> does not implement segmentation) access it? How do we resolve
> differences in the source? Where do we put segment status? etc. Basically it's all
> the problems of 1.2 all over again. In 1.2 we had no choice because we
> needed to be backward compatible. But in 2.0 we can have a clean way of
> dealing with segments.
> So far, the only rationale I've heard for making <segment> optional is the
> argument that segmentation is a different process and therefore should not
> be part of the core. But I think we have seen that segmentation in general is
> broader than sentence-segmentation and clearly happens also during
> extraction (see the example with ITS <withinTextRule/>), so that rationale
> doesn't really hold true.
> But maybe I'm missing other things: what are the advantages of keeping the
> segmentation representation optional?
> Cheers,
> -yves
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: xliff-unsubscribe@lists.oasis-open.org
> For additional commands, e-mail: xliff-help@lists.oasis-open.org
