Subject: RE: [xliff] Segmentation as core or not


Hi Yves,

I never thought about extracting the text from one format and creating an XLIFF file as a form of segmentation, but I guess you are dividing the source into parts. I agree that "segment" and "segmentation" are overloaded words which have evolved to mean different things to different people, so I think we must be consistent in the words we use.

One of the definitions I found for the word "extract" was "to take or copy out (matter), as from a book". So I prefer to think of "extraction" as the process of taking the text out of the original source format, modifying that text so that it is compatible with a new output format (i.e. replacing inline items with XLIFF inline elements), and creating a new output file. A set of guidelines could be developed on how to extract the text. Does the XLIFF 1.2 spec have guidelines for this activity?

"Segmentation" would be the process of taking a block of text and dividing it into smaller parts (segments). The text itself is not modified; it is only divided. SRX was developed to control the segmentation of text, but has nothing to do with the extraction of text.

Your example helps to separate these two activities.


<unit> defines the extracted text parts.  <segment> is redundant and provides no additional information at this point, so it should not be required.  

Once some type of segmentation is performed (whether it be sentence segmentation or segmentation based on some other rules), then the <unit> is further divided into translatable segments:


David

Corporate Globalization Tool Development
EMail:  waltersd@us.ibm.com          
Phone: (507) 253-7278,   T/L:553-7278,   Fax: (507) 253-1721

CHKPII:                    http://w3-03.ibm.com/globalization/page/2011
TM file formats:     http://w3-03.ibm.com/globalization/page/2083
TM markups:         http://w3-03.ibm.com/globalization/page/2071




From: Yves Savourel <ysavourel@enlaso.com>
To: <xliff@lists.oasis-open.org>
Date: 11/03/2011 02:56 PM
Subject: RE: [xliff] Segmentation as core or not
Sent by: <xliff@lists.oasis-open.org>




Hi Dave,

> This simple XLIFF file should not contain segmentation
> information because some tools may not care about
> segmentation.

I agree that some tools may not care about segmentation information. But they should still be able to work with files that have segmentation information. For example, a spell-checker should work with or without segmentation. But that is beside your point.

I think an XLIFF document always has some segmentation information. It's just not always the result of a sentence segmentation process.


> In  my opinion, the XLIFF "core" elements should be
> the minimum set of elements which are required to
> extract the source text from the original file
> format in such a way that the source text can be
> replaced by the translated text in the original file,
> and the translated file will be usable by the product.

I tend to agree with that.


> 3. Identify each contiguous block of text based on
> the source file's formatting rules.

I think this is the key: your #3 is segmentation representation. Extracting entries from any file is implicitly a segmentation process. It's not the same as trying to segment by sentences, which you may do later, but it is a form of segmentation.

A good proof of that is that somehow in your example, the filter decided that the content of <bi> should not have its own <trans-unit>: you are already applying some kind of segmentation rules.

The only difference is that such initial extraction-driven segmentation is usually not labeled as such and people tend to see only the sentence segmentation.


Imagine that your file is now in XLIFF 2.0. It is as follows:

<unit id="1">
<segment>
 <source>This is my document title</source>
</segment>
</unit>
<unit id="2">
<segment>
 <source>Document's short description</source>
</segment>
</unit>
<unit id="3">
<segment>
 <source>This document describes how the user is to use product <ph id="1"/>.  The first step is to press the <pc id="2">start</pc> button; there are no other actions.</source>
</segment>
</unit>

Then you decide to apply sentence segmentation and you end up with:

<unit id="1">
<segment>
 <source>This is my document title</source>
</segment>
</unit>
<unit id="2">
<segment>
 <source>Document's short description</source>
</segment>
</unit>
<unit id="3">
<segment>
 <source>This document describes how the user is to use product <ph id="1"/>.  </source>
</segment>
<segment>
 <source>The first step is to press the <pc id="2">start</pc> button; there are no other actions.</source>
</segment>
</unit>

Going from the first representation to the second is not really segmenting, it is re-segmenting.
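
A minimal sketch of that re-segmentation step, assuming Python's standard xml.etree.ElementTree and a deliberately naive rule (break after ".", "!" or "?" followed by whitespace) in place of real SRX rules. The names BREAK, split_source and resegment are hypothetical, and inline elements that would straddle a break are not handled.

```python
import copy
import re
import xml.etree.ElementTree as ET

# Naive sentence break (an assumption, not SRX): whitespace that follows
# '.', '!' or '?' and is itself followed by more text.
BREAK = re.compile(r"(?<=[.!?])\s+(?=\S)")


def split_source(source):
    """Split one <source> into several, breaking only in top-level text."""
    pieces = [ET.Element("source")]

    def add_text(text):
        # Append text after the last inline element, or as leading text.
        current = pieces[-1]
        if len(current):
            current[-1].tail = (current[-1].tail or "") + text
        else:
            current.text = (current.text or "") + text

    def feed(text):
        # Trailing whitespace stays with the sentence that precedes it.
        position = 0
        for match in BREAK.finditer(text):
            add_text(text[position:match.end()])
            pieces.append(ET.Element("source"))
            position = match.end()
        add_text(text[position:])

    feed(source.text or "")
    for child in source:
        clone = copy.deepcopy(child)  # keep the inline element intact
        tail = clone.tail or ""
        clone.tail = None
        pieces[-1].append(clone)
        feed(tail)
    return pieces


def resegment(unit):
    """Replace each <segment> of a <unit> by one <segment> per sentence."""
    for segment in list(unit.findall("segment")):
        index = list(unit).index(segment)
        unit.remove(segment)
        for offset, piece in enumerate(split_source(segment.find("source"))):
            new_segment = ET.Element("segment")
            new_segment.append(piece)
            unit.insert(index + offset, new_segment)


if __name__ == "__main__":
    unit = ET.fromstring(
        '<unit id="3"><segment><source>This document describes how the user '
        'is to use product <ph id="1"/>.  The first step is to press the '
        '<pc id="2">start</pc> button; there are no other actions.'
        '</source></segment></unit>'
    )
    resegment(unit)
    print(ET.tostring(unit, encoding="unicode"))
```

Applied to unit 3, this yields two <segment> elements, with the <ph> staying in the first and the <pc> in the second. The unit itself and its id are untouched, which is the sense in which only the segmentation changes.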

In addition, all entries use the same representation regardless of whether or not they have gone through additional segmentation after the extraction. This makes it very easy for tools (even XSLT ones) to work with the text without having to worry about looking at different kinds of elements.
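
To illustrate the point about uniform processing, here is a small sketch, again assuming Python's standard xml.etree.ElementTree. The function segment_texts and the <file> wrapper element are hypothetical, added only so the fragments parse as one document. The same loop handles the file before and after sentence segmentation.

```python
import xml.etree.ElementTree as ET

# The <file> wrapper is a hypothetical container, not part of the examples.
BEFORE = """<file>
<unit id="1"><segment><source>This is my document title</source></segment></unit>
<unit id="2"><segment><source>Document's short description</source></segment></unit>
<unit id="3"><segment><source>This document describes how the user is to use product <ph id="1"/>.  The first step is to press the <pc id="2">start</pc> button; there are no other actions.</source></segment></unit>
</file>"""

AFTER = """<file>
<unit id="1"><segment><source>This is my document title</source></segment></unit>
<unit id="2"><segment><source>Document's short description</source></segment></unit>
<unit id="3">
<segment><source>This document describes how the user is to use product <ph id="1"/>.  </source></segment>
<segment><source>The first step is to press the <pc id="2">start</pc> button; there are no other actions.</source></segment>
</unit>
</file>"""


def segment_texts(xml_text):
    """Return (unit id, plain source text) pairs, one per <segment>."""
    root = ET.fromstring(xml_text)
    result = []
    for unit in root.iter("unit"):
        for segment in unit.findall("segment"):
            source = segment.find("source")
            # itertext() flattens inline elements such as <ph/> and <pc>.
            result.append((unit.get("id"), "".join(source.itertext())))
    return result


if __name__ == "__main__":
    for label, document in (("before", BEFORE), ("after", AFTER)):
        print(label, len(segment_texts(document)), "segments")
```

A tool such as a spell-checker could run over the output of segment_texts without knowing whether a sentence segmentation step has taken place.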

This also has the drawback of not being able to know the (re-)segmentation status of the entries with a single <segment>, as Rodolfo pointed out yesterday. But we can come up with some solution for that.

-ys


