OASIS Mailing List Archives

 



xliff message



Subject: RE: [xliff] Segmentation as core or not


Hi Dave,

> This simple XLIFF file should not contain segmentation 
> information because some tools may not care about 
> segmentation.

I agree that some tools may not care about segmentation information. But they still should be able to work with files that have segmentation information. For example a spell-checker should work with or without segmentation. But that is beside your point.

I think an XLIFF document always has some segmentation information. It's just not always the result of a sentence segmentation process.


> In  my opinion, the XLIFF "core" elements should be
> the minimum set of elements which are required to 
> extract the source text from the original file 
> format in such a way that the source text can be 
> replaced by the translated text in the original file,
> and the translated file will be usable by the product.

I tend to agree with that.


> 3. Identify each contiguous block of text based on
> the source file's formatting rules.

I think this is the key: your #3 is segmentation representation. Extracting entries from any file is implicitly a segmentation process. It's not the same as trying to segment by sentences, which you may do later, but it is a form of segmentation.

A good proof of that is that somehow in your example, the filter decided that the content of <bi> should not have its own <trans-unit>: you are already applying some kind of segmentation rules.

The only difference is that such initial extraction-driven segmentation is usually not labeled as such and people tend to see only the sentence segmentation.


Imagine that your file is now in XLIFF 2.0. It looks as follows:

<unit id="1">
 <segment>
  <source>This is my document title</source>
 </segment>
</unit>
<unit id="2">
 <segment>
  <source>Document's short description</source>
 </segment>
</unit>
<unit id="3">
 <segment>
  <source>This document describes how the user is to use product <ph id="1"/>.  The first step is to press the <pc id="2">start</pc> button; there are no other actions.</source>
 </segment>
</unit>

Then you decide to apply sentence segmentation and you end up with:

<unit id="1">
 <segment>
  <source>This is my document title</source>
 </segment>
</unit>
<unit id="2">
 <segment>
  <source>Document's short description</source>
 </segment>
</unit>
<unit id="3">
 <segment>
  <source>This document describes how the user is to use product <ph id="1"/>.  </source>
 </segment>
 <segment>
  <source>The first step is to press the <pc id="2">start</pc> button; there are no other actions.</source>
 </segment>
</unit>

Going from the first representation to the second is not really segmenting, it is re-segmenting.

In addition, all entries use the same representation regardless of whether they have gone through additional segmentation after the extraction. This makes it very easy for tools (even XSLT ones) to work with the text without having to worry about looking at different kinds of elements.
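
To illustrate the point about a single representation, here is a minimal Python sketch. It is only an illustration under assumptions: the <doc> wrapper element and the function name segment_texts are hypothetical (the fragments above have no single root element), and real XLIFF 2.0 files would also carry namespaces and <file> elements that this sketch ignores. It reads the re-segmented example and walks every <segment> the same way, whether the unit has one segment or several.

```python
import xml.etree.ElementTree as ET

# The re-segmented example from this message, wrapped in a hypothetical
# <doc> root so that it parses as a single XML document.
XLIFF_FRAGMENT = """<doc>
<unit id="1">
 <segment>
  <source>This is my document title</source>
 </segment>
</unit>
<unit id="2">
 <segment>
  <source>Document's short description</source>
 </segment>
</unit>
<unit id="3">
 <segment>
  <source>This document describes how the user is to use product <ph id="1"/>.  </source>
 </segment>
 <segment>
  <source>The first step is to press the <pc id="2">start</pc> button; there are no other actions.</source>
 </segment>
</unit>
</doc>"""


def segment_texts(xml_text):
    """Return (unit id, plain source text) pairs, one per <segment>."""
    root = ET.fromstring(xml_text)
    pairs = []
    for unit in root.iter("unit"):
        # Same loop for every unit: no special case for units that
        # were never sentence-segmented.
        for segment in unit.findall("segment"):
            source = segment.find("source")
            # itertext() also collects text inside inline codes like <pc>.
            pairs.append((unit.get("id"), "".join(source.itertext())))
    return pairs


if __name__ == "__main__":
    for unit_id, text in segment_texts(XLIFF_FRAGMENT):
        print(unit_id, repr(text))
```

The same loop handles unit 1 (one segment) and unit 3 (two segments), which is the benefit of not having a separate element type for unsegmented entries.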

The drawback is that one cannot tell the (re-)segmentation status of entries with a single <segment>, as Rodolfo pointed out yesterday. But we can come up with some solution for that.

-ys



