Subject: RE: [xliff] Segmentation as core or not

Hi David,

> ...taking the text out of the original source format,
> modifying that text so that it is compatible with a 
> new output format (i.e. replacing inline items with 
> XLIFF inline elements), and creating a new output 
> file"

It seems your definition of extraction includes an implicit segmentation:

When you "take the text out of the original source format", you have to take a selection of the original content, and that selection has to be based on some rules. They may not be called segmentation rules (or people may not think of them as segmentation rules), but that's what they are. Using <segment> in the output simply makes the result of that implicit segmentation explicit, which is a good thing.

> "Segmentation" would be the process to take a block 
> of text and divide it into smaller parts (segments).

It sounds reasonable. And that definition can fit the "take the text out of the original source format" of your definition of extraction.

> SRX was developed to control the segmentation 
> of text, but has nothing to do with the 
> extraction of text.

But SRX is not the only thing that drives segmentation. For example, the <withinTextRule/> element of ITS would be used in your XML example to make <bi> an element within its parent rather than a separate <unit>. It would also be used to specify sub-flows.

Those are segmentation rules. See section 6.8 of the ITS specification (http://www.w3.org/TR/its/#elements-within-text) and requirement #25 in the working draft used to define the requirements for ITS: http://www.w3.org/TR/2006/WD-itsreq-20060518/#elemseg

It's not about sentence-segmentation, but it's clearly about segmentation.
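
As a rough illustration of that point, here is a minimal Python sketch. It is not an ITS implementation: the sample document, the set WITHIN_TEXT (standing in for whatever a withinTextRule would select), and the function name extract_units are all assumptions made up for this example. Changing the set changes where the extracted units begin and end, which is a segmentation decision.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the elements an ITS withinTextRule would select.
WITHIN_TEXT = {"bi"}

def extract_units(elem, within_text):
    """Return the text units found under elem, in document order.

    Elements named in within_text stay inline in their parent's unit.
    Any other child element closes the current unit and is extracted separately.
    """
    units = []
    current = [elem.text or ""]

    def flush():
        text = "".join(current).strip()
        if text:
            units.append(text)
        current.clear()

    for child in elem:
        if child.tag in within_text:
            # Inline: keep the markup in the parent's unit.
            # ET.tostring() also serializes the text that follows the element.
            current.append(ET.tostring(child, encoding="unicode"))
        else:
            # Not inline: the child is a boundary and yields its own unit(s).
            flush()
            units.extend(extract_units(child, within_text))
            current.append(child.tail or "")
    flush()
    return units

doc = ET.fromstring("<p>Click <bi>Start</bi> now.</p>")
print(extract_units(doc, WITHIN_TEXT))  # ['Click <bi>Start</bi> now.']
print(extract_units(doc, set()))        # ['Click', 'Start', 'now.']
```

The same input gives one unit or three depending only on the rule set, even though no sentence-breaking rule is involved.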

> <unit> defines the extracted text parts.
> <segment> is redundant and provides no additional
> information at this point, so it should not 
> be required.  

"<unit> defines the extracted text parts" ...which initially correspond to single segments.
And <segment> holds one segment.

Having a single segment in the unit is just one case among others. It just happens to be the case that exists right after "extraction", before you apply further (optional) segmentation rules.
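
To make that concrete, here is a small Python sketch under stated assumptions: the element names follow the <unit>/<segment>/<source> structure discussed in this thread, without namespaces, and split_sentences is a deliberately naive, hypothetical splitter standing in for SRX or any other rule set. The unit has the same shape before and after; only the number of <segment> children changes.

```python
import re
import xml.etree.ElementTree as ET

def split_sentences(text):
    """Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def resegment(unit):
    """Replace each plain-text <segment> of unit with one <segment> per sentence.

    Simplification: inline markup is not handled, and the whitespace
    between sentences is dropped.
    """
    for seg in list(unit.findall("segment")):
        source = seg.find("source")
        parts = split_sentences(source.text or "")
        if len(parts) <= 1:
            continue  # nothing to split; the single segment stays as it is
        index = list(unit).index(seg)
        unit.remove(seg)
        for offset, part in enumerate(parts):
            new_seg = ET.Element("segment")
            ET.SubElement(new_seg, "source").text = part
            unit.insert(index + offset, new_seg)
    return unit

# Right after "extraction": one unit, one segment.
unit = ET.fromstring(
    "<unit id='1'><segment><source>First sentence. Second one.</source></segment></unit>"
)
resegment(unit)
print([s.findtext("source") for s in unit.findall("segment")])
```

A tool reading the unit iterates over its <segment> children in both states and does not need to know whether the optional step has run.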

If we say <segment> shouldn't be required when there is only one segment, then we could apply the same logic (they are redundant and provide no additional information) to the <source> elements and say they are not required when there is no target.

Both <segment> and <source> may look useless when a unit is made of a single segment and has no target, but making them optional would cause tools (and the schema) to have to deal with complicated conditions. It's much simpler and more efficient to make them required.
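
Here is a hedged sketch of what those conditions could look like in a tool. Both reader functions are hypothetical and use un-namespaced element names. The "optional" variant assumes one possible shorthand, where a unit or a segment may hold bare text instead of <source>; no draft defines that shorthand, it is only there to show the branching.

```python
import xml.etree.ElementTree as ET

def sources_required(unit):
    """Reader when <segment> and <source> are always present: one uniform path."""
    return [seg.findtext("source", default="") for seg in unit.findall("segment")]

def text_of(elem):
    """Text of the <source> child if there is one, else the element's own bare text."""
    source = elem.find("source")
    return (source.text if source is not None else elem.text) or ""

def sources_optional(unit):
    """Reader when both elements may be omitted (hypothetical shorthand forms)."""
    segments = unit.findall("segment")
    if not segments:
        # No <segment>: the unit holds <source> directly, or just bare text.
        return [text_of(unit)]
    # With <segment>: each one may or may not wrap its text in <source>.
    return [text_of(seg) for seg in segments]

full = ET.fromstring("<unit><segment><source>Hello</source></segment></unit>")
short = ET.fromstring("<unit>Hello</unit>")
print(sources_required(full))   # ['Hello']
print(sources_optional(full))   # ['Hello']
print(sources_optional(short))  # ['Hello']
```

Every tool, and the schema, would have to carry the extra cases of the second reader, and each writer would have to decide which of the equivalent forms to produce.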

Finally, even if <segment> were optional when the content has not been sentence-segmented, I still think it would be a fundamental part of the XLIFF structure: states, translation candidates, comments, and many other aspects of XLIFF need to be set at the segment level, and therefore they could not exist efficiently without <segment>. In other words, so many optional modules would need the "segment representation module" that it makes more sense to have the segmentation representation always available.

