
xliff-seg message



Subject: RE: [xliff-seg] Views on segments and segmentation


Hi all,

 

As discussed in the last SC meeting, we want to modify the definitions below to clarify what exactly “text based data” and “linguistically suitable for translation” mean. Some of the comments raised:

 

  • It is not clear from the term “text based data” that this can include mark-up codes, or may in fact contain no text at all.
  • “Linguistically suitable for translation” on its own does not mean that the entire content is divided into pieces that can be individually translated.

 

 

How about something like:

Segment

(n) “A piece of linguistic data (which may contain embedded encoded information such as mark-up) whose boundaries have been determined in the process of segmentation.”

(v) “To carry out segmentation.”

 

Segmentation

(n) “The division of linguistic data (which may contain embedded encoded information such as mark-up) into segments, where the content of each segment is a linguistically correct unit.”
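As a rough illustration of these two definitions, here is a minimal Python sketch. It assumes a naive sentence rule (break after ".", "!" or "?" followed by whitespace) and simple inline tags; the names `segment` and `visible_text` are hypothetical and not part of XLIFF or any tool. It shows segments that keep their embedded mark-up, and one segment that contains no text at all.

```python
import re

# Assumption: inline mark-up looks like simple <tag> ... </tag> codes.
TAG = re.compile(r"<[^>]+>")

def segment(data):
    """Naive sentence split that leaves inline mark-up inside the segments."""
    return [s for s in re.split(r"(?<=[.!?])\s+", data) if s]

def visible_text(seg):
    """Return the segment with all mark-up codes removed."""
    return TAG.sub("", seg)

data = "Press <b>Save</b>. Then close the window. <br/>"
segments = segment(data)
for seg in segments:
    print(repr(seg), "->", repr(visible_text(seg)))
```

The last segment consists of mark-up only, which is the case the comment about “text based data” was concerned with.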

 

I realise that this is a bit too complex, and I’m looking forward to reading your comments on how this can be improved.

 

Best regards,

Magnus

 


From: Magnus Martikainen [mailto:magnus@trados.com]
Sent: Tuesday, March 30, 2004 2:16 PM
To: Lieske, Christian; 'xliff-seg@lists.oasis-open.org'
Subject: RE: [xliff-seg] Views on segments and segmentation

 

Hi Christian,

 

Thank you for starting this thread!

Here are my comments on this topic:

 

A) In my opinion a segment could perhaps be better defined as something like:

(n) “A piece of text based data that is linguistically suitable for translation.”

Such a definition allows for different types of segmentation, such as sentence based segmentation, paragraph based segmentation, and even phrase and term based segmentation.

Note that the word “segment” can also be used as a verb (e.g. “to segment a file”), in which case it could be defined as something like:

(v) “The process of dividing text based data into (segments) / (pieces that are individually linguistically suitable for translation).”

 

Segmentation in turn could be defined as something like:

(n) “The division of text based data into (segments) / (pieces that are individually linguistically suitable for translation).”

 

For optimal reuse of previous translations, e.g. through a translation memory tool, experience shows that sentence based segmentation is the most efficient choice in most cases, though there are cases where paragraph segmentation or phrase segmentation can yield better results. Term based segmentation is problematic: even though the terms themselves may be suitable for individual translation, the surrounding text (without the terms) is often difficult to treat as segments, since it does not always make linguistic sense without the terms themselves.
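To make the difference between the segmentation types concrete, here is a minimal Python sketch. The splitting rules are assumptions chosen for illustration (blank lines delimit paragraphs; ".", "!" or "?" followed by whitespace ends a sentence) and are far simpler than what a real translation memory tool would use; the function names are hypothetical.

```python
import re

def paragraph_segments(text):
    """Paragraph based segmentation: split on blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def sentence_segments(text):
    """Sentence based segmentation: split each paragraph after ., ! or ?."""
    segments = []
    for paragraph in paragraph_segments(text):
        segments.extend(s for s in re.split(r"(?<=[.!?])\s+", paragraph) if s)
    return segments

sample = "Open the file. Save your changes.\n\nClose the application."
print(paragraph_segments(sample))
print(sentence_segments(sample))
```

With sentence based segmentation, "Save your changes." can be reused on its own wherever it occurs again; with paragraph based segmentation it only matches when the whole first paragraph is repeated.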

 

B) Regarding SRX: we touched upon this in our first and second sub-committee meetings.

SRX is a standard for expressing segmentation rules for data in TMX format. Thus we would need to present the data in a TMX compliant format in the XLIFF files in order to be able to apply SRX fully. The conclusion was that we need to look into the possibility of introducing TMX as a namespace in XLIFF files. For this to be possible, TMX must have an XML schema. Currently only a DTD is available, and Yves has an action item to push the TMX committee to provide a schema that we could use. He will bring this up in the next TMX committee meeting.

 

Looking forward to additional comments on this topic!

 

Magnus

 


From: Lieske, Christian [mailto:christian.lieske@sap.com]
Sent: Wednesday, March 24, 2004 11:55 PM
To: 'xliff-seg@lists.oasis-open.org'
Subject: [xliff-seg] Views on segments and segmentation

 

Dear all,

 

During the sub-committee meeting on 23-Mar-04, I developed the feeling that I, and possibly others, would benefit from a discussion of the notions of 'segment' and 'segmentation'.

 

Since the statement of purpose for the sub-committee reads

 

"The XLIFF Segmentation Subcommittee goal is to recommend segmentation representations within an XLIFF document."

 

a common understanding of these notions seems to be vital.

 

Best regards,

Christian

 

A. Segment

 

One way of starting a discussion is to look at a kind of standard definition for 'segment' in the realm of localization:

 

"A segment is what a program considers the smallest translatable unit, usually a sentence."

 

Starting from here, I wonder what to say about a French phrase like

 

"Chaque patient à l'hôpital a une carte vitale"

 

Here, 'carte vitale' is a term (along the lines of English "health insurance card"), something which, from my understanding, is the smallest translatable unit. Accordingly, I would see a concatenation of two segments in the French phrase.

 

One question which thus comes to mind when I think about the sub-committee is the following: Would the work of the sub-committee result in a recommendation like "Phrases which contain terms which are available in a glossary attached to an XLIFF file are not to be split into different segments"?
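As an illustration of the question above, here is a minimal Python sketch of what splitting the French phrase at glossary terms would produce. The function `split_on_terms` and the one-entry glossary are hypothetical; nothing here is prescribed by XLIFF.

```python
import re

def split_on_terms(text, glossary):
    """Split text so that every glossary term becomes a segment of its own."""
    if not glossary:
        return [text]
    # Longest terms first, so that overlapping entries prefer the longer match.
    terms = sorted(glossary, key=len, reverse=True)
    pattern = "(" + "|".join(re.escape(t) for t in terms) + ")"
    return [part for part in re.split(pattern, text) if part]

phrase = "Chaque patient à l'hôpital a une carte vitale"
segments = split_on_terms(phrase, ["carte vitale"])
print(segments)
```

The term comes out as one segment, while the remainder, "Chaque patient à l'hôpital a une ", is left as a fragment that is hard to translate on its own.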

 

B. Segmentation

 

I wonder how, for example, the sub-committee's work is related to TR-29 of the Unicode standard (see http://www.unicode.org/reports/tr29/) or the ongoing work at LISA related to the Segmentation Rules Exchange format (SRX). I have the feeling that many observers will ask questions like this.

 


