Subject: RE: [xliff-seg] Views on segments and segmentation
Hi all,

As discussed in the last SC meeting, we want to modify the definitions below to clarify what exactly “text based data” and “linguistically suitable for translation” mean. Some comments raised:
How about something like:

Segment
(n) “A piece of linguistic data (which may contain embedded encoded information such as mark-up) whose boundaries have been determined in the process of segmentation.”
(v) “To carry out segmentation.”

Segmentation
(n) “The division of linguistic data (which may contain embedded encoded information such as mark-up) into segments, where the content of each segment is a linguistically correct unit.”

I realise that this is a bit too complex, and I’m looking forward to reading your comments on how this can be improved.

Best regards,
Magnus

From:
Hi Christian,

Thank you for starting this thread! Here are my comments on this topic:

A) In my opinion a segment could perhaps be better defined as something like:

(n) “A piece of text based data that is linguistically suitable for translation.”

Such a definition allows for different types of segmentation, such as sentence based segmentation, paragraph based segmentation, and even phrase and term based segmentation.

Note that the word “segment” can also be used as a verb (e.g. “to segment a file”), in which case it could be defined as something like:

(v) “The process of dividing text based data into (segments) / (pieces that are individually linguistically suitable for translation).”

Segmentation in turn could be defined as something like:

(n) “The division of text based data into (segments) / (pieces that are individually linguistically suitable for translation).”

For optimal reuse of previous
translations, e.g. through a translation memory tool, experience shows that sentence based segmentation is usually the most efficient choice, though there are cases where paragraph segmentation or phrase segmentation can yield better results. Term based segmentation is problematic: even though the terms themselves may be suitable for individual translation, the surrounding text (without the terms) is often difficult to treat as segments, since it does not always make linguistic sense without the terms themselves.

B) Regarding SRX we touched upon this in
our first and second sub-committee meetings. SRX is a standard for expressing segmentation rules for data in TMX format. Thus we would need to present the data in a TMX compliant format in the XLIFF files in order to be able to apply SRX fully. The conclusion was that we need to look into the possibility of introducing TMX as a namespace in XLIFF files. For this to be possible, TMX must have an XML schema. Currently there is only a DTD available, and Yves has an action item to push the TMX committee to provide a schema that we could use. He will bring this up in the next TMX committee meeting.

Looking forward to additional comments on this topic!

Magnus

From:
Lieske, Christian [mailto:christian.lieske@sap.com]

Dear all,

During the sub-committee meeting on 23-Mar-04, I developed the feeling that I and possibly others would benefit from a discussion related to the notions of 'segment' and 'segmentation'. Since the statement of purpose for the sub-committee reads

"The XLIFF Segmentation Subcommittee goal is to recommend segmentation representations within an XLIFF document."

a common understanding of these notions seems to be vital.

Best regards,
Christian

A. Segment

One way of starting a discussion is to look at a kind of
standard definition for 'segment' in the realm of localization:

"A segment is what a program considers the smallest translatable unit, usually a sentence."

Starting from here, I wonder what to say about a French phrase like

"Chaque patient à l'hôpital a une carte vitale"

Here, 'carte vitale' is a term (along the lines of English "health insurance card"), something which from my understanding is the smallest translatable unit. Accordingly, I would see a concatenation of two segments in the French phrase.

One question which thus comes to mind when
I think about the sub-committee is the following: Would the work of the sub-committee result in a recommendation like "Phrases which contain terms which are available in a glossary attached to an XLIFF file are not to be split into different segments"?

B. Segmentation

I wonder how, for example, the sub-committee's work is related to TR-29 of the Unicode standard (see http://www.unicode.org/reports/tr29/) or the ongoing work at LISA related to the Segmentation Rules Exchange format (SRX). I have got the feeling that many observers will ask questions like this.
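
For readers unfamiliar with rule based segmentation of the kind SRX is meant to exchange, the following is a minimal Python sketch. It is not an SRX parser and does not come from the thread itself; the rule list, the abbreviation set, and the function name `segment` are all hypothetical, chosen only to illustrate how ordered break and no-break rules might interact.

```python
import re

# Hypothetical rules, loosely modelled on the idea of ordered break / no-break
# rules. Each rule is (is_break, pattern_before_position, pattern_after_position).
# The first rule that matches at a position decides, so exceptions come first.
RULES = [
    # Exception: do not break after a few known abbreviations.
    (False, r"\b(?:e\.g|i\.e|etc|Dr|Mr)\.\Z", r"\A\s"),
    # Break after sentence-final punctuation followed by whitespace and more text.
    (True, r"[.!?]\Z", r"\A\s+\S"),
]

def segment(text):
    """Split text into sentence-like segments using the rules above."""
    segments = []
    start = 0
    # Test every inter-character position (simple, not efficient).
    for pos in range(1, len(text)):
        before, after = text[:pos], text[pos:]
        for is_break, before_pat, after_pat in RULES:
            if re.search(before_pat, before) and re.search(after_pat, after):
                if is_break:
                    segments.append(text[start:pos].strip())
                    start = pos
                break  # first matching rule wins
    tail = text[start:].strip()
    if tail:
        segments.append(tail)
    return segments

if __name__ == "__main__":
    sample = "Use a TM tool, e.g. for reuse. It helps! Dr. Smith agrees."
    for seg in segment(sample):
        print(seg)
```

Under these assumed rules the sample is split into three segments, and the abbreviations "e.g." and "Dr." do not trigger a break. Paragraph, phrase, or term based segmentation would need a different rule set.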