Subject: RE: [xliff-seg] Segmentation representation and scenario
Hi Yves,

Thanks for your suggestion! There are definitely quite a few interesting benefits with this implementation option. I have a few comments:

1) Logically there is a strong coupling between the source and target versions of a segment (the <mrk> in the <source> and the corresponding <mrk> in the <target>). In my opinion this connection is not explicit enough in your suggested representation, since it cannot be represented in the schema. At least to my knowledge, there is no way to define a schema or DTD that validates that each segment in the <source> has a corresponding segment in the <target> and vice versa.

2) There are a number of properties that apply to individual source and target segment pairs, and they cannot easily be represented in a natural way that shows their connection to the segments.

a) Of the current attributes available for <trans-unit>, at least the following would also be useful for segments: approved, translate, phase-name.

b) Alternative translations, fuzzy matches, etc., as represented by <alt-trans> for the <trans-unit>. Though it would be possible to use <alt-trans> at the <trans-unit> level to store these matches, that would be misleading, as they don't apply to the entire <trans-unit>. For example, an <alt-trans> can be an exact match for a segment, but it cannot be marked as an exact match because that would be interpreted as an exact match for the whole <trans-unit>.

3) If <g> is used inside the <trans-unit> and spans a segment boundary, we cannot use this implementation, as the <mrk> and <g> elements would overlap and the XML would not be well-formed. Instead we would need to resort to using empty <mrk/> elements to show the start and end of the segments, perhaps something like this:

  <trans-unit id="9">
    <source xml:lang="en"><mrk mid="9-1" mtype="segment-start"/>This is <g>the first sentence. <mrk mid="9-1" mtype="segment-end"/> <mrk mid="9-2" mtype="segment-start"/>Second part</g> of the segment. <mrk mid="9-2" mtype="segment-end"/></source>
    ...
  </trans-unit>

This works, but it is not pretty. It makes the coupling between segment markers even looser, both in source and target. With this representation it is not even possible to use a schema to validate that each segment start has a corresponding segment end.

The use of <g> will cause problems for segmentation whichever representation we choose, and it may in fact turn out that this representation is one of the few that can handle it in some way at all...

Cheers,
Magnus

-----Original Message-----
From: Yves Savourel [mailto:ysavourel@translate.com]
Sent: Tuesday, May 11, 2004 7:17 AM
To: XLIFF Segmentation
Subject: [xliff-seg] Segmentation representation and scenario

Some ideas on segmentation representation:

For representing the segmentation inside a <trans-unit> I would use the <mrk> element:

  <trans-unit id='2'>
    <source xml:lang='en'><mrk mid='2-1' mtype='phrase'>This is the second entry of the file.</mrk> <mrk mid='2-2' mtype='phrase'>This is the second sentence of the second entry.</mrk></source>
    <target xml:lang='fr'><mrk mid='2-1' mtype='phrase'>Ceci est la première entrée du fichier.</mrk> <mrk mid='2-2' mtype='phrase'>Ceci est la seconde phrase de la première entrée.</mrk></target>
  </trans-unit>

- It's part of the existing specifications.
- It's unintrusive: mergers are supposed to ignore it.
- We can have a set of specific extended attributes if we want to store sentence-level information.
- We would probably need to add an mtype value specific to a 'segment' ('phrase' is not good enough).

I agree that translation tools should be able to provide their own segmentation within a <trans-unit>, and to do so during the translation itself (by the translator). I also think that a translation tool should be able to use any existing match at the <trans-unit> level as well: there is no reason to go to a finer granularity if a match is already available at the <trans-unit> level.

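As a rough illustration of the <mrk>-based representation above, and of the source/target pairing that a schema or DTD cannot enforce, here is a minimal Python sketch. It assumes segments are paired purely by their mid values; the helper names (segment_map, pair_segments) and the sample unit are made up for illustration and are not part of XLIFF or of any tool.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample unit, following the <mrk> layout discussed above.
SAMPLE = """<trans-unit id='2'>
<source xml:lang='en'><mrk mid='2-1' mtype='phrase'>Hello.</mrk> <mrk mid='2-2' mtype='phrase'>Thank you.</mrk></source>
<target xml:lang='fr'><mrk mid='2-1' mtype='phrase'>Bonjour.</mrk> <mrk mid='2-2' mtype='phrase'>Merci.</mrk></target>
</trans-unit>"""


def segment_map(container):
    """Map each mid to the text of its <mrk>, flattening any inline markup."""
    segments = {}
    for mrk in container.iter("mrk"):
        mid = mrk.get("mid")
        if mid in segments:
            # A repeated mid makes the pairing ambiguous, so reject it.
            raise ValueError(f"duplicate mid {mid!r}")
        segments[mid] = "".join(mrk.itertext())
    return segments


def pair_segments(trans_unit):
    """Return (pairs, unmatched) for one <trans-unit> element.

    pairs     : list of (mid, source_text, target_text) in source order
    unmatched : sorted mids present on only one side
    """
    source = segment_map(trans_unit.find("source"))
    target = segment_map(trans_unit.find("target"))
    pairs = [(mid, source[mid], target[mid]) for mid in source if mid in target]
    unmatched = sorted(set(source) ^ set(target))
    return pairs, unmatched


unit = ET.fromstring(SAMPLE)
pairs, unmatched = pair_segments(unit)
for mid, src, tgt in pairs:
    print(mid, "|", src, "|", tgt)
print("unmatched:", unmatched)
```

A check like this has to live in the tool rather than in the schema, which is the looseness of coupling raised in the reply. It does not cover the empty start/end marker variant, where the segment text is not the content of a single element.
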
This said, there is obviously a threshold of usability for fuzzy matches at the <trans-unit> level. That threshold most likely depends on the size of the text in the <trans-unit>, as for large units the differences between the new source and the old one may be more difficult to see.

I think a translation process should be able to take advantage of such high matches obtained without the translation tool and without segmentation of the <trans-unit> content. Translation tools should allow the verification of such matches during the translation.

For example, one can imagine a project where version 2 of a software product is to be localized. A version 1 with translation exists, but no TM. One can easily create a "TM" for <trans-unit> level entries without complex tools. One should be able to re-use high matches from that "TM" regardless of what segmentation is used by the translation tools.

Cheers,
-yves
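
The version 1 / version 2 scenario above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the "TM" is a plain dictionary of <trans-unit> level source/target pairs, the similarity score comes from difflib.SequenceMatcher as a stand-in for a real fuzzy-match algorithm, and the 0.85 threshold is an arbitrary value (it does not model the dependence on text size mentioned above). The function names build_tm and best_match are hypothetical.

```python
from difflib import SequenceMatcher


def build_tm(v1_units):
    """Build a trans-unit level "TM" from (source_text, target_text) pairs."""
    return dict(v1_units)


def best_match(tm, new_source, threshold=0.85):
    """Return (score, old_source, old_target) for the best match, or None.

    The score is difflib's similarity ratio in [0, 1]; the threshold is an
    assumed cut-off below which a match is not considered usable.
    """
    best = None
    for old_source, old_target in tm.items():
        score = SequenceMatcher(None, old_source, new_source).ratio()
        if score >= threshold and (best is None or score > best[0]):
            best = (score, old_source, old_target)
    return best


# Hypothetical version 1 content, keyed at the <trans-unit> level.
tm = build_tm([
    ("Open the file.", "Ouvrir le fichier."),
    ("Close the file.", "Fermer le fichier."),
])

# A version 2 string that changed slightly since version 1.
print(best_match(tm, "Open the files."))
```

Because the lookup works on whole <trans-unit> content, it is independent of whatever segmentation the translation tool applies later; the tool would still need to let the translator verify the proposed match.
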