OASIS Mailing List Archives

xliff message


Subject: RE: [xliff] Generic mechanism for translation candidate elements and other annotations

Hi Helena,


I am talking about characters here, regardless of the number of bytes a character may need for storage.


Offsets and lengths should be measured in character units, where a character is the entity defined by the Unicode Consortium as the basic unit of encoding for the Unicode character encoding.


We could also use code points, as suggested by Yves (Unicode defines a code point as any value in the Unicode codespace).
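To illustrate the difference for predominantly Japanese content, here is a minimal Python sketch. It assumes offsets are counted in code points, which is how Python indexes `str` values. The sample text and the helper name `span` are made up for illustration and are not part of any XLIFF proposal.

```python
# Python str values are sequences of Unicode code points, so len() and
# slicing count code points rather than bytes.
text = "翻訳メモリ"  # five Japanese characters, all in the Basic Multilingual Plane

code_points = len(text)                            # number of code points
utf8_bytes = len(text.encode("utf-8"))             # storage size in UTF-8
utf16_units = len(text.encode("utf-16-le")) // 2   # number of UTF-16 code units


def span(s: str, offset: int, length: int) -> str:
    """Return the substring selected by a code-point offset and length."""
    return s[offset:offset + length]


# A character outside the BMP is one code point but two UTF-16 code units.
rare = "𠮷"
rare_code_points = len(rare)
rare_utf16_units = len(rare.encode("utf-16-le")) // 2

print(code_points, utf8_bytes, utf16_units)  # 5 15 5
print(span(text, 2, 3))                      # メモリ
print(rare_code_points, rare_utf16_units)    # 1 2
```

An offset of 2 means the same thing whether the file is stored as UTF-8 or UTF-16, as long as every tool counts code points rather than bytes or code units.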




Rodolfo M. Raya       rmraya@maxprograms.com


From: xliff@lists.oasis-open.org [mailto:xliff@lists.oasis-open.org] On Behalf Of Helena S Chapman
Sent: Wednesday, March 07, 2012 2:32 PM
To: Rodolfo M. Raya
Cc: xliff@lists.oasis-open.org
Subject: RE: [xliff] Generic mechanism for translation candidate elements and other annotations


The definition of offset should be tightened. If the source content is in UTF-8 and predominantly Japanese or Chinese, what does an offset mean in that context?

From:        "Rodolfo M. Raya" <rmraya@maxprograms.com>
To:        <xliff@lists.oasis-open.org>
Date:        03/07/2012 11:10 AM
Subject:        RE: [xliff] Generic mechanism for translation candidate elements and other annotations
Sent by:        <xliff@lists.oasis-open.org>

> -----Original Message-----
> From: xliff@lists.oasis-open.org [mailto:xliff@lists.oasis-open.org] On Behalf
> Of Yves Savourel
> Sent: Wednesday, March 07, 2012 1:13 PM
> To: xliff@lists.oasis-open.org
> Subject: RE: [xliff] Generic mechanism for translation candidate elements
> and other annotations
>
> R> The effect of re-segmentation over matches is not new. This time we
> R> have to add processing expectations that require updating matches
> R> according to the changes.
> It's certainly true. But any change that wouldn't involve keeping the
> information about the original span of content that was associated with the
> match would essentially be a loss of information.
> But maybe that is OK.

Any change in segmentation will have an effect on operations that depend on the original structure of source text, regardless of the annotation model you select.

BTW, we need a mechanism for storing the history of changes made to a <unit>.

> R> There may be a need to know what section of <source> is being matched
> R> and the relevant information should live in the corresponding <match>
> R> element, keeping the original <source> clean. This can be done, for
> R> example, by using 2 attributes: one attribute indicates the offset
> R> where the match starts and the other indicates the length of the text
> R> matched (in both cases ignoring tags).
> Using start/length (or start/end) positions is something that we have not
> explored much.
> It could be a way to replace completely <mrk>.

It deserves to be explored. We should avoid altering <source> as much as possible.
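As a rough sketch of the idea, the Python below resolves an offset/length pair against the text of a <source> element while ignoring inline tags. The inline element `<pc>`, the sample sentence, and the function names are assumptions made for illustration only. The sketch also assumes that "ignoring tags" means counting only the text content of <source>, including text wrapped by inline elements.

```python
import xml.etree.ElementTree as ET


def plain_text(source_xml: str) -> str:
    """Return the text content of a <source> element with all tags dropped."""
    root = ET.fromstring(source_xml)
    return "".join(root.itertext())


def matched_span(source_xml: str, offset: int, length: int) -> str:
    """Return the text covered by a match, given its offset and length."""
    text = plain_text(source_xml)
    return text[offset:offset + length]


# Hypothetical source segment with one inline element.
source = '<source>Press the <pc id="1">Start</pc> button to begin.</source>'

print(plain_text(source))            # Press the Start button to begin.
print(matched_span(source, 10, 12))  # Start button
```

The <source> element itself is never modified; the match only carries two numbers.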

> Two issues come to mind with offsets:
> a) we would need to be extremely strict on how to handle white spaces.
> Currently there is room for choice by the tool.

Sure, we need to be strict regarding the way offsets are measured.

We can, for example, require space normalization for offset and length calculation when xml:space is set to "default". Normalization can be done by replacing every run of multiple white space characters with a single space character.
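A minimal Python sketch of such a normalization follows. It makes one assumption beyond what is stated above: a single tab, carriage return, or line feed is also replaced by a space, so that every run of XML white space becomes exactly one space. The function name is illustrative.

```python
import re

# XML white space characters: space, tab, carriage return, line feed.
_XML_WS_RUN = re.compile(r"[ \t\r\n]+")


def normalize_space(text: str) -> str:
    """Collapse each run of XML white space into a single space character.

    Assumption: a lone tab or line break is also turned into a space.
    """
    return _XML_WS_RUN.sub(" ", text)


raw = "Hello   world\n\tagain"
normalized = normalize_space(raw)

print(repr(normalized))          # 'Hello world again'
print(raw.find("world"))         # 8 (offset before normalization)
print(normalized.find("world"))  # 6 (offset after normalization)
```

The two offsets for "world" differ, which is why all tools would have to agree on whether offsets are measured before or after normalization.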

> b) any change to the content would require an update on all annotations.
> That may be a burdensome processing expectation.

It's also troublesome when you use <mrk>. Adjustments in segmentation imply adjustments in matching or other processes, regardless of the annotation model you select.
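To give an idea of what the update would involve with offsets, here is a hedged Python sketch that adjusts (offset, length) pairs after text is inserted. The policy shown is only one possible choice and is an assumption: spans after the insertion point are shifted, and spans that contain the insertion point grow.

```python
def shift_after_insert(spans, position, inserted):
    """Adjust (offset, length) pairs after `inserted` characters are added
    at `position`. Assumed policy, for illustration only."""
    updated = []
    for offset, length in spans:
        if position <= offset:
            offset += inserted   # span starts at or after the insertion point
        elif position < offset + length:
            length += inserted   # insertion point falls inside the span
        updated.append((offset, length))
    return updated


old_text = "Click the Save button"
spans = [(10, 4), (6, 8)]        # "Save" and "the Save"

# Insert "big " at position 10, right before "Save".
new_text = old_text[:10] + "big " + old_text[10:]
new_spans = shift_after_insert(spans, 10, len("big "))

for offset, length in new_spans:
    print(new_text[offset:offset + length])
```

With these inputs the first span still selects "Save" and the second one grows to "the big Save".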

> But it has its advantages too: for example overlapping and superposing spans
> are cleanly handled, unlike with <mrk> where you might have to keep track
> of the nesting order.

The biggest advantage is that <source> doesn't have to be altered with extra elements.

> I think Rodolfo's suggestion also brings up the question of <source> being
> read-only or not.

That was a basic unwritten principle from the early days of XLIFF 1.0. I would certainly make <source> read-only, allowing only merge and split operations for segmentation purposes.

> To me a modern XLIFF needs to be able to allow enriching the source
> content. So we have to find a way to annotate both content; whether it's
> using offsets or elements like <mrk>, or another solution.

Offsets offer a clean way, with the advantage that annotations could be easily removed without affecting <source>.

Offsets can also be used to annotate <target> elements without affecting the translation. There is no obstacle to using the same mechanism for <source>, <target> and perhaps other elements.
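A small Python sketch of offset-based annotations kept outside the text is shown below. The `Annotation` class, its field names, and the sample data are hypothetical; they only illustrate that overlapping spans need no nesting and that annotations can be dropped without touching the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """Hypothetical stand-off annotation: a kind plus an offset/length pair."""
    kind: str
    offset: int
    length: int


def annotated_text(text: str, ann: Annotation) -> str:
    """Return the part of `text` that the annotation refers to."""
    return text[ann.offset:ann.offset + ann.length]


source = "Click the Save button"

# The two spans partially overlap ("the Save" and "Save button").
annotations = [
    Annotation("match", 6, 8),
    Annotation("term", 10, 11),
]

covered = [annotated_text(source, a) for a in annotations]
print(covered)  # ['the Save', 'Save button']

# Removing every "term" annotation leaves the source text unchanged.
remaining = [a for a in annotations if a.kind != "term"]
print(len(remaining), source)
```

The same structure could point at the text of <target> or any other element, since it only depends on character positions.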

Rodolfo M. Raya       rmraya@maxprograms.com
