RE: [xliff] Proposal for Segmentation Notation in XLIFF

Hi David,

Thank you for your feedback. I will try to comment on the different issues you have raised one-by-one, to make the discussion a bit easier to follow.

1) We chose the <mrk> element to represent segmentation, rather than introducing a new element (e.g. <seg>), because we feel that a new element inside the <target> content would cause too many potential incompatibility issues with existing XLIFF implementations. We felt it would also severely reduce the chances of our proposal being accepted as an amendment to the XLIFF 1.1 specification. However, we did agree that a dedicated element for representing segmentation would benefit the standard in the next major version, and I will raise this topic when we start our work on that.

2) Regarding segment boundaries, it is important to note that we must not make assumptions about what kind of segmentation will be represented. The purpose of the segmentation is to increase recycling rates when used with tools such as translation memories. Different CAT tools and translation memories use different segmentation algorithms, and not all of them require the entire text content to be segmented. Quite often markup such as tags also affects the segment boundaries. Some segmentation algorithms may, for example, exclude from the segment any tags or formatting that appear before and after a sentence, while others don’t. It is important to leave this flexibility to the segmentation tools rather than enforcing a particular approach in the XLIFF standard. I would also like to point out that it is still the entire content of the <target> element that makes up the actual full translation, rather than what is in the individual segments. The segments are there to help certain tools recycle content safely at the sub-<trans-unit> level.
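To make the last point concrete, here is a minimal Python sketch of a <target> whose segments do not cover all of its content. The fragment is hypothetical: the mtype="seg" and mid attributes and the helper names are assumptions made for illustration only, since this message says only that <mrk> elements mark the segments.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment. The mtype="seg" and mid attributes are assumptions;
# the space between the two <mrk> elements belongs to no segment.
TARGET_XML = (
    '<target>'
    '<mrk mtype="seg" mid="1">First sentence.</mrk>'
    ' '
    '<mrk mtype="seg" mid="2">Second sentence.</mrk>'
    '</target>'
)


def full_text(target):
    """Text of the whole <target>, including text outside any segment."""
    return "".join(target.itertext())


def segment_texts(target):
    """Text of each <mrk mtype="seg"> element, in document order."""
    return [
        "".join(mrk.itertext())
        for mrk in target.iter("mrk")
        if mrk.get("mtype") == "seg"
    ]


target = ET.fromstring(TARGET_XML)
whole = full_text(target)
segments = segment_texts(target)
```

Here the segments are the two sentences, while the full translation also contains the space between them. A tool that rebuilt the translation by concatenating segments alone would lose that space, which is why the <target> content as a whole remains the translation.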

3) The use of SRX has also been discussed in the segmentation subcommittee. In our most recent discussion we concluded that the specifics of embedding and/or referencing SRX should be pursued by the main XLIFF committee, in particular as it is likely to involve closer cooperation with other standards groups.

4) Whether segments should be represented by elements spanning the segment content was also discussed in detail, over a longer period, in the subcommittee. In the proposal that we all ultimately voted for, we chose to use <mrk> elements to span the segment content. Here are some of the reasons:

a. It is important to use a representation that is easy to process, and this approach has many benefits in that respect. In particular, XML DOM-based tools can be used to process the content, which is not easily achievable with some of the other suggested approaches.

b. Non-clonable <g> elements represent a bigger problem than whether or not segmentation is allowed. If non-clonable <g> elements are used so that the content they span may include more than single words or isolated expressions, they represent highly localisation-unfriendly content and are very likely to cause difficult problems during translation. Whether a segment can be broken inside such an element may be the smallest of the problems that tools would face. In this case it is actually rather an advantage that segmentation is not allowed at such points, as the non-clonable <g> element clearly represents a piece of content that must be translated as one piece, no matter what. Perhaps I can illustrate what I mean with an example translation from English to “Yoda-English” (for Star Wars fans):

<source>This is a <g>sentence. It has</g> markup.</source>

The translation into “Yoda-English” would be:

<target>A <g>sentence</g> this is. Markup <g>it has</g>.</target>

However, if the <g> element cannot be cloned, this is not possible, and as a result the content cannot be correctly localised. This is true irrespective of whether segments are introduced here or not.
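As a rough illustration of both points above (DOM-style processing and the cloning problem), the following Python sketch checks whether a target duplicates a <g> element that the source marks as non-clonable. The id and clone="no" attributes on the example elements and the function name are assumptions added for illustration; the example above shows bare <g> elements.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# The id and clone attributes are added here for illustration only.
SOURCE_XML = (
    '<source>This is a <g id="1" clone="no">sentence. It has</g> '
    'markup.</source>'
)
TARGET_XML = (
    '<target>A <g id="1">sentence</g> this is. '
    'Markup <g id="1">it has</g>.</target>'
)


def cloned_non_clonable_ids(source, target):
    """Return ids of <g> elements marked clone="no" in the source
    that occur more than once in the target."""
    non_clonable = {
        g.get("id") for g in source.iter("g") if g.get("clone") == "no"
    }
    counts = Counter(g.get("id") for g in target.iter("g"))
    return sorted(i for i in non_clonable if counts[i] > 1)


source = ET.fromstring(SOURCE_XML)
target = ET.fromstring(TARGET_XML)
violations = cloned_non_clonable_ids(source, target)
```

With the “Yoda-English” target, the check reports the <g> with id "1", because the translation needs that element twice. No segmentation markup is involved in this check, which matches the point that the problem exists with or without segments.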

I hope this addresses your questions and concerns, and I look forward to an interesting discussion on this topic later today.

Best regards,

Magnus Martikainen

From: David Pooley [mailto:dpooley@sdl.com]
Sent: Monday, March 14, 2005 4:42 AM
To: 'xliff'
Subject: RE: [xliff] Proposal for Segmentation Notation in XLIFF

I'm more than a little concerned that non-clonable <g> elements prohibit segmentation of text. I'm also unclear as to why it is necessary to potentially exclude any text from the original <source> when marking the segment boundaries. In this case, we can have the situation where the sum of the parts does not equal the whole. If SRX (which is based on Unicode TR-29) is being considered for use with XLIFF, note that this standard defines where a segmentation break should occur, not where a segment begins and ends. As such, there's no provision for excluding text once it is segmented. Given the amount of assumed functionality that is being passed on to the XLIFF editor, I think it would be reasonable to assume that this editor would also be capable of stripping unwanted whitespace from the start or end of the segment where necessary.

Is there a documented reason why the <mrk> element was chosen to represent segmentation and not a new, empty element such as <seg/>?

David Pooley
Software Architect
SDL International

-----Original Message-----
From: Magnus Martikainen [mailto:Magnus@trados.com]
Sent: Tuesday, March 08, 2005 7:22 PM
To: xliff
Subject: [xliff] Proposal for Segmentation Notation in XLIFF

Hi all,

The segmentation subcommittee has voted unequivocally to put forward the following proposal to the main XLIFF committee on how to represent segmentation in XLIFF files.

I would hereby like to request a formal review of the proposal by the XLIFF Committee for its inclusion in the XLIFF draft specification.

The following document explains and details the proposed changes to the XLIFF specification:

http://www.oasis-open.org/apps/org/workgroup/xliff-seg/download.php/11359/seg-proposal.htm

Best regards,

Magnus Martikainen

on behalf of the XLIFF Segmentation Subcommittee
