Subject: XLIFF 2.0 in 3D (Testing the friendship between XLIFF and TMX)

From: Asgeir Frimannsson <asgeirf@redhat.com>
To: xliff <xliff@lists.oasis-open.org>
Date: Tue, 7 Apr 2009 00:25:21 -0400 (EDT)

Hi all,

I haven't - as far as I remember - made any on-topic comments in the recent TMX<>XLIFF debate, even though I see my name referenced in a few places. Please excuse me for thinking aloud:

I have been (and to some extent still happily am) selectively ignorant about the TMX format. The more interesting TM systems out there do far more advanced processing with TM data than can be represented in the TMX format. In other words, with innovation in the TM space, I believe the value of TMX-based exchange *as we see it today* will decrease or become irrelevant.

I am starting to doubt that finding a common model for inline content will be feasible, as inline content is just one aspect of a larger content model encompassing document meta-data and block-level structure. TMX does not have a graph-based or hierarchical content model that can represent e.g. inline flows of text in separate TUs, and it seems to me that there is a lot of data loss in TMX, as content is mostly represented as a flat list of segmented translation units. XLIFF 1.x has some of these issues as well, and it would have been nice to eliminate e.g. the need for <sub> within XLIFF.
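
To illustrate the kind of data loss I mean, here is a minimal Python sketch. The Block class and the flatten function are names I made up for this mail; they are not part of TMX, XLIFF, or any tool, and the real formats are of course richer than this.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Block:
    """Hypothetical node in a hierarchical content model."""
    kind: str                      # e.g. "document", "heading", "list", "item"
    text: str = ""                 # translatable text, if the node has any
    children: List["Block"] = field(default_factory=list)


def flatten(block: Block) -> List[str]:
    """Reduce the tree to a flat list of segments, as a flat TU list would.

    The node kinds and the nesting are dropped on purpose: that is the loss.
    """
    segments = [block.text] if block.text else []
    for child in block.children:
        segments.extend(flatten(child))
    return segments


doc = Block("document", children=[
    Block("heading", "Installation"),
    Block("list", children=[
        Block("item", "Download the archive."),
        Block("item", "Run the installer."),
    ]),
])

flat = flatten(doc)
print(flat)
```

After flattening, nothing in the list says that the first segment was a heading or that the other two were sibling items in the same list, so the original structure cannot be rebuilt from the flat list alone.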

This brings me to the point of integration with XLIFF. There is always going to be data loss going from XLIFF to TMX, as one is not a superset of the other. We can minimize this loss on the TU/segment-level by designing a similar-enough content model, but I am not convinced of the long-term benefit of *only* doing this.

Most intelligent TM systems on the market today do not convert to TMX or XLIFF before indexing and storing new content. TM systems that go beyond simple segment matches work directly with native formats or an intermediate internal format. The ongoing challenge for XLIFF adoption is to become rich and flexible enough to capture and represent the wide range of native file formats in a way that enables tool developers to work directly with XLIFF rather than building custom data models for their tools.

Let me present 3 dimensions of an ideal resource format (very intuitive - no rocket science here):

1st Dimension: Monolingual template
- Holds a structural representation of the native format
- The home of source-level segmentation

2nd Dimension: Bilingual template-instance
- Holds the target-language translations
- Can encompass target-language TM suggestions, etc.

3rd Dimension: Multilingual exchange format
- Holds translations for multiple languages
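
As a rough sketch only, the three dimensions could be modelled along these lines in Python. All class and field names (Segment, Template, BilingualInstance, MultilingualExchange) are hypothetical and chosen for this mail; none of them come from the XLIFF or TMX specifications.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Segment:
    id: str
    source: str


@dataclass
class Template:
    """1st dimension: monolingual template with source-level segmentation."""
    source_lang: str
    segments: List[Segment]


@dataclass
class BilingualInstance:
    """2nd dimension: one target language layered over a template."""
    template: Template
    target_lang: str
    targets: Dict[str, str] = field(default_factory=dict)  # segment id -> translation


@dataclass
class MultilingualExchange:
    """3rd dimension: several bilingual instances sharing one template."""
    template: Template
    instances: Dict[str, BilingualInstance] = field(default_factory=dict)

    def add(self, instance: BilingualInstance) -> None:
        # Only instances built on the very same template can be combined.
        if instance.template is not self.template:
            raise ValueError("instance was built from a different template")
        self.instances[instance.target_lang] = instance

    def translations(self, segment_id: str) -> Dict[str, str]:
        """Return every available translation of one segment, keyed by language."""
        return {
            lang: inst.targets[segment_id]
            for lang, inst in self.instances.items()
            if segment_id in inst.targets
        }


template = Template("en", [Segment("s1", "Open file"), Segment("s2", "Save file")])
exchange = MultilingualExchange(template)
exchange.add(BilingualInstance(template, "de", {"s1": "Datei öffnen"}))
exchange.add(BilingualInstance(template, "nb", {"s1": "Åpne fil", "s2": "Lagre fil"}))
print(exchange.translations("s1"))
```

The point of the sketch is that the multilingual level reuses the same template and segment ids as the bilingual level, rather than living in a separate format with its own content model.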

XLIFF 1.x primarily addressed (2) - but also indirectly addressed (1) to some extent. My question is whether using a totally separate format to address the 3rd dimension is perhaps a bit counter-productive. Would a multilingual XLIFF exchange format help catalyze/drive the adoption of the format? And would this allow greater TM exchange on the resource/document level going forward?

Thankfully, you can get away with a lot by "thinking aloud" :-) There is of course an immediate need for interoperability between TMX and XLIFF on the inline content level, and naturally TMX has a lot of non-XLIFF related use-cases. TMX 2.0 and the new itag seem to me to be a bit of a shortcut, or a simplification of a more complex problem.

cheers,
asgeir