xliff-comment message

Subject: RE: [xliff-comment] using xliff to translate html

From: "Yves Savourel" <ysavourel@translate.com>
To: "Brian Stell" <brianstell00@aol.com>, xliff-comment@lists.oasis-open.org
Date: Thu, 4 Sep 2003 15:11:19 -0600

Hi Brian,

I'll try to answer your questions:


> Is it intended that the internal structure of
> the source/target elements to show up as
> elements in the XLIFF DOM?

Yes.


> If XLIFF is intended to be a holder then I'm
> unclear on the advantages of forcing the
> source/target data to be well formed XML.

Yes, XLIFF is intended to be a holder of text, possibly with inline
codes.

One of the aims of XLIFF is to *abstract* the translation units that have
inline codes, so that, regardless of what the original codes are, they can
be processed in a uniform way for most localization tasks (translation
memory matching, spell-checking, word counting, terminology extraction,
etc.).

A small example:

Original code in RTF:
"The picture is {\b missing}."
XLIFF content:
<source>The picture is <bpt id='1'>{\b </bpt>missing<ept
id='1'>}</ept>.</source>

Original code in HTML:
"The picture is <B>missing</B>."
XLIFF content:
<source>The picture is <bpt id='1'>&lt;B></bpt>missing<ept
id='1'>&lt;/B></ept>.</source>

The idea is that, in both cases, the XLIFF content is equivalent and already
parsed (from the point of view of the original format). In other words: text
is already separated from codes. Actually, using the <g> tags you could even
write the same content for both formats:

<source>The picture is <g id='1'>missing</g>.</source>

This allows tools to treat the inline codes without distinction. For
example, we could get a 100% match when leveraging the RTF text in an HTML
file.
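
To make this concrete, here is a rough sketch in Python of how a tool could
compare the two <bpt>/<ept> examples above. The function name abstract_codes
and the placeholder syntax are only my assumptions for illustration; they are
not part of XLIFF or of any existing tool.

```python
import xml.etree.ElementTree as ET

# The two <source> elements from the RTF and HTML examples above.
RTF_UNIT = r"<source>The picture is <bpt id='1'>{\b </bpt>missing<ept id='1'>}</ept>.</source>"
HTML_UNIT = "<source>The picture is <bpt id='1'>&lt;B></bpt>missing<ept id='1'>&lt;/B></ept>.</source>"


def abstract_codes(xml_text):
    """Return the text of a <source> element with each inline code
    replaced by a placeholder (hypothetical helper, for illustration)."""
    source = ET.fromstring(xml_text)
    parts = [source.text or ""]
    for code in source:
        # Keep the element name and id, drop the native code it wraps.
        parts.append("{%s:%s}" % (code.tag, code.get("id")))
        # The tail is the translatable text that follows the inline code.
        parts.append(code.tail or "")
    return "".join(parts)


rtf_key = abstract_codes(RTF_UNIT)
html_key = abstract_codes(HTML_UNIT)
print(rtf_key == html_key)
```

Both units reduce to the same string, so a translation memory keyed on it
would not need to know which format the text came from.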


> Would there be an advantage to allowing or
> making source/target data CDATA? It would
> remove the requirement that the  source/target
> data be well formed XML. In my case this would
> make the handling of HTML much much simpler.

If we had the content as CDATA:

<source><![CDATA[The picture is {\b missing}.]]></source>
<source><![CDATA[The picture is <B>missing</B>.]]></source>

all the translation tools would have to come up with their own parsing for
both formats (and any other format), and would have to redo it every time
they manipulate the source/target content.
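
As a small illustration (again only a sketch in Python, with variable names
of my own choosing), this is what a generic XML parser sees for the two CDATA
versions: one opaque string each, no child elements, and no way to tell that
the two are the same sentence.

```python
import xml.etree.ElementTree as ET

# The two CDATA variants shown above.
CDATA_RTF = r"<source><![CDATA[The picture is {\b missing}.]]></source>"
CDATA_HTML = "<source><![CDATA[The picture is <B>missing</B>.]]></source>"

rtf = ET.fromstring(CDATA_RTF)
html = ET.fromstring(CDATA_HTML)

# CDATA content is plain character data: the inline codes are not elements.
rtf_children = len(rtf)
html_children = len(html)

# The raw strings still carry the native codes, so they do not compare equal.
texts_match = rtf.text == html.text

print(rtf_children, html_children, texts_match)
```

Any matching would then need an RTF parser and an HTML parser inside each
tool, applied to the text of every unit.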

The need for pre-parsing comes from the goal of having a common way to
understand and manipulate the inline codes, regardless of the original
format (HTML, RTF, MIF, RC, RESX, Java properties, JSP, Photoshop files,
etc.).

Keep in mind also that, as with other formats, only the inline elements of
HTML (<b>, <em>, <img>, etc.) will be in the source/target content, not any
of the structural elements (<table>, <li>, <tr>, etc.). From a translation
tool viewpoint there is no reason to treat them differently from the inline
codes of any other format.


I hope this is helpful.

Kind regards,
-yves

References:
- using xliff to translate html
  - From: "Brian Stell" <brianstell00@aol.com>