
Subject: [xliff] comments on dtd
From: Yves Savourel <ysavourel@translate.com>
To: xliff@lists.oasis-open.org
Date: Mon, 28 Jan 2002 15:03:59 -0700

Hi David and all,

I've grouped my comments on several of your last emails together to make it
easier.


-----
>YS> Could you explain what you mean by 'EXTRACT' and 'MERGE'?

>DL> Yes. 'EXTRACT' is the operation of taking the translatable text strings
out of the source document - which should be SGML in some form, hopefully
XML, and putting them into a singular file which will be made ready for the
translators. This would include all text strings that had been machine
translated 'successfully'. The source document structure is saved. 'MERGE'
is the operation of getting the translatable strings back into the source
document structure, which can then be used to generate the final output
document.

YS> OK, so we are talking about the same thing. With a slight addition: in
the case of XLIFF the extraction/merge is not just for source documents that
are in SGML/XML, but also for resource files, properties files, database
fields, etc.
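
To make the extract/merge round trip concrete for a non-XML source, here is a minimal sketch for a properties-style file. The function names and the `%%id%%` placeholder syntax are assumptions made for illustration only; they are not part of XLIFF, and a real tool would write the extracted units into an XLIFF document rather than a Python dict.

```python
# Hypothetical illustration of EXTRACT and MERGE on a properties-style file.
# The %%id%% placeholder convention is invented for this sketch.

def extract(properties_text):
    """Split the input into a skeleton (structure) and units (id -> source text)."""
    skeleton_lines = []
    units = {}
    for line in properties_text.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith("#") and "=" in line:
            key, value = line.split("=", 1)
            unit_id = str(len(units) + 1)
            units[unit_id] = value.strip()
            # The skeleton keeps the key and a placeholder for the text.
            skeleton_lines.append(f"{key}=%%{unit_id}%%")
        else:
            # Comments and blank lines are structure, not translatable text.
            skeleton_lines.append(line)
    return "\n".join(skeleton_lines), units


def merge(skeleton, translations):
    """Put translated strings back into the saved structure."""
    merged = skeleton
    for unit_id, text in translations.items():
        merged = merged.replace(f"%%{unit_id}%%", text)
    return merged


source = "# UI strings\ngreeting=Hello\nfarewell=Goodbye"
skeleton, units = extract(source)
merged = merge(skeleton, {"1": "Bonjour", "2": "Au revoir"})
```

The same skeleton can be merged once per target language, which is why one extraction can serve many translations.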


-----
>YS> XLIFF uses any appropriate encoding as defined by XML specs. The
mechanism to indicate the encoding used in the translated XLIFF document is
the standard XML encoding declaration.

>DL> I have seen many problems arise in the merge process, when character
maps have been unexpectedly encoded into the human-translated text. This has
especially happened when the translator was using an Apple Mac machine, and
has used MSWord for whatever reason, whether or not whilst using the Trados
WorkBench tool. It would help if the character map information were captured
and made part of the metadata of the file. It does not appear that the XML
encoding declarations handle this. This applies to part (b) of your comment
as well, and I urge that it be looked at more closely at this stage, with the
result that a handler is included in the spec.

YS> I'm not sure these types of problems could be solved by having an
attribute in the XLIFF document stating what encoding should/will be used to
merge the translated text.
- The tool used to open the XLIFF document and present it to the translator
is responsible for doing the relevant conversion from the encoding used by
the XLIFF file to whatever encoding is appropriate in the translation
environment. It is also responsible for saving the translated text back in
the XLIFF document correctly.
- Then, the tool used to merge the translation back into the original format
is responsible for selecting (or having the user select) the appropriate
encoding for the merged file.
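
The division of responsibilities above can be sketched as follows. This is only an illustration under stated assumptions: the XLIFF fragment is simplified, the function names are hypothetical, and Python's standard `xml.etree.ElementTree` parser stands in for whatever tool reads the file. The point is that the parser honours the XML encoding declaration on input, and the merge step picks the output encoding independently.

```python
import xml.etree.ElementTree as ET

# Simplified XLIFF-like document stored as iso-8859-1 bytes (illustrative only).
xliff_bytes = (
    '<?xml version="1.0" encoding="iso-8859-1"?>'
    '<xliff version="1.0"><file original="menu.txt" source-language="en" '
    'target-language="fr" datatype="plaintext"><body>'
    '<trans-unit id="1"><source>Coffee</source><target>Café</target></trans-unit>'
    '</body></file></xliff>'
).encode("iso-8859-1")


def read_targets(data):
    """Parse the bytes; the parser decodes them using the encoding declaration."""
    root = ET.fromstring(data)
    return {tu.get("id"): tu.findtext("target") for tu in root.iter("trans-unit")}


def encode_merged(text, output_encoding):
    """The merging tool (or its user) chooses the encoding of the merged file."""
    return text.encode(output_encoding)


targets = read_targets(xliff_bytes)
merged_text = "item=" + targets["1"]
as_utf8 = encode_merged(merged_text, "utf-8")
as_cp1252 = encode_merged(merged_text, "cp1252")
```

Once parsed, the translated text is plain Unicode, so the encoding of the XLIFF file and the encoding of the merged output need not match.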


-----
>YS> Multilingual files even cause problems in the process: most of the time
you have to split the file per translator anyway.

>DL> My experience is different. I was involved in very large scale
production of translated documents with many (up to 26) target languages per
project. They all operated off the same 'EXTRACT' (file split). I suggest
that this is the bulk of the use of commercial translation, at least at the
end where producers will be motivated to purchase new technologies that
facilitate increased through-put, and hence represent quick ROI.

YS> Working on large projects with many languages is indeed very common.
XLIFF allows you to work on all of them from the same extraction. What I
meant to say was that having the translations in the same document may not
always be efficient: they have to be split during translation anyway.


-----
>DL> non-UTF-8 imported entities; eg. SAE Gen, etc. I have that posted
(url="http://business.virgin.net/david.leland/markup/sgml/saegen"). I can
email the others, or post them. They are especially used in the automotive
industry, a large consumer of translation services.

YS> Thanks for the example, David: it clarifies things. The way XLIFF would
deal with entity references that are not Unicode characters would probably
be to use an inline element. For example, original data such as:
"<para>Capacity: 5 &litre;</para>"
would be coded in XLIFF something like:
<source>Capacity: 5 <ph id="1">&amp;litre;</ph></source>
or
<source>Capacity: 5 <x id="1"/></source>
(with the actual data in the skeleton file).
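
As a rough sketch of the first form, an extraction filter could wrap each non-predefined entity reference in a `<ph>` element. The function name and the regular expression are assumptions for illustration; a real filter would work from the parsed SGML/XML rather than from raw text.

```python
import re

# The five entities predefined by XML are left alone.
PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}
# Named entity references only; numeric references (&#...;) do not match.
ENTITY_REF = re.compile(r"&([A-Za-z_][\w.-]*);")


def protect_entities(text):
    """Wrap each non-predefined entity reference in a numbered <ph> element."""
    counter = 0

    def repl(match):
        nonlocal counter
        name = match.group(1)
        if name in PREDEFINED:
            return match.group(0)
        counter += 1
        # Escape the ampersand so the reference survives as literal text.
        return f'<ph id="{counter}">&amp;{name};</ph>'

    return ENTITY_REF.sub(repl, text)


source_element = "<source>" + protect_entities("Capacity: 5 &litre;") + "</source>"
```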


-----
>YS> You lost me with "SIO"

>DL> Sorry, one forgets how proprietary, or at least parochial, a field of
business really does become. In the automotive translation business, that's
'storage information object'. It usually refers to an illustration, of which
there are hundreds for any given project. One example of an SIO is this:
SIO example. SGML_id="n128978"Frozen: "N" "1999","X200","18","000","genproc"

YS> Illustrations, graphics, etc. that are embedded in the flow of a text
would be treated as inline codes as well. If they are part of a document as
external data, there is also a way to support them, using <bin-unit> etc.,
basically as you would do with a bitmap in a resource file.


-----
>YS> I think we didn't make [the use of xml:lang] required because you could
avoid having it for one of the two languages in the document by specifying
the xml:lang at the level and it would be redundant. But my memory is fuzzy
on that topic. Others may recall the discussion we had on this.

>DL> I suggest that the name for the element should be lengthened to
something like 'xml:source-lang' or 'xml:target-lang' [or more appropriately
'xml:target-lang-01'], to avoid the redundancy problem.

YS> I'm not sure the W3C would look kindly on our TC deciding on new
attributes for the reserved xml namespace :)


-----
>JR> The assumption is that the target and source will both be encoded the
same. Usually in UTF-8. However, some mechanism for indicating a different
encoding in the target may be useful.

>JL> It's actually not possible to use the same character map for many
languages. If you were to presume ANSI 1951 for all languages, you would
limit XLIFF's application to ISOLatin 1 - 4 character sets. That bars
Arabic, Chinese, Thai, Japanese and Korean, as well as the Cyrillic
character sets, and other Slavic languages. These languages represent very
important markets for producers of goods, and they need to render the
translated text for those languages. I urge that the spec would have the
capability to deal with this in its first iteration.

YS> XLIFF doesn't prevent you from using any encoding. Any number of
languages can be represented in an XLIFF document encoded as iso-8859-1, even
Japanese, etc. The unsupported characters are simply converted to NCRs
(numeric character references), and this text can then be converted to
whatever encoding is appropriate in the merged document. See for example a
Japanese XLIFF document at: http://www.opentag.com/xliff.htm#Examples and a
generic XML file with many different languages at
http://www.opentag.com/xmli18n/Chap_02/MultiLang.xml. (In this case that
file is in utf-8, but it could be in any encoding and the characters would
still be correct.)
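
A small sketch of the NCR mechanism, using Python's standard `xmlcharrefreplace` error handler as a stand-in for what an XLIFF writer would do. The function name and the bare `<target>` wrapper are illustrative assumptions, not a complete XLIFF document.

```python
import xml.etree.ElementTree as ET


def to_latin1_with_ncrs(text):
    """Encode as iso-8859-1; characters outside it become numeric references.

    Assumes the text has no literal '&' or '<' (those need XML escaping first).
    """
    return text.encode("iso-8859-1", errors="xmlcharrefreplace")


japanese = "日本語"
encoded = to_latin1_with_ncrs(japanese)

# Any XML parser resolves the references back to the original characters.
doc = b'<?xml version="1.0" encoding="iso-8859-1"?><target>' + encoded + b"</target>"
recovered = ET.fromstring(doc).text

# The recovered text can then be written in whatever encoding the merge needs.
merged_bytes = recovered.encode("shift_jis")
```

Here `encoded` is pure ASCII (`&#26085;&#26412;&#35486;`), so it is valid in an iso-8859-1 file, yet no information about the Japanese text is lost.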


Kind regards,
-yves
References:
- Re: RE: [xliff] comments on dtd
  - From: dcpleland@ftnetwork.com