To: "ysavourel@translate.com" <ysavourel@translate.com>
Date: Mon, 28 Jan 2002 08:23:36 -0800
Hi Yves;
Responding to your comments:
>I guess we should
open the discussion on what schemas to use besides the DTD.
I feel that's appropriate.
>Could you explicit what you call
'EXTRACT' and 'MERGE'?
Yes. 'EXTRACT' is the operation of taking the translatable text strings out of the source document (which should be SGML in some form, hopefully XML) and putting them into a single file which is then made ready for the translators. This would include all text strings that had been machine translated 'successfully'. The source document structure is saved. 'MERGE' is the operation of putting the translated strings back into the saved source document structure, which can then be used to generate the final output document.
>
************************************************************
ysavourel@translate.com wrote on 1/23/02 11:56:16 PM
************************************************************
Thanks for posting those comments David.
I'll try to answer a few of them. Since we have not worked together yet, there
may be some terms we don't use the same way: if I'm not clear, please let me
know and I'll try to re-formulate.
> 1. document validators - we should have support for W3C Schema, Schematron
and RELAX NG, as well as DTD.
I agree that we should have different ways to specify XLIFF so different
people using different tools can have easy access to it. We can probably
generate some of those schemas (or at least a base to work from) from the
DTD using converters as Christain showed me yesterday. I guess we should
open the discussion on what schemas to use besides the DTD.
This appears to have been considered in the creation of the , and elements. Is that correct?
>
-----
> 2. Does not have entities for EXTRACT and MERGE.
-----
I'm not sure I understand the note. Could you explain what you mean by
'EXTRACT' and 'MERGE'? Maybe the following description of XLIFF with regard
to extraction and merging will help:
An XLIFF document stores initially the result of an extraction. The original
input is split into 2 main streams: the localizable data are in the content
of and in various attributes (coord, etc.). Some original code can
also be encapsulated within using all the inline elements: ,
, , . The rest of the non-localizable data is stored in the
"skeleton". The skeleton is a separate file that can be either referenced
from the XLIFF document (using the element with an
element), or embedded in a element (still in the
element).
The translated file is reconstructed (merged) from the skeleton (wherever
it is located) and the content of the elements (which have been
added during the localization process).
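
To make the extract/merge round trip described above concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the input is a toy key=value file, the "%%%n%%%" placeholder syntax and the function names extract() and merge() are invented for this example, and none of it is defined by XLIFF itself.

```python
import re

# Hypothetical placeholder syntax used inside the skeleton (not part of XLIFF).
PLACEHOLDER = "%%%{}%%%"


def extract(original: str):
    """Split a toy key=value file into a skeleton and a dict of translatable units."""
    units = {}
    skeleton_lines = []
    for number, line in enumerate(original.splitlines(), start=1):
        key, sep, value = line.partition("=")
        if sep and value.strip():
            # Localizable data: keep the text, leave a placeholder in the skeleton.
            unit_id = str(number)
            units[unit_id] = value
            skeleton_lines.append(key + sep + PLACEHOLDER.format(unit_id))
        else:
            # Non-localizable data goes to the skeleton unchanged.
            skeleton_lines.append(line)
    return "\n".join(skeleton_lines), units


def merge(skeleton: str, targets: dict) -> str:
    """Rebuild the translated file by substituting targets into the skeleton."""
    return re.sub(r"%%%(\d+)%%%", lambda m: targets[m.group(1)], skeleton)
```

For example, extracting "greeting=Hello" leaves "greeting=" plus a placeholder in the skeleton and hands "Hello" to the translator; merge() then puts the translation back in the same position.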
>XLIFF uses any appropriate encoding as defined by XML specs.
The mechanism to indicate the encoding used in the translated XLIFF document
is the standard XML encoding declaration.
I have seen many problems arise in the merge process when character maps have been unexpectedly encoded into the human-translated text. This has happened especially when the translator was using an Apple Mac and MS Word, for whatever reason, with or without the Trados WorkBench tool. It would help if the character map information were captured and made part of the metadata of the file; the XML encoding declaration does not appear to handle this. This applies to part (b) of your comment as well, and I urge that we look at it more closely at this stage, so that a handler is included in the spec.
>
-----
> 3. Does not have entities for character map used in saved file (from
translation).
-----
I see two different meanings here, I'll re-phrase the comment two different
ways to see which one (if any) is the right one:
a) "XLIFF doesn't have a way to indicate what encoding has been used for the
translated text."
That's true: XLIFF uses any appropriate encoding as defined by XML specs.
The mechanism to indicate the encoding used in the translated XLIFF document
is the standard XML encoding declaration.
b) "XLIFF doesn't have a way to indicate what encoding should be used for
the translated text when merging the text into the original format."
That's also true: the assumption (maybe incorrect) is that, knowing which
type of format, which language and which platform the text is targeted for,
the merger tool is responsible for using the appropriate encoding (possibly
with the help of the end-user). This is consistent with how most current
localization tools work. We may need to look at this more closely.
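
As an illustration of points (a) and (b), the sketch below reads the encoding from the standard XML encoding declaration and then lets a merger tool re-encode the translated text for the output format. It is a rough sketch only: the function names are hypothetical, the declaration check assumes an ASCII-compatible encoding, and the choice of output encoding is left to the caller, as described in (b).

```python
import re
import xml.etree.ElementTree as ET


def declared_encoding(xml_bytes: bytes) -> str:
    """Return the encoding named in the XML declaration, defaulting to UTF-8.

    Assumes an ASCII-compatible encoding so the declaration can be read as bytes.
    """
    match = re.match(rb'<\?xml[^>]*encoding=["\']([A-Za-z0-9._-]+)["\']', xml_bytes)
    return match.group(1).decode("ascii") if match else "utf-8"


def target_texts_as_bytes(xml_bytes: bytes, output_encoding: str) -> list:
    """Parse the document and encode each target text for the output format.

    The parser honours the XML encoding declaration; the output encoding is
    chosen separately by the merger tool (or the end-user).
    """
    root = ET.fromstring(xml_bytes)
    return [t.text.encode(output_encoding) for t in root.iter("target") if t.text]
```

The point of the sketch is that the two encodings are independent: the first is carried by the XML document, the second is a decision made at merge time and is not recorded anywhere in the file.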
>multilingual files even cause problems in the process: most
of the time you have to split the file per translator anyway.
My experience is different. I was involved in very large-scale production of translated documents with many (up to 26) target languages per project. All of the languages operated off the same 'EXTRACT' (file split). I suggest that this accounts for the bulk of commercial translation work, at least at the end of the market where producers will be motivated to purchase new technologies that increase throughput and hence offer a quick ROI. For that reason I urge that the XLIFF spec have this capability in its first iteration.
>Maybe someone from the "Status-Flags" sub-group can
address this and give example?
Who are they? Will they please identify themselves when sending an explanation?
>non-UTF-8 imported entities; eg. SAE Gen, etc.
I have that posted (url="http://business.virgin.net/david.leland/markup/sgml/saegen"). I can email the others, or post them. They are especially used in the automotive industry, a large consumer of translation services.
>You lost me with "SIO"
Sorry, one forgets how proprietary, or at least parochial, a field of business really does become. In the automotive translation business, that's 'storage information object'. It usually refers to an illustration, of which there are hundreds for any given project.
One example of an SIO is this:
SIO example.
SGML_id="n128978"Frozen: "N" "1999","X200","18","000","genproc"
I hope this progresses the discussion. I've been offline for a bit, and shall try to catch up with all the comments.
Regards,
David L
-----
> 4. Target lang should be target+ in 'ELEMENT trans-unit', unless that's
not intended for the whole job. [Inquiry: what is 'ELEMENT trans-unit'
intended to handle?]
-----
The element is the place where the source and one translation
of a given localizable item is stored. An 'item' is not defined beyond being
(most of the time) a run of translatable text. For example it can be a
string from a Windows RC stringtable group, the value of a key/value pair of
a Java properties file, the content of a
element in HTML, the value of a
alt attribute in HTML, etc.
Actually a is allowed to have empty and . This
is to handle cases where the localizable data is not text but other
information, for example the coordinates of a control, which need to be
represented in case some tools provide capabilities such as resizing.
XLIFF does not explicitly address anything related to segmentation.
XLIFF is intended to handle a source language and ONE target language in
each element. This is a decision that was made very early in the
design of the format, and the structure of XLIFF reflects that (otherwise we
wouldn't have that / pair for example). The main reason (as
far as I can recall) was that the advantages of having multilingual files
were not big enough to be worth the complication. In addition it seems that,
in some cases, multilingual files even cause problems in the process: most
of the time you have to split the file per translator anyway. I'm sure others
will be able to elaborate on why a simple bilingual architecture was chosen
rather than a multilingual one.
The use of "target?" (zero or one target) rather than "target+" (one or more targets)
is there to allow with only a source text. I think it was
"target?" at the beginning and we changed it to "target+". Comments anyone?
-----
> 5. Does not have QC/Proofer captured.
-----
I think this is captured in the element. That element is there to
allow tools to flag the progress of the document through the localization
process, and even keep track of the changes through links using the
phase-name attribute. Maybe someone from the "Status-Flags" sub-group can
address this and give example?
-----
> 6. Will need to support non-UTF-8 imported entities (eg. SAE Gen, Fordsym,
TEI)
-----
I'm not sure if I understand this well. Could you elaborate and maybe give
an example?
-----
> 7. Should support SIO, and have more atts needed for inline elements.
-----
Same here. You lost me with "SIO" :) Does it stand for "Serial Input
Output", "Shift-In (shift)-Out"? Could you elaborate and maybe give a few
examples.
Thanks for taking the time to go through this David. Hopefully others will be
able to elaborate on my answers and possibly address the points I failed
(miserably) to understand.
Kind regards,
-yves