xliff message

Subject: Re: RE: [xliff] comments on dtd

From: dcpleland@ftnetwork.com
To: "ysavourel@translate.com" <ysavourel@translate.com>
Date: Mon, 28 Jan 2002 08:23:36 -0800

Hi Yves,

Responding to your comments:

>I guess we should open the discussion on what schemas to use besides the DTD.

I feel that's appropriate.

>Could you explain what you call 'EXTRACT' and 'MERGE'?

Yes. 'EXTRACT' is the operation of taking the translatable text strings out of the source document (which should be SGML in some form, hopefully XML) and putting them into a single file, which is then made ready for the translators. This would include all text strings that had been machine translated 'successfully'. The source document structure is saved. 'MERGE' is the operation of getting the translatable strings back into the source document structure, which can then be used to generate the final output document.

>
************************************************************
ysavourel@translate.com wrote on 1/23/02 11:56:16 PM
************************************************************
Thanks for posting those comments, David. I'll try to answer a few of them. Since we have not worked together yet, there may be some terms we don't use the same way: if I'm not clear, please let me know and I'll try to reformulate.

> 1. document validators - we should have support for W3C Schema, Schematron and RELAX NG, as well as DTD.

I agree that we should have different ways to specify XLIFF so that people using different tools can have easy access to it. We can probably generate some of those schemas (or at least a base to work from) from the DTD using converters, as Christian showed me yesterday. I guess we should open the discussion on what schemas to use besides the DTD.

This appears to have been considered in the creation of the , and elements. Is that correct?

> -----
> 2. Does not have entities for EXTRACT and MERGE.
-----
I'm not sure I understand the note. Could you explain what you call 'EXTRACT' and 'MERGE'? Maybe the following description of XLIFF with regard to extraction and merging will help:

An XLIFF document initially stores the result of an extraction.
The original input is split into two main streams: the localizable data are in the content of and in various attributes (coord, etc.). Some original code can also be encapsulated within using all the inline elements: , , , . The rest of the non-localizable data is stored in the "skeleton". The skeleton is a separate file that can be either referenced from the XLIFF document (using the element with an element), or embedded in a element (still in the element). The translated file is reconstructed (merged) from the skeleton (wherever it is located) and the content of the elements (which have been added during the localization process).

>XLIFF uses any appropriate encoding as defined by XML specs. The mechanism to indicate the encoding used in the translated XLIFF document is the standard XML encoding declaration.

I have seen many problems arise in the merge process when character maps have been unexpectedly encoded into the human-translated text. This has happened especially when the translator was using an Apple Mac machine and used MSWord for whatever reason, whether or not the Trados WorkBench tool was also in use. It would help if the character map information were captured and made part of the metadata of the file; it does not appear that the XML encoding declaration handles this. This applies to part (b) of your comment as well, and I urge that we look at it more closely at this stage, so that a handler is included in the spec.

> -----
> 3. Does not have entities for character map used in saved file (from translation).
-----
I see two different meanings here. I'll rephrase the comment in two different ways to see which one (if any) is the right one:

a) "XLIFF doesn't have a way to indicate what encoding has been used for the translated text." That's true: XLIFF uses any appropriate encoding as defined by the XML specs. The mechanism to indicate the encoding used in the translated XLIFF document is the standard XML encoding declaration.
b) "XLIFF doesn't have a way to indicate what encoding should be used for the translated text when merging the text into the original format." That's also true: the assumption (maybe incorrect) is that, knowing which type of format, which language and which platform the text is targeted for, the merger tool is responsible for using the appropriate encoding (possibly with the help of the end-user). This is consistent with how most current localization tools work. We may need to look at this more closely.

>multilingual files even cause problems in the process: most of the time you have to split the file per translator anyway.

My experience is different. I was involved in very large scale production of translated documents with many (up to 26) target languages per project. They all operated off the same 'EXTRACT' (file split). I suggest that this accounts for the bulk of commercial translation work, at least at the end where producers will be motivated to purchase new technologies that increase throughput and hence offer a quick ROI. I urge that the XLIFF spec have this capability in its first iteration for that reason.

>Maybe someone from the "Status-Flags" sub-group can address this and give example?

Who are they? Will they please identify themselves when sending an explanation?

>non-UTF-8 imported entities; eg. SAE Gen, etc.

I have that posted (url="http://business.virgin.net/david.leland/markup/sgml/saegen"). I can email the others, or post them. They are especially used in the automotive industry, a large consumer of translation services.

>You lost me with "SIO"

Sorry, one forgets how proprietary, or at least parochial, a field of business becomes. In the automotive translation business, that's 'storage information object'. It usually refers to an illustration, of which there are hundreds for any given project. One example of an SIO is this: SIO example.
SGML_id="n128978" Frozen: "N" "1999","X200","18","000","genproc"

I hope this progresses the discussion. I've been offline for a bit, and shall try to catch up with all the comments.

Regards,
David L

-----
> 4. Target lang should be target+ in 'ELEMENT trans-unit', unless that's not intended for the whole job. [Inquiry: what is 'ELEMENT trans-unit' intended to handle?]
-----
The <trans-unit> element is the place where the source and one translation of a given localizable item are stored. An 'item' is not defined beyond being (most of the time) a run of translatable text. For example, it can be a string from a Windows RC stringtable group, the value of a key/value pair of a Java properties file, the content of an element in HTML, the value of an alt attribute in HTML, etc.

Actually a <trans-unit> is allowed to have empty <source> and <target>. This is to handle cases where the localizable data is not text but other information, such as the coordinates of a control: it needs to be represented in case some tools provide capabilities such as resizing. XLIFF does not explicitly address anything related to segmentation.

XLIFF is intended to handle a source language and ONE target language in each element. This is a decision that was made very early in the design of the format, and the structure of XLIFF reflects that (otherwise we wouldn't have that <source>/<target> pair, for example). The main reason (as far as I can recall) was that the advantages of multilingual files were not big enough to be worth the complication. In addition it seems that, in some cases, multilingual files even cause problems in the process: most of the time you have to split the file per translator anyway. I'm sure others will be able to elaborate on why a simple bilingual architecture was chosen rather than a multilingual one.

The use of "target?" (zero or one target) rather than "target+" (one or more targets) is there to allow a <trans-unit> with only a source text. I think it was "target?" at the beginning and we changed it to "target+". Comments anyone?

-----
> 5. Does not have QC/Proofer captured.
-----
I think this is captured in the element. That element is there to allow tools to flag the progress of the document through the localization process, and even keep track of the changes through links using the phase-name attribute. Maybe someone from the "Status-Flags" sub-group can address this and give an example?

-----
> 6. Will need to support non-UTF-8 imported entities (eg. SAE Gen, Fordsym, TEI)
-----
I'm not sure I understand this well. Could you elaborate and maybe give an example?

-----
> 7. Should support SIO, and have more atts needed for inline elements.
-----
Same here.
You lost me with "SIO" :) Does it stand for "Serial Input Output", "Shift-In (shift)-Out"? Could you elaborate and maybe give a few examples?

Thanks for taking the time to go through this, David. Hopefully others will be able to elaborate on my answers and possibly address the points I failed (miserably) to understand.

Kind regards,
-yves
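
The extraction and merging discussed in this message (localizable text pulled out into translation units, everything else kept in a skeleton, and the translated file rebuilt from both) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the mechanism XLIFF defines: the input is assumed to be a Java-properties-style text, and the names `extract`, `merge` and the `%%%id%%%` placeholder syntax are hypothetical, chosen for this sketch.

```python
import re

# Hypothetical placeholder syntax for this sketch only; it is an assumption,
# not something taken from the XLIFF DTD or from any tool.
PLACEHOLDER = "%%%{}%%%"
PLACEHOLDER_RE = re.compile(r"%%%(\d+)%%%")

# A line counts as localizable when it looks like "key=value" and is not a
# comment (lines starting with "#" or "!") or a blank line.
KEY_VALUE_RE = re.compile(r"^(\s*[^#!=\s][^=]*=)(.*)$")


def extract(properties_text):
    """Split the input into a skeleton and a list of translation units."""
    skeleton_lines = []
    units = []  # each unit: {"id": str, "source": str, "target": None or str}
    for line in properties_text.splitlines():
        match = KEY_VALUE_RE.match(line)
        if match is None:
            # Non-localizable data stays verbatim in the skeleton.
            skeleton_lines.append(line)
            continue
        unit_id = str(len(units) + 1)
        units.append({"id": unit_id, "source": match.group(2), "target": None})
        # The key stays in the skeleton; the value is replaced by a marker.
        skeleton_lines.append(match.group(1) + PLACEHOLDER.format(unit_id))
    return "\n".join(skeleton_lines), units


def merge(skeleton, units):
    """Rebuild the output file from the skeleton and the unit targets."""
    by_id = {unit["id"]: unit for unit in units}

    def fill(match):
        unit = by_id[match.group(1)]
        # Fall back to the source text when no translation was supplied.
        return unit["target"] if unit["target"] is not None else unit["source"]

    return PLACEHOLDER_RE.sub(fill, skeleton)
```

With one source language and one target per unit, as in the bilingual design described above, a multi-language job would run `merge` once per translated set of units against the same skeleton.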


Follow-Ups:
- [xliff] comments on dtd
  - From: Yves Savourel <ysavourel@translate.com>