xliff message



Subject: Re: RE: [xliff] comments on dtd


Hi John:

On your responses that are not embodied in Yves':

> We could become bogged down in a discussion of schemas when the work at hand is to approve or improve the current spec; with the multiplicity of schemas we could spend months on this topic alone. It is a good sub-committee discussion.

I suggest that we be agnostic, and deal with W3C Schema, RELAX NG and Schematron. That way there is no need to discuss and decide on which to favour.

> The assumption is that the target and source will both be encoded the same. Usually in UTF-8. However, some mechanism for indicating a different encoding in the target may be useful.

It is actually not possible to use the same character map for many languages. If you were to presume ANSI 1951 for all languages, you would limit XLIFF's application to the ISO Latin 1-4 character sets. That bars Arabic, Chinese, Thai, Japanese and Korean, as well as the Cyrillic character sets and other Slavic languages. These languages represent very important markets for producers of goods, who need to render the translated text for those languages. I urge that the spec have the capability to deal with this in its first iteration.

> The strongest argument against multilingual XLIFF (more than one target language) was the versioning problem. It would be too difficult to keep the languages in sync.

I don't see a problem. Each language translation job becomes a separate project once the translatable text strings have been extracted. They are merged back individually into the document structure file to produce the finished document in each language. Please see my communication with Yves on the issue of multiple target languages.

Regards,
David Leland

***********************************************************
jreid@novell.com wrote on 1/24/02 5:54:34 AM
************************************************************

Hi All,

A review of the spec may clear up some of the points better than this discussion.
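For what it's worth, the merge step David describes (translated strings folded back into the document structure file) can be sketched in a few lines of Python. The %%%id%%% placeholder convention, the file contents, and the `merge` function are all invented for illustration; real extraction tools each define their own skeleton format.

```python
import xml.etree.ElementTree as ET

# A minimal bilingual XLIFF file (invented content) and a skeleton in
# which %%%id%%% marks where each translated segment belongs.
XLIFF = """<xliff version="1.0">
  <file original="hello.txt" source-language="en" target-language="fr"
        datatype="plaintext">
    <body>
      <trans-unit id="1">
        <source>Hello world</source>
        <target>Bonjour le monde</target>
      </trans-unit>
    </body>
  </file>
</xliff>"""

SKELETON = "Greeting: %%%1%%%\n"

def merge(xliff_text: str, skeleton: str) -> str:
    """Rebuild the translated file from the skeleton and <target> content."""
    root = ET.fromstring(xliff_text)
    out = skeleton
    for tu in root.iter("trans-unit"):
        target = tu.find("target")
        if target is not None and target.text:
            out = out.replace("%%%" + tu.get("id") + "%%%", target.text)
    return out

print(merge(XLIFF, SKELETON))  # Greeting: Bonjour le monde
```

Running the same skeleton against a German or Japanese XLIFF file would produce the finished document for that language, which is David's point: once extraction is done, each target language is an independent merge.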
I understand it wasn't easily available before our meeting and some may not have had a chance to review it. I would hope that those who haven't already done so will read it. It is now posted at our TC website on OASIS. I would like to elaborate a little on Yves's answers, as follows.

>>> Yves Savourel 1/23/02 4:56:16 PM >>>

Thanks for posting those comments David. I'll try to answer a few of them. Not having worked together yet, there may be some terms we don't use the same way; if I'm not clear, please let me know and I'll try to re-formulate.

> 1. Document validators - we should have support for W3C Schema, Schematron and RELAX NG, as well as DTD.

I agree that we should have different ways to specify XLIFF so that different people using different tools can have easy access to it. We can probably generate some of those schemas (or at least a base to work from) from the DTD using converters, as Christian showed me yesterday. I guess we should open the discussion on what schemas to use besides the DTD.

This isn't a weakness in the spec; the spec simply describes the dictionary. The DTD and schema are artifacts of the spec. It is in the charter to create a schema; the schema type is unspecified. We could become bogged down in a discussion of schemas when the work at hand is to approve or improve the current spec; with the multiplicity of schemas we could spend months on this topic alone. It is a good sub-committee discussion.

-----
> 2. Does not have entities for EXTRACT and MERGE.
-----

I'm not sure I understand the note. Could you explain what you call 'EXTRACT' and 'MERGE'? Maybe the following description of XLIFF with regard to extraction and merging will help:

An XLIFF document initially stores the result of an extraction. The original input is split into two main streams: the localizable data are in the content of <source> and in various attributes (coord, etc.). Some original code can also be encapsulated within <source> using the inline elements: <bpt>, <ept>, <ph> and <it>.
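As a concrete (invented) illustration of that encapsulation, here is a tiny <source> with a paired native code split across <bpt>/<ept> and a standalone code in <ph>, pulled apart with Python's standard ElementTree:

```python
import xml.etree.ElementTree as ET

# <bpt>/<ept> wrap the two halves of a paired native code; <ph> wraps a
# standalone one. The original markup is escaped so it survives in XML.
SRC = ('<source>Click <bpt id="1">&lt;b&gt;</bpt>here'
       '<ept id="1">&lt;/b&gt;</ept><ph id="2">&lt;br/&gt;</ph></source>')

src = ET.fromstring(SRC)
codes = [el.text for el in src]  # the native codes, hidden from translators
print(codes)                     # ['<b>', '</b>', '<br/>']
```

The point of the encapsulation is that a translation tool can move or protect these codes without ever needing to understand the original file format.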
The rest of the non-localizable data is stored in the "skeleton". The skeleton is a separate file that can either be referenced from the XLIFF document (using the <skl> element with an <external-file> element) or embedded in an <internal-file> element (still inside <skl>). The translated file is reconstructed (merged) from the skeleton (wherever it is located) and the content of the <target> elements (which have been added during the localization process).

Specific extract and merge entities/elements have purposely been left undefined. The method of obtaining localizable data in the XLIFF file varies by publisher. Some use databases which contain the localizable data and some will use skeleton files. Others may use yet another system. Because we don't want to impose a process on the publisher, we've tried to allow for any process that can produce valid XLIFF. This gives the publisher a great deal of flexibility.

There are elements defined which are available to the publisher for these purposes. From the spec, "The element contains tool-specific information used in combining the data with the skeleton file or storing the data in a repository." It in turn contains a child element which holds the actual tool-specific data. There is also the ts attribute, available on most elements; from the spec, "The ts attribute allows you to include short data understood by a specific toolset." In addition, another element allows for information of this nature as well.

-----
> 3. Does not have entities for character map used in saved file (from translation).
-----

I see two different meanings here, so I'll re-phrase the comment two different ways to see which one (if any) is the right one:

a) "XLIFF doesn't have a way to indicate what encoding has been used for the translated text." That's true: XLIFF uses any appropriate encoding as defined by the XML specs. The mechanism to indicate the encoding used in the translated XLIFF document is the standard XML encoding declaration.
b) "XLIFF doesn't have a way to indicate what encoding should be used for the translated text when merging the text into the original format." That's also true: the assumption (maybe incorrect) is that, knowing which type of format, which language and which platform the text is targeted for, the merger tool is responsible for using the appropriate encoding (possibly with the help of the end-user). This is consistent with how most current localization tools work.

We may need to look at this more closely. Do you mean XLIFF does not have a mechanism for the target's encoding to differ from the source's? If so, that is true. The assumption is that the target and source will both be encoded the same, usually in UTF-8. However, some mechanism for indicating a different encoding in the target may be useful.

-----
> 4. Target lang should be target+ in 'ELEMENT trans-unit', unless that's not intended for the whole job. [Inquiry: what is 'ELEMENT trans-unit' intended to handle?]
-----

The <trans-unit> element is the place where the source and one translation of a given localizable item are stored. An 'item' is not defined beyond being (most of the time) a run of translatable text. For example, it can be a string from a Windows RC stringtable group, the value of a key/value pair in a Java properties file, the content of an element in HTML, the value of an alt attribute in HTML, etc. Actually, a <trans-unit> is allowed to have empty <source> and <target>. This is to handle cases where the localizable data is not text but other information: the coordinates of a control, for example, need to be represented in case some tools provide capabilities such as resizing. XLIFF does not explicitly address anything related to segmentation.

XLIFF is intended to handle a source language and ONE target language in each <trans-unit> element. This is a decision that was made very early in the design of the format, and the structure of XLIFF reflects that (otherwise we wouldn't have that <source>/<target> pair, for example). The main reason (as far as I can recall) was that the advantages of having multilingual files were not big enough to be worth the complication. In addition it seems that, in some cases, multilingual files even cause problems in the process: most of the time you have to split the file per translator anyway. I'm sure others will be able to elaborate on why a simple bilingual architecture was chosen rather than a multilingual one. The use of "target?" (zero or one target) rather than "target+" (one or more targets) is there to allow a <trans-unit> with only a source text. I think it was "target+" at the beginning and we changed it to "target?". Comments anyone?

The spec properly allows only zero or one target for any <trans-unit>; the DTD has it as target?. Alternate translations can be stored in the <alt-trans> element, which contains target+. The targets in the alt-trans can come from a variety of places, including translator versions and TMs. There is only one allowable target in a trans-unit because that is considered the current or final version. The strongest argument against multilingual XLIFF (more than one target language) was the versioning problem: it would be too difficult to keep the languages in sync.
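A quick fragment makes the target?/alt-trans arrangement concrete: one current <target> directly under <trans-unit>, any number of candidates under <alt-trans>. The content below is an invented example, not taken from the spec:

```python
import xml.etree.ElementTree as ET

# One current translation per unit; alternatives live in <alt-trans>.
TU = """<trans-unit id="42">
  <source>File not found</source>
  <target>Fichier introuvable</target>
  <alt-trans>
    <source>File not found</source>
    <target>Fichier non trouvé</target>
  </alt-trans>
</trans-unit>"""

tu = ET.fromstring(TU)
final = tu.find("target")                    # zero or one per trans-unit
candidates = tu.findall("alt-trans/target")  # any number of alternatives
print(final.text, len(candidates))           # Fichier introuvable 1
```

A merge tool looks only at the direct <target>; the alt-trans entries are there for translators and TM tools, which sidesteps the multilingual versioning problem discussed above.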
-----
> 5. Does not have QC/Proofer captured.
-----

I think this is captured in the <phase> element. That element is there to allow tools to flag the progress of the document through the localization process, and even keep track of changes through links using the phase-name attribute. Maybe someone from the "Status-Flags" sub-group can address this and give an example?

Yves is quite correct about this. Maybe Tony can give you access to the DataDefinition Yahoo group so that you can see our discussions on that topic.

-----
> 6. Will need to support non-UTF-8 imported entities (e.g. SAE Gen, Fordsym, TEI).
-----

I'm not sure I understand this well. Could you elaborate and maybe give an example?

-----
> 7. Should support SIO, and have more atts needed for inline elements.
-----

Same here. You lost me with "SIO" :) Does it stand for "Serial Input Output", "Shift-In/Shift-Out"? Could you elaborate and maybe give a few examples?

Please elaborate points 6 & 7.

Thanks for taking the time to go through this, David. Hopefully others will be able to elaborate on my answers and possibly address the points I failed (miserably) to understand.

Kind regards,
-yves

Thanks for looking this over. I hope this explains some things. We need to get everyone access to the discussions on the DataDefinition group site.

Cheers,
John

----------------------------------------------------------------
To subscribe or unsubscribe from this elist use the subscription manager:






