Summary of the comments for the XLIFF 1.1 Review period (Aug-11-2003 to Sep-24-2004).
Note: The threads may differ from the original emails, as they have been re-arranged by topic.
Note: Parts of some messages have been reformatted for better display. Also, the OASIS email repository has had problems storing some emails: several are broken into multiple pieces, sometimes with missing paragraphs.
From Brian Stell (AOL) [Original Message]
I'm new to this list, so if I ask questions that are covered in a FAQ, could someone please point me to it?
I'm interested in using XLIFF to hold HTML (not XHTML).
If I read the 1.1 spec correctly it looks like the spec indicates that the source/target elements are to be parsed by the XML parser; ie: http://www.oasis-open.org/committees/xliff/documents/xliff-specification.htm#Struct_Body
This element may contain inline elements that either remove the codes from the source (<g>, <x/>, <bx/>, <ex/>) or mask off codes left inline (<bpt>, <ept>, <sub>, <it>, <ph>).
I see a suggestion of this in this email: http://lists.oasis-open.org/archives/xliff/200212/msg00031.html
I'm a little confused as it seems like XLIFF is intended to be a holder.
Is it intended that the internal structure of the source/target elements show up as elements in the XLIFF DOM?
If XLIFF is intended to be a holder then I'm unclear on the advantages of forcing the source/target data to be well formed XML.
Is there an advantage to parsing the data in the source/target elements?
Would there be an advantage to allowing or making source/target data CDATA? It would remove the requirement that the source/target data be well formed XML. In my case this would make the handling of HTML much much simpler.
From Yves Savourel (RWS) [Original Message]
Hi Brian,
I'll try to answer your questions:
Ref> Is it intended that the internal structure of the source/target elements to show up as elements in the XLIFF DOM?
Yes.
Ref> If XLIFF is intended to be a holder then I'm unclear on the advantages of forcing the source/target data to be well formed XML.
Yes, it is the intent of XLIFF to be a holder of text with possibly inline codes.
One of the aims of XLIFF is to *abstract* translation units that have inline codes so that, regardless of what the original codes are, they can be processed in a uniform way for most localization tasks (translation memory matching, spell-checking, word counting, terminology extraction, etc.).
A small example:
Original code in RTF:
"The picture is {\b missing}."
XLIFF content:
<source>The picture is <bpt id='1'>{\b </bpt>missing<ept id='1'>}</ept>.</source>
Original code in HTML:
"The picture is <B>missing</B>."
XLIFF content:
<source>The picture is <bpt id='1'><B></bpt>missing<ept id='1'></B></ept>.</source>
The idea is that, in both cases, the XLIFF content is equivalent and already parsed (from the original format's point of view). In other words: text is already separated from codes. In fact, using the <g> tags you could even write identical content for both formats:
<source>The picture is <g id='1'>missing</g>.</source>
This will allow tools to treat the inline codes without distinction. For example, we could get a 100% match when leveraging the RTF text in a HTML file.
Ref> Would there be an advantage to allowing or making source/target data CDATA? It would remove the requirement that the source/target data be well formed XML. In my case this would make the handling of HTML much much simpler.
If we had a content as CDATA:
<source><![CDATA[The picture is {\b missing}.]]></source>
<source><![CDATA[The picture is <B>missing</B>.]]></source>
all the translation tools would have to come up with their own parsing for both formats (and any other format), and this each time they manipulate the source/target content.
The need for pre-parsing comes from the goal of having a common way to understand and manipulate the inline codes, regardless of the original format (HTML, RTF, MIF, RC, RESX, Java properties, JSP, Photoshop files, etc.).
Keep also in mind that, as for other formats, only the inline elements of HTML (<b>, <em>, <img>, etc.) will be in the source/target content, not any of the structural elements (<table>, <li>, <tr>, etc.). From a translation tool's viewpoint there is no reason to treat them differently from other formats.
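[Editor's illustration] A minimal sketch, using only the Python standard library, of the benefit Yves describes: two <source> elements extracted from different formats (RTF and HTML) reduce to the same abstract text once the native codes inside <bpt>/<ept> are ignored, so a translation-memory lookup could match them at 100%. The placeholder notation is invented here for illustration; note that native code inside <bpt>/<ept> must be XML-escaped.

```python
import xml.etree.ElementTree as ET

def abstract_text(source_xml: str) -> str:
    """Return the translatable text of a <source>, with each native
    code span replaced by a format-neutral placeholder."""
    elem = ET.fromstring(source_xml)
    parts = [elem.text or ""]
    for child in elem:
        # The native code stored inside <bpt>/<ept> is skipped; only a
        # placeholder and the text following the element are kept.
        parts.append("{%s:%s}" % (child.tag, child.get("id", "")))
        parts.append(child.tail or "")
    return "".join(parts)

rtf = ("<source>The picture is <bpt id='1'>{\\b </bpt>"
       "missing<ept id='1'>}</ept>.</source>")
html = ("<source>The picture is <bpt id='1'>&lt;B&gt;</bpt>"
        "missing<ept id='1'>&lt;/B&gt;</ept>.</source>")

# Different original formats, identical abstract content:
assert abstract_text(rtf) == abstract_text(html)
print(abstract_text(rtf))  # The picture is {bpt:1}missing{ept:1}.
```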
From Yves Savourel (RWS) [Original Message]
Hi everyone,
Just to add to the discussion, I've attached an example of HTML extracted to XLIFF:
- TestPage.htm is the original file.
- TestPage.xlf is the XLIFF extraction.
- TestPageOutput.htm is the original file with the extracted parts marked up between braces ('{...}').
This illustrates a few things:
- Only translatable text and codes within translatable text are extracted.
- Using inline codes in <source> (and <target>) allows XLIFF to present a common representation of the codes to all tools, regardless of the original format of the codes. But at the same time, the original code is available if it needs to be.
- It is possible to have some context information by using the existing values for restype. However, arguably better context information would be the actual name of the tag (or the sequence of tags) from which the text was extracted. For example, here there is no distinction between text extracted from a title attribute and the content of a <caption> element: both are annotated with restype='caption' (there is no 'title' value in the current list of restype values). It's likely that we will need to add new values to the restype list or find a better mechanism (using <context>, for example).
- Sub-flow text (e.g. the text of the alt attribute in a <img> element) can be extracted as a separate entry. Another way to do this would be to use the <sub> element inside the <bpt> element containing the image.
Note this is just one way of mapping HTML to XLIFF, but as it was already mentioned the TC needs to develop a set of guidelines for this, so everyone could do the same mapping.
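[Editor's illustration] The <sub> alternative mentioned above for sub-flow text might look something like the following hypothetical fragment (not taken from the attached files; the id values and alt text are invented). The alt attribute's value is marked as translatable inside the native <img> code held by a <ph> element:

```xml
<trans-unit id='7'>
  <source xml:lang='en'>See the <ph id='1'>&lt;img src="map.gif"
    alt="<sub>Map of the site</sub>"&gt;</ph> for details.</source>
</trans-unit>
```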
From Brian Stell (AOL) [Original Message]
Yves,
Ref> ... Just to add to the discussion, I've attached an example of HTML extracted to XLIFF:
- TestPage.htm is the original file.
- TestPage.xlf is the XLIFF extraction.
Thank you for the examples.
If you don't mind I'd like to discuss a couple of items.
Readability of the inline marked up text
The inline markup seems to add a lot of bulk to the text. Consider this paragraph from TestPage.htm:
<p>Some text in <b>bold</b>, <i>italic</i>, <b><i> bold and italic</i></b>, and a <a href="#topoffile">link</a>.</p>
In the XLIFF extraction, the inline marked-up text becomes:
<source xml:lang='en'>Some text in <bpt id='1' ctype='bold'><b></bpt>bold<ept id='1'></b></ept>, <bpt id='2' ctype='italic'><i></bpt>italic <ept id='2'></i></ept>, <bpt id='3' ctype='bold'><b></bpt> <bpt id='4' ctype='italic'><i></bpt> bold and italic<ept id='4'></i></ept><ept id='3'></b></ept>, and a <bpt id='5' ctype='link'><a href="#topoffile"></bpt>link<ept id='5'></a></ept>.</source>
It's my current thinking that the original version is fairly readable, and I'd be comfortable presenting it to a localizer.
I am very concerned that the inline marked-up version is a bit hard to read, and that it is not appropriate for a localizer to work with.
It would be possible for a tool to convert the data back to the more readable form. However, unless this is a requirement of XLIFF I would expect that we will see at least a generation of tools that present the inline marked up form directly to the translator.
Using the <g> tag
Using the <g> tag this text:
<p>Some text in <b>bold</b>, <i>italic</i>, <b><i> bold and italic</i></b>, and a <a href="#topoffile">link</a>.</p>
would become (something like) this:
<g xid="g1">Some text in <g xid="g2">bold</g>, <g xid="g3">italic</g>, <g xid="g2"><g xid="g3"> bold and italic</g></g>, and a <g xid="g4">link</g>.</g>
<trans-unit id='g1'>
<source xml:lang='en'>p</source>
</trans-unit>
<trans-unit id='g2'>
<source xml:lang='en'>b</source>
</trans-unit>
<trans-unit id='g3'>
<source xml:lang='en'>i</source>
</trans-unit>
<trans-unit id='g4'>
<source xml:lang='en'>a href="#topoffile"</source>
</trans-unit>
</trans-unit>
I like that the <g> tag is a lot less bulky than the <it>, <bpt>, <ept> tags, but then the meaning of the tag is moved away from the source element.
Yves Savourel (RWS) [Original Message]
Hello Brian,
Ref> Readability of the inline marked up text.
Yes, the <bpt>, <ept>, <it> and <ph> elements make the segment much larger than the original, even if you strip them down to the minimum (only an id attribute). This is especially true when the original file is an SGML or XML file, as all the '<' and '&' of the code parts need to be escaped. That is, alas, the price to pay for having the codes pre-parsed and in the segment itself.
As you pointed out, one solution for this is to use the <g> and <x/> elements instead. But that has the drawback of not giving access to the original code.
I think an XLIFF-enabled tool should be able to provide a display that removes any clutter from the translator's view.
The tools that would have more issues with this are the tools enabled for XML only, which don't know what to do with the content of <bpt> etc. except that it should be protected. Some of those tools, like TagEditor from Trados, have options to reduce the display of tags to almost nothing on screen, so --at least in some cases-- the bulk of <bpt>-like elements is not too much of a problem there.
The worst-case scenario would be when you translate an XLIFF file in a form where everything is visible, such as when adding a color-coded RTF layer to an XLIFF file and working with it in Word.
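[Editor's illustration] A rough sketch of the kind of display reduction described above (an assumption about how a tool might do it, not any real CAT-tool API): the bulky <bpt>/<ept> markup is collapsed to short numbered placeholders on screen, while the underlying XLIFF keeps the original codes intact. The bracket notation is invented for this example.

```python
import xml.etree.ElementTree as ET

# On-screen placeholder shapes per inline-element type (invented convention).
PLACEHOLDER = {"bpt": "[{id}>", "ept": "<{id}]", "ph": "[{id}/]", "it": "[{id}|]"}

def translator_view(source_xml: str) -> str:
    """Collapse each inline code element of a <source> to a compact
    placeholder, keeping only the translatable text visible."""
    elem = ET.fromstring(source_xml)
    parts = [elem.text or ""]
    for child in elem:
        tag = child.tag.split("}")[-1]  # drop any namespace prefix
        parts.append(PLACEHOLDER.get(tag, "[?]").format(id=child.get("id", "?")))
        parts.append(child.tail or "")
    return "".join(parts)

src = ("<source>Some text in <bpt id='1'>&lt;b&gt;</bpt>bold"
       "<ept id='1'>&lt;/b&gt;</ept>, and a "
       "<bpt id='5'>&lt;a href=\"#top\"&gt;</bpt>link"
       "<ept id='5'>&lt;/a&gt;</ept>.</source>")
print(translator_view(src))  # Some text in [1>bold<1], and a [5>link<5].
```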
Ref> Using the <g> tag.
The <g> and <x/> tags are meant to be used a little differently than in the example you gave. The original codes are not inside the XLIFF content (i.e. not in other <trans-unit> elements). Instead, they reside in the skeleton: the file where the non-localizable parts of the original file are kept.
How you store them there is completely up to the filter creating the XLIFF document. For the time being XLIFF provides only:
- a way to either embed the skeleton in the XLIFF document (<internal-file>) or externalize it (<external-file>).
- a way to link the <g> and <x/> elements back to their corresponding codes in the skeleton (by using the id attribute).
So to reuse your example, the segment would look like:
<source>Some text in <g id="g2">bold</g>, <g id="g3">italic</g>, <g id="g2"><g id="g3">bold and italic</g></g>, and a <g id="g4">link</g>.</source>
[Note that, unlike in your example, the <p> element would not be part of the inline codes, as it is a "structural" code and shouldn't be in the segment (it would cause problems if you wanted to leverage the segment somewhere else).]
As for the original codes ("<b>", "</b>", "<i>", etc.) they would be somewhere in the skeleton.
The main drawback with this mechanism is that, because there is currently no standard way to represent the skeleton, only tools knowing the given type of skeleton created are able to access the original codes if they need to. The others are left with whatever information is in the <g> and <x/> elements.
Will we have a standard skeleton format (as John also asked in his email yesterday)? There have been some talks about it in the TC meetings, but so far nothing has been decided.
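[Editor's illustration] Since the skeleton format is tool-defined, the following is only a hypothetical sketch of how an extracting filter might store the original codes in an <internal-file> skeleton and link back to them via the id attribute of <g>. The {g2} placeholder convention inside the skeleton is invented here; any filter could choose a different one:

```xml
<file original="TestPage.htm" source-language="en" datatype="html">
  <header>
    <skl>
      <!-- Non-localizable parts of the original file; {g2} marks where
           the code referenced by <g id="g2"> belongs. -->
      <internal-file><![CDATA[<p>Some text in {g2}bold{/g2}.</p>]]></internal-file>
    </skl>
  </header>
  <body>
    <trans-unit id="1">
      <source xml:lang="en">Some text in <g id="g2">bold</g>.</source>
    </trans-unit>
  </body>
</file>
```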
From Nico van de Water (MITIA) [Original Message]
Hello Brian and Yves,
First, Yves, thanks for your very lucid explanation of these aspects of XLIFF. Although I did a bit of reading up on XLIFF, your examples are brilliant.
However, ideal as an XML / XLIFF environment may seem (especially from a productivity point of view), as someone with nearly 25 years in translation and 17 in localisation, I must emphasise that in most cases an XML / XLIFF environment strips away the most vital aspect of the translator's job, namely the context of the element / sentence / string to translate.
Ref> Keep also in mind that, as for other formats, only the inline elements of HTML (<b>, <em>, <img>, etc.) will be in the source/target content, not any of the structural elements (<table>, <li>, <tr>, etc.). From a translation tool's viewpoint there is no reason to treat them differently from other formats.
A string like "Set the parameters" in a table header will probably be translated differently from the same string as the first in a list of instructions at, let us say, table row level. Context, and structural context at that, is essential.
Especially when translating Help and software, a correct interpretation of the intended use of the string is crucial. In the "good old days", the format of the .C, .H, and .RC files would provide such a context, as would the .RTF files for the Help. Pouring all the strings together in an XML/XLIFF environment robs the localiser / translator of such essential information as where does one menu end and the next start (required to avoid duplicate mnemonics), is it a header or instruction (usually found out by looking at the print-out of the .RTF file), etc.
I agree with many of you that the uniform presentation of the strings to be translated may lead to more consistency and thus higher-quality translation. The downside, I am afraid, is that Project Managers will be busier than ever answering questions like 'Is this an instruction or a header?', etc.
Localising software in itself is already difficult enough; incorporating strings and Help text in one big XML / XLIFF corpus will at times be a nightmare.
I have been using XML-based corpuses for translation for well over a year now, and so far the lack of context (often because the source texts were devoid of a printable context) outweighs the advantages of uniformity and consistency. Even worse, the very absence of context may lead to more translation errors ...
I do recognise the vast advantages of both XML and XLIFF in their own specific ways (XML for processing and transferring data, and XLIFF for terminology management and maintenance), but combined use in a translation environment is, for the time being, not my favourite.
From Yves Savourel (RWS) [Original Message]
For some reason I can't get through on the xliff-comment. So I'm posting this here (xliff list).
Hello Nico,
Thanks for your comments; I'm glad you made them. The importance of context for the translator is an aspect developers of translation tools and processes have a tendency to forget.
I would go even one step further: Ideally the translator should have access to the final form of the data (the formatted page, the dialog box, etc.) as often the format of the source file itself may not be enough to get a correct context.
There are a few aspects of XLIFF or XLIFF-related process that may help with this issue:
- Some of the metadata like restype can hold a clue about the origin of the data (for example that it's a checkbox, or a radio button, vs a normal paragraph, and so forth).
- Grouping of items together (e.g. the controls of a dialog box, the hierarchy of a set of menus, etc.) can be maintained in XLIFF by using the <group> elements adequately (they can be nested if necessary).
- Some information about the context can also be carried along with the translation units through the <context-group> and <context> elements.
That said, I realize it is not enough to ensure you get the context you need, because those aspects are mechanisms that *could be used*, with no certainty that they will be.
To help in that regard, one of the aims the XLIFF TC has is to create "profiles", documents that describe the recommended way to represent a given format in XLIFF. This is one area where we can push for the use of the context-oriented mechanisms.
As you pointed out, the lack of context goes beyond XLIFF: it happens more and more in the source format itself (properties files, XML documents, etc.). I think the long-term effort is to "educate" the designers of source formats (e.g. creators of XML document types) about the importance of having information about the context in the source file itself. This should be part of the guidelines for designing internationalized schemas and DTDs.
I would also add that XLIFF may not necessarily be the format of choice for translating *all* formats. If tools support a given format correctly, there is probably no need to complicate the process by extracting to XLIFF and merging back. At the same time, that same format may benefit from being extracted to XLIFF for tasks other than translation (e.g. terminology extraction, validation, alignment, and many others), so we have to make sure XLIFF can support it.
In some circumstances XLIFF can actually help bring more context. For example, I'm currently working on a tool for image/graphic localization, and in this case XLIFF is helping me to associate context information with the text to localize. I've attached an example: an XLIFF file that uses XSL to render pictures (they could be the bitmaps of a UI) along with the text to localize. See the Trados screen shot, or just open the XLF file to see. Note the comment displayed for the last image. We could not do this directly with HTML (the comments would be either not visible (<!-- -->) or seen as translatable). So I think in some cases XLIFF can help bring more context, while in other cases we need to be careful not to lose the minimal information set the translator needs.
From Brian (AOL) [Original Message]
Nico,
Ref> ... in most cases an XML / XLIFF environment strips away the translator's most vital aspect of his job, namely the context of the element / sentence / string to translate.
The issue of "what's the context for this string" is a very important issue.
It is a separate issue from how non-well-formed-XML is handled so I've changed the title of this thread.
I suspect that the XLIFF spec tries to address this with the <note> tag. http://www.oasis-open.org/committees/xliff/documents/xliff-specification.htm#note
An alternative I see is to produce a document describing what the items are. This seems like a high-effort method.
From Yves Savourel (RWS) [Original message]
Hi everyone,
I posted these additional examples on Friday, but for some reason they bounced back, so here they are again:
The attached file has two examples of extracted HTML, with the original code for context. The idea is to show how we could include the whole original data inside the XLIFF body. There are two approaches I can think of:
1) Using <context-group> and <context>
Snippet from Example_ContextElem.xlf:
...
<trans-unit id='1' restype='caption'>
<source xml:lang='en'>Title of the test page</source>
<context-group name='ctx'>
<context context-type='x-raw'><![CDATA[<html>
<head>
<meta http-equiv="Content-Language" content="en-us">
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
<title>]]></context>
</context-group>
</trans-unit>
<trans-unit id='2' restype='heading'>
<source xml:lang='en'><bpt id='1'><a name="topoffile"></bpt><ept id='1'></a></ept>Test Page</source>
<context-group name='ctx'>
<context context-type='x-raw'><![CDATA[</title>
</head>
<body>
<h1>]]></context>
</context-group>
</trans-unit>
...
There are a few drawbacks here:
- Currently <context> doesn't have an xml:space attribute, so we can't use it to indicate that the white space in the content of <context> should be preserved.
- The current order in <trans-unit> has <source> before <context-group> so we end up with the context showing after the extracted text, not a big deal since both elements are siblings.
- The context-type attribute has no pre-defined value to indicate such content in <context>.
- The name attribute in <context-group> has no real use in this case but is required.
The advantage is that we don't use a user-defined namespace.
2) Using a user-defined namespace
Snippet from Example_UserExtension.xlf:
<group>
<ext:code xml:space='preserve'><![CDATA[<html>
<head>
<meta http-equiv="Content-Language" content="en-us">
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
<title>]]></ext:code>
<trans-unit id='1' restype='caption'>
<source xml:lang='en'>Title of the test page</source>
</trans-unit>
</group>
<group>
<ext:code xml:space='preserve'><![CDATA[</title>
</head>
<body>
<h1>]]></ext:code>
<trans-unit id='2' restype='heading'>
<source xml:lang='en'><bpt id='1'><a name="topoffile"></bpt><ept
id='1'></a></ept>Test Page</source>
</trans-unit>
</group>
...
This approach is more efficient since you can have whatever you need in the new namespace (e.g. xml:space). It also makes it possible to have a single context for multiple <trans-unit> elements (useful when there are sub-flows, as in <trans-unit> #18 and #19).
The drawback is that it's user-defined.
I can see some scenarios where having such information within the XLIFF body could be useful. This is very similar to the structure of a Trados TTX file, except you get all the additional features a <trans-unit> can have (<alt-trans>, etc.).
From Yves Savourel (RWS) [Original Message]
Hi,
Working on some XLIFF example I ran into a small problem in our schema:
I noticed that the value types of our different id attributes are sometimes 'string', sometimes 'NMTOKEN'. So, for example, because of the '\' the following id value is not valid:
<bin-unit id='work\image1.jpg' mime-type='image/jpeg'>...
This wouldn't be too bad if we could escape the backslash, but even then, it's not easy. The only ASCII non-letter, non-digit characters allowed are: '.', '_', ':', and '-'. We can't use '$', '&', '%', '/', etc., which makes for potentially a lot of characters to escape.
The value type of the id attributes in <x>, <bpt>, etc. is 'string', which allows pretty much anything. So I have two questions:
- Shouldn't all id attributes have the same value type?
- If yes, should this type be string or NMTOKEN or something else?
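[Editor's illustration] A quick demonstration of the NMTOKEN problem Yves raises, using a simplified ASCII-only approximation of the XML Nmtoken production (the real production also allows many non-ASCII letters): the backslash is not an XML name character, so an id like 'work\image1.jpg' fails NMTOKEN validation, while almost anything is acceptable as a plain string.

```python
import re

# Simplified ASCII-only approximation of XML's Nmtoken production:
# name characters here are letters, digits, '.', '-', '_' and ':'.
NMTOKEN = re.compile(r"^[A-Za-z0-9._:\-]+$")

def is_nmtoken(value: str) -> bool:
    return bool(NMTOKEN.match(value))

assert not is_nmtoken(r"work\image1.jpg")  # backslash is not allowed
assert is_nmtoken("work.image1.jpg")       # dots are fine
assert is_nmtoken("img-1_a:b")             # '-', '_', ':' are fine
```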
From Enda McDonnell (Alchemy Software) [Original Message]
Hi Yves,
I agree with your suggestion - I think all id attributes should have the same type. This brings some consistency for parsers reading xliff elements.
From Enda McDonnell (Alchemy Software) [Original Message]
NMTOKEN or string?
From a Catalyst perspective we read in all id values as strings. We then check if the string value is a number, and if so, use the numeric value, alternatively we hash up the string into a numeric value for identification purposes.
It is likely that all parsers at this stage read the XML attributes as strings, unless they have used special code-generation products which create classes based on a schema.
So, I think the most flexible option is to allow the id to be a string.
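[Editor's illustration] A sketch of the general approach Enda describes (my guess at its shape, not Catalyst's actual code): purely numeric id strings keep their value, and any other string is hashed to a stable numeric key for identification purposes.

```python
import zlib

def id_to_number(id_value: str) -> int:
    """Map an id string to a numeric key: numeric ids keep their value,
    others get a stable CRC32-derived hash."""
    if id_value.isdigit():
        return int(id_value)
    return zlib.crc32(id_value.encode("utf-8"))

assert id_to_number("1032") == 1032
# Hashing is deterministic, so the same string id always maps
# to the same numeric key:
assert id_to_number(r"work\image1.jpg") == id_to_number(r"work\image1.jpg")
```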
I have a question though regarding your sample... Would you not use the resname attribute for the text identifier, while the id is used as an internally unique id in the xliff file?
<bin-unit id='1032' resname='work\image1.jpg' mime-type='image/jpeg'>
From Yves Savourel (RWS) [Original Message]
Hi Enda,
Ref> I have a question though regarding your sample... Would you not use the resname attribute for the text identifier, while the id is used as an internally unique id in the xliff file?
<bin-unit id='1032' resname='work\image1.jpg' mime-type='image/jpeg'>
Yes, resname should probably be there too and hold that value. But the id is driven by the tool, and it is not necessarily a numeric ID unique within the XLIFF document (you could have several <file> elements using identical IDs in an <xliff> element).
From the specification: "The id attribute is used in many elements as a reference to the original corresponding code data or format for the given element. The value of the id element is determined by the tool creating the XLIFF document. It may or may not be a resource identifier. The identifier of a resource should, at least, be stored in the resname attribute".
It does make sense for the id to be a unique value (number/text) within each <file> element of the document, but I don't know of any way to enforce this through a DTD or schema.
From Bryan Schnabel (Tektronix) [Original Message]
Hi Yves,
XML Schema does provide a way to enforce uniqueness within each <file> element, while allowing repeated values within the XLIFF file (http://www.w3.org/TR/xmlschema-0/#specifyingUniqueness). But I think you mean that assigning an attribute named "id" the simpleType "ID" prevents enforcing uniqueness within a certain scope (i.e., within each <file> element).
I agree that DTD and Schema require the type ID to be unique within an entire file.
Traditionally, attributes named "id" have had the type "ID". Given the way the XLIFF specification defines "id", though, it seems most appropriate, from my point of view, for its type to be string rather than NMTOKEN.
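[Editor's illustration] The scoped uniqueness Bryan mentions could be expressed with an xsd:unique identity constraint on the <file> element declaration, roughly as sketched below (a hypothetical fragment; the constraint name, xlf prefix, and elided content model are assumptions, not the actual XLIFF schema):

```xml
<xsd:element name="file">
  <xsd:complexType>
    <!-- content model elided -->
  </xsd:complexType>
  <!-- Each trans-unit/@id must be unique within this <file>, but may
       repeat in other <file> elements of the same document. -->
  <xsd:unique name="uniqueTransUnitId">
    <xsd:selector xpath=".//xlf:trans-unit"/>
    <xsd:field xpath="@id"/>
  </xsd:unique>
</xsd:element>
```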
From John Corrigan (Sun) [Original Message]
Hi all,
I have had a look at the XLIFF 1.1 committee spec. and it looks rather good.
I have one comment to make. In the description for <note> it says: "The content of <note> may be instructions from developers about how to handle the <source>, comments from the translator about the translation, or any comment from anyone involved in processing the XLIFF file."
However, in <trans-unit> elements, the only way to determine whether a <note> refers to the <source> or the <target> is to rely on some convention for the 'from' attribute. I don't think this is how the 'from' attribute was intended to be used. Would it be possible to add another attribute to the <note> element to allow users to indicate whether a <note> refers to <source> or <target> elements?
This attribute should be optional, with a default value that indicates that the <note> doesn't annotate anything in particular. May I suggest calling the attribute 'annotates', with values drawn from, "source", "target", and "general". The value "general" would be the implied default.
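[Editor's illustration] The proposed attribute (which is not part of XLIFF 1.1; this fragment only sketches John's suggestion with invented content) might be used like this:

```xml
<trans-unit id="1">
  <source xml:lang="en">File not found</source>
  <target xml:lang="fr">Fichier introuvable</target>
  <note annotates="source">Here "File" is a noun, not a verb.</note>
  <note annotates="target">Reviewed by the French team.</note>
</trans-unit>
```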
P.S. Are there any plans to produce a standardized skeleton file format in future releases?
From John Reid (Novell) [Original Message]
Hi John,
It's good to hear from you, again.
Ref>... However in <trans-unit> elements, the only way to determine whether a <note> refers to the <source> or the <target> is to rely on some convention for the 'from' attribute.
Please, can you elaborate on how you see this being utilized? The 1.1 spec says, "All child elements of <trans-unit> pertain to their sibling <source> element." Thus, a strict reading says any <note> in a <trans-unit> pertains to the <source>. Unless there is consideration to programmatically differentiate between notes relating to the <source> or <target>, exclusive of the other, there seems to be little advantage in adding the attribute.
Ref> Are there any plans to produce a standardized skeleton file format in future releases?
There is a usefulness, if not a necessity, for this in the context of the profiles we wish to specify. In other words, a standardization of the skeleton file for files of the same file type makes perfect sense. However, a single standard skeleton format across file types could be impossible. For example, the skeleton files generated for a DLL must be markedly different from those produced for an HTML file. It would be advantageous, though, to create a standard skeleton format for HTML, another for DLLs, and others for those formats we specify profiles for.
From Gérard Cattin des Bois (Microsoft) [Original message]
That makes sense John and Hi to John (1 and 2).
-end-