Subject: RE: [xliff] Fwd: Handling escaped characters in Translation Units
Paul,

You've raised a good question. As you've discovered, XLIFF is flexible in its representation of escaped characters. I don't believe there's one "right way" or even necessarily a "best way" to convert source to XLIFF. To let you know my perspective, most of the XLIFF work I've done is with XHTML and JavaScript rather than MS C++ or Java. The committee's HTML profile document (currently in draft), in fact, describes two different approaches for processing HTML into XLIFF.

Nevertheless, please consider the following principles:

(1) XLIFF text should not include computer language-specific encoding. Ideally, in my opinion, XLIFF source text should be independent of the computer language from which it was extracted. For example, "&mdash;" and "&copy;" are HTML entities that should be converted to their equivalent binary value, U+2014 and U+00A9 respectively, or to XML character references, "&#x2014;" and "&#xA9;" respectively. Similarly, \uNNNN and \xNN escaped characters in C-like languages should be converted to their binary value or XML character reference, with the binary value being preferred. Line-breaks, \n and \r\n, however, should be converted to an XLIFF tag with ctype="lb", as in <x id="1" ctype="lb" />.

(2) Use standard XLIFF types, when available, rather than custom "x-" types. For example, use ctype="lb" for a line-break.
    <x id="1" ctype="lb" />

Regards,
Doug Domeny

-----Original Message-----
From: Paul Gampe [mailto:pgampe@redhat.com]
Sent: Sunday, May 22, 2005 10:29 PM
To: xliff@lists.oasis-open.org
Subject: [xliff] Fwd: Handling escaped characters in Translation Units

Dear TC,

The xliff-tools project would greatly appreciate your insight on the following problem they have been discussing:

---------- Forwarded Message ----------
Subject: Handling escaped characters in Translation Units
Date: Tuesday 10 May 2005 16:52
From: Asgeir Frimannsson <asgeirf@redhat.com>
To: Paul Gampe <pgampe@redhat.com>
Cc: Jim Hogan <j.hogan@qut.edu.au>

Hi Paul,

Here's an issue we've been discussing up and down on the xliff-tools mailing-list, a discussion initiated by Yves Savourel last week. I believe this is an issue that needs a recommended approach by the XLIFF TC. Let me know what you think :)

Handling Escaped Characters in Translation Units

In source code, it is very common to use escape sequences for characters like newline (\u000A) and horizontal tab (\u0009). For example:

    printf("Please Enter the following Data:\n\
    \t- First Name\n\
    \t- Last Name\n");

Here we've used the escape sequences '\n' and '\t' representing newlines and tabs.
This fragment would be represented in PO as follows:

    msgid ""
    "Please Enter the following Data:\n"
    "\t- First Name\n"
    "\t- Last Name\n"

This could be mapped to XLIFF using two different approaches:

Approach A:

We could preserve the escaped characters:

    <source>Please Enter the following Data:\n\t- First Name\n\
    \t- Last Name\n</source>

We could further enhance this by abstracting the escaped characters to <ph> elements:

    <source>Please Enter the following Data:<ph id='1' ctype='lb'>\n</ph>\
    <ph id='2' ctype='x-ht'>\t</ph>- First Name<ph id='3' ctype='lb'>\n</ph>\
    <ph id='4' ctype='x-ht'>\t</ph>- Last Name<ph id='5' ctype='lb'>\n</ph>\
    </source>

(A trailing "\" in these fragments only marks a line wrapped for this mail; it is not part of the content.)

Issue A-1: If using this approach, would filters have to discard real newline characters (\u000A) in translation units? How would this affect TM lookups?

Issue A-2: How would editors handle this approach? For software messages, they would have to disable entering newlines, and in some way format the message according to the value of the ctype attributes? (Not having visual indicators for e.g. newlines would not be a very translator-friendly approach.)

Issue A-3: Where do we stop? In Java .properties files we usually add a "\u0020" to indicate a leading space. For example:

    my_message = \u0020Some Text

Should this be represented as:

    <source>\u0020Some Text</source>

or

    <source> Some Text</source>

?

Approach B:

Many of the escaped characters have native Unicode values we could use in XLIFF. We could replace '\t' with a real TAB (\u0009) character, and similarly with other escape characters, giving us the following XLIFF fragment:

    <source>Please Enter the following Data:
    	- First Name
    	- Last Name
    </source>

Issue B-1: DOS/Windows use "\r\n", while UNIX (and most programming languages) use "\n" as line endings. How would we know on back-conversion whether we should write "\n" or "\r\n" in the translated source file?

Issue B-2: There are some escape characters used in PO (and probably other source formats?) that XML does not allow. For example the "\a" (\u0007, the Alert or Bell control character). How should these be handled? (Yes, asking the developer what that character is doing in a localised message is a good start.)

Conclusion

It would be good to have a recommended approach for handling this, which all representation guides could share.

The full archived discussion on this is available at:
http://lists.freedesktop.org/archives/xliff-tools/2005-May/000169.html

cheers,
asgeir
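
For readers of this thread, the conversion described in principle (1) of the reply and in Approach B of the forwarded message can be sketched roughly as follows. This is only an illustration under stated assumptions, not a method endorsed by the TC or used by any tool mentioned here: the function name escaped_to_xliff, the small table of recognised escapes, the fixed two-digit \xNN form, and the choice to raise an error for characters that XML 1.0 forbids (Issue B-2) are all hypothetical choices made for this example.

```python
import re
from xml.sax.saxutils import escape

# Single-letter escapes and the characters they denote (an assumed, partial set).
SIMPLE = {"t": "\t", "a": "\x07", "b": "\x08", "\\": "\\", '"': '"'}

# One escape sequence: a line break (\r\n or \n), \uNNNN, \xNN, or \<char>.
ESCAPE = re.compile(
    r"(?P<lb>\\r\\n|\\n)"
    r"|\\u(?P<u>[0-9A-Fa-f]{4})"
    r"|\\x(?P<x>[0-9A-Fa-f]{2})"
    r"|\\(?P<c>.)",
    re.S,
)


def xml_allowed(ch):
    """Return True if ch may appear literally in an XML 1.0 document."""
    cp = ord(ch)
    return (cp in (0x9, 0xA, 0xD) or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD or 0x10000 <= cp <= 0x10FFFF)


def escaped_to_xliff(text):
    """Convert C-style escaped text to XLIFF inline content (hypothetical helper)."""
    out, next_id, pos = [], 1, 0
    for m in ESCAPE.finditer(text):
        out.append(escape(text[pos:m.start()]))  # literal text, XML-escaped
        pos = m.end()
        if m.group("lb"):
            # Line breaks become a standard XLIFF placeholder, not a custom type.
            out.append(f'<x id="{next_id}" ctype="lb" />')
            next_id += 1
            continue
        hex_digits = m.group("u") or m.group("x")
        ch = chr(int(hex_digits, 16)) if hex_digits else SIMPLE.get(m.group("c"))
        if ch is None:
            out.append(escape(m.group(0)))  # unknown escape: keep it verbatim
        elif not xml_allowed(ch):
            # Assumption: flag the string for review instead of guessing.
            raise ValueError(f"U+{ord(ch):04X} is not allowed in XML 1.0")
        else:
            out.append(escape(ch))  # the "binary value" is preferred
    out.append(escape(text[pos:]))
    return "".join(out)


if __name__ == "__main__":
    msg = r"Please Enter the following Data:\n\t- First Name\n\t- Last Name\n"
    print(escaped_to_xliff(msg))
```

Collapsing "\r\n" and "\n" into the same placeholder loses the original line-ending style, so a filter built this way would still need to record that style elsewhere to answer Issue B-1 on back-conversion.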