Subject: Re: [xliff] Fwd: Handling escaped characters in Translation Units
From: Asgeir Frimannsson <asgeirf@redhat.com>
To: xliff-comment@lists.oasis-open.org
Date: Fri, 27 May 2005 14:04:32 +1000

Hi Doug,

(cc'ing xliff-comment as I can't post to the xliff list)

On Tue, 24 May 2005 05:14, Doug Domeny wrote:
> Paul,
>
> You've raised a good question. As you've discovered, XLIFF is flexible in
> its representation of escaped characters. I don't believe there's one
> "right way" or even necessarily a "best way" to convert source to XLIFF. To
> let you know my perspective, most of the XLIFF work I've done is with XHTML
> and JavaScript rather than MS C++ or Java. The committee's HTML profile
> document (currently in draft), in fact, describes two different approaches
> to process HTML to XLIFF. Nevertheless, please consider the following
> principles:
>
> (1) XLIFF text should not include computer language-specific encoding.
>
> Ideally, in my opinion, XLIFF source text should be independent of the
> computer language from which it was extracted. For example, "&mdash;" and
> "&copy;" are HTML entities that should be converted to their equivalent
> binary value, U+2014 and U+00A9 respectively, or XML character references,
> "&#8212;" and "&#169;" respectively. Similarly, \uNNNN and \xNN escaped
> characters in C-like languages should be converted to their binary value or
> XML character reference, with the binary value being preferred.
> Line-breaks, \n and \r\n, however, should be converted to an XLIFF tag with
> ctype="lb", as in <x id="1" ctype="lb" />.
>
> (2) Use standard XLIFF types, when available, rather than custom "x-"
> types. For example, use ctype="lb" for a line-break.
>
> 	<x id="1" ctype="lb" />

I agree with your approach here, except for the handling of escaped white space
characters. I would rather generalize: when 'xml:space' is set to 'preserve'
in a translation unit, it is less useful to use ctype='lb' for representing
line breaks, except for abstracting <br/> style elements.
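
To make the difference concrete, here is a minimal Python sketch of the two
representations. The function name `to_xliff_source` and the `preserve` flag
are made up for illustration and do not come from any existing filter; it
only handles the two-character escapes \n and \t and ignores corner cases
such as an escaped backslash followed by 'n'.

```python
from xml.sax.saxutils import escape


def to_xliff_source(raw: str, preserve: bool) -> str:
    """Render a string that contains the literal escapes \\n and \\t.

    preserve=True  -> real newline/tab characters, meant for a trans-unit
                      that carries xml:space="preserve"
    preserve=False -> each line break becomes an <x ctype="lb"/> placeholder
    """
    # Turn escaped tabs into real tabs, then split on the literal
    # two-character sequence backslash + 'n'.
    parts = raw.replace("\\t", "\t").split("\\n")
    parts = [escape(p) for p in parts]  # escape &, < and > for XML

    if preserve:
        return "<source>" + "\n".join(parts) + "</source>"

    pieces = []
    for i, part in enumerate(parts[:-1], start=1):
        pieces.append(f'{part}<x id="{i}" ctype="lb"/>')
    pieces.append(parts[-1])
    return "<source>" + "".join(pieces) + "</source>"
```

With preserve=True the translator sees the message laid out as it will
appear; with preserve=False the same breaks are only visible as placeholders,
which is the case I find less useful when white space is preserved anyway.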

Similarly, in the HTML representation guide, we don't convert real newlines to
<x ctype='lb'/> in <pre>, <textarea> and xml:space='preserve' elements, even
though we have an XLIFF element suitable for this.

There is, however, a difference if the source format has a concept of newlines
in addition to the escaped newlines, but most software formats do not have
this.

But as you say, maybe there is no right or wrong way here - I'm just trying to 
find the solution which would be easiest and most user friendly for 
translators, and at the same time feasible for XML processing...

Well, just my two cents :)

cheers,
asgeir

> Regards,
>
> Doug Domeny
>
> -----Original Message-----
> From: Paul Gampe [mailto:pgampe@redhat.com]
> Sent: Sunday, May 22, 2005 10:29 PM
> To: xliff@lists.oasis-open.org
> Subject: [xliff] Fwd: Handling escaped characters in Translation Units
>
> Dear TC, the xliff-tools project would greatly appreciate your insight on
> the following problem they have been discussing:
>
> ----------  Forwarded Message  ----------
>
> Subject: Handling escaped characters in Translation Units
> Date: Tuesday 10 May 2005 16:52
> From: Asgeir Frimannsson <asgeirf@redhat.com>
> To: Paul Gampe <pgampe@redhat.com>
> Cc: Jim Hogan <j.hogan@qut.edu.au>
>
> Hi Paul,
>
> Here's an issue we've been discussing up and down on the xliff-tools
> mailing list, a discussion initiated by Yves Savourel last week. I believe
> this is an issue that needs a recommended approach by the XLIFF TC. Let
> me know what you think :)
>
> Handling Escaped Characters in Translation Units
>
> In source code, it is very common to use escape characters for characters
>  like newline (\u000A) and horizontal tab (\u0009).
>
> For example:
>
> printf("Please Enter the following Data:\n\
> \t- First Name\n\
> \t- Last Name\n");
>
> Here we've used the escape characters '\n' and '\t' representing newlines
> and tabs.
>
> This fragment would be represented in PO as follows:
>
> msgid ""
> "Please Enter the following Data:\n"
> "\t- First Name\n"
> "\t- Last Name\n"
>
>
> This could be mapped to XLIFF using two different approaches:
>
> Approach A:
>
> We could preserve the escaped characters:
>
> <source>Please Enter the following Data:\n\t- First Name\n\
> \t- Last Name\n</source>
>
> We could further enhance this by abstracting the escaped characters to <ph>
>  elements:
>
> <source>Please Enter the following Data:<ph id='1' ctype='lb'>\n</ph>\
> <ph id='2' ctype='x-ht'>\t</ph>- First Name<ph id='3' ctype='lb'>\n</ph>\
> <ph id='4' ctype='x-ht'>\t</ph>- Last Name<ph id='5' ctype='lb'>\n</ph>\
> </source>
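
The <ph> abstraction quoted above could be produced along these lines. This
is only a sketch under stated assumptions: `wrap_escapes` and the `CTYPES`
table are hypothetical names, only \n and \t are recognised, and 'x-ht' is
the custom type used in the example rather than a standard one.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical mapping from escape sequence to ctype value. "lb" is the
# line-break type discussed in this thread; "x-ht" is a custom type.
CTYPES = {"\\n": "lb", "\\t": "x-ht"}

# Matches a literal backslash followed by 'n' or 't'.
ESCAPE_RE = re.compile(r"\\[nt]")


def wrap_escapes(raw: str) -> str:
    """Keep the escape sequences as text, but wrap each one in a <ph>."""
    out = []
    last = 0
    next_id = 1
    for m in ESCAPE_RE.finditer(raw):
        out.append(escape(raw[last:m.start()]))  # plain text before the escape
        ctype = CTYPES[m.group()]
        out.append(f"<ph id='{next_id}' ctype='{ctype}'>{m.group()}</ph>")
        next_id += 1
        last = m.end()
    out.append(escape(raw[last:]))  # trailing text after the last escape
    return "<source>" + "".join(out) + "</source>"
```

Because the original escape text stays inside each <ph>, back-conversion is
just a matter of dropping the tags again.
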
>
> Issue A-1: If using this approach, would filters have to discard real newline
> characters (\u000A) in translation units? How would this affect TM lookups?
>
> Issue A-2: How would editors handle this approach? For software messages,
>  they would have to disable entering newlines and somehow format the
>  message according to the values of the ctype attributes? (Not having visual
>  indicators for, e.g., newlines would not be a very translator-friendly
>  approach.)
>
> Issue A-3: Where do we stop? In Java .properties files we usually add a
>  "\u0020" to indicate a leading space. For example:
>
> my_message = \u0020Some Text
>
> Should this be represented as:
>
> <source>\u0020Some Text</source>
> or
> <source> Some Text</source>
> ?
>
> Approach B:
>
> Many of the escaped characters have native Unicode values we could use in
> XLIFF. We could replace '\t' with a real TAB (\u0009) character, and
> similarly with other escape characters, giving us the following XLIFF
> fragment:
>
> <source>Please Enter the following Data:
> 	- First Name
> 	- Last Name
> </source>
>
> Issue B-1: DOS/Windows use "\r\n", while UNIX (and most programming
> languages) use "\n" as line endings. How would we know on back-conversion
> whether to write "\n" or "\r\n" in the translated source file?
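
One possible way around this, sketched below, is to note the original
spelling of the line break at extraction time and reapply it on merge. The
function names are hypothetical, where the recorded style would be stored
(skeleton file, tool-specific metadata) is left open, and strings that mix
both spellings are not handled.

```python
def detect_newline_escape(raw: str) -> str:
    """Return the escaped line-break spelling used in the source string."""
    return "\\r\\n" if "\\r\\n" in raw else "\\n"


def extract_text(raw: str):
    """Replace escaped line breaks with real newlines; remember the style."""
    style = detect_newline_escape(raw)
    return raw.replace(style, "\n"), style


def merge_text(translated: str, style: str) -> str:
    """Write the translated text back using the recorded escape style."""
    # Normalise any real CRLF an editor may have introduced, then re-escape.
    return translated.replace("\r\n", "\n").replace("\n", style)
```

The translator only ever sees real newlines; the "\r\n" versus "\n" choice
travels alongside the unit instead of inside the text.
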
>
> Issue B-2: There are some escape characters used in PO (and probably other
>  source formats?) that XML does not allow. For example, the "\a" (\u0007,
> the Alert or Bell control character). How should these be handled? (Yes,
> asking the developer what that character is doing in a localised message is
> a good start)
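
A filter following Approach B would at least need to detect such characters
before writing them out. The sketch below checks decoded text against the
"Char" production of XML 1.0; the function name is made up, and what to do
with a hit (keep the escape as text, wrap it in a <ph>, or reject the
message) is still the open question.

```python
import re

# Anything outside the XML 1.0 "Char" production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
XML10_INVALID = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def find_invalid_xml_chars(text: str):
    """Return (index, code point) pairs for characters XML 1.0 forbids."""
    return [
        (m.start(), f"U+{ord(m.group()):04X}")
        for m in XML10_INVALID.finditer(text)
    ]
```

For a decoded message such as "Done" followed by a bell character, this
reports the bell at index 4 as U+0007, while tabs, line feeds and carriage
returns pass through.
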
>
> Conclusion
>
> It would be good to have a recommended approach for handling this, which all
> representation guides could share.
>
> The full archived discussion on this is available at:
> http://lists.freedesktop.org/archives/xliff-tools/2005-May/000169.html
>
> cheers,
> asgeir
>