xliff message

Subject: RE: [xliff] Fwd: Handling escaped characters in Translation Units
From: "Doug Domeny" <ddomeny@ektron.com>
To: "'Paul Gampe'" <pgampe@redhat.com>,<xliff@lists.oasis-open.org>
Date: Mon, 23 May 2005 15:14:20 -0400

Paul,

You've raised a good question. As you've discovered, XLIFF is flexible in
its representation of escaped characters. I don't believe there's one "right
way" or even necessarily a "best way" to convert source to XLIFF. To let you
know my perspective, most of the XLIFF work I've done is with XHTML and
JavaScript rather than MS C++ or Java. The committee's HTML profile document
(currently in draft), in fact, describes two different approaches to
converting HTML to XLIFF. Nevertheless, please consider the following
principles:

(1) XLIFF text should not include computer language-specific encoding.

Ideally, in my opinion, XLIFF source text should be independent of the
computer language from which it was extracted. For example, "&mdash;" and
"&copy;" are HTML entities that should be converted to the Unicode
characters they represent, U+2014 and U+00A9 respectively, or to XML
character references, "&#8212;" and "&#169;" respectively. Similarly, \uNNNN
and \xNN escaped characters in C-like languages should be converted to the
literal character or to an XML character reference, with the literal
character being preferred. Line-breaks, \n and \r\n, however, should be
converted to an XLIFF tag with ctype="lb", as in <x id="1" ctype="lb" />.

(2) Use standard XLIFF types, when available, rather than custom "x-" types.
For example, use ctype="lb" for a line-break.

	<x id="1" ctype="lb" />
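
The two principles above can be sketched in a few lines of Python. This is
only an illustration, not part of any XLIFF toolkit: the function name
to_xliff_source and the choice to handle HTML entities and C-like escapes in
one pass are assumptions made for the example, and it does not handle an
escaped backslash in front of an escape sequence.

```python
import html
import re
from xml.sax.saxutils import escape

# Hypothetical pattern: \uNNNN, \xNN, and the line-break escapes \r\n and \n
# as they appear literally (backslash + letter) in extracted source text.
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\x([0-9A-Fa-f]{2})|(\\r\\n|\\n)")

def to_xliff_source(text):
    """Return XLIFF inline content for one extracted string (sketch only)."""
    # Principle 1: HTML entities become the characters they stand for,
    # e.g. "&mdash;" -> U+2014 and "&copy;" -> U+00A9.
    text = html.unescape(text)
    parts = []
    pos = 0
    next_id = 1
    for m in _ESCAPE.finditer(text):
        parts.append(escape(text[pos:m.start()]))
        if m.group(3):
            # Principle 2: line-breaks use the standard ctype="lb".
            parts.append('<x id="%d" ctype="lb" />' % next_id)
            next_id += 1
        else:
            # \uNNNN or \xNN becomes the literal character.
            parts.append(escape(chr(int(m.group(1) or m.group(2), 16))))
        pos = m.end()
    parts.append(escape(text[pos:]))
    return "".join(parts)

print(to_xliff_source(r"Save &amp; exit\nDone"))
```

Markup-significant characters such as "&" are re-escaped with
xml.sax.saxutils.escape so the result stays well-formed XML.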

Regards,
 
Doug Domeny

-----Original Message-----
From: Paul Gampe [mailto:pgampe@redhat.com] 
Sent: Sunday, May 22, 2005 10:29 PM
To: xliff@lists.oasis-open.org
Subject: [xliff] Fwd: Handling escaped characters in Translation Units

Dear TC, the xliff-tools project would greatly appreciate your insight on
the following problem they have been discussing:

----------  Forwarded Message  ----------

Subject: Handling escaped characters in Translation Units
Date: Tuesday 10 May 2005 16:52
From: Asgeir Frimannsson <asgeirf@redhat.com>
To: Paul Gampe <pgampe@redhat.com>
Cc: Jim Hogan <j.hogan@qut.edu.au>

Hi Paul,

Here's an issue we've been discussing at length on the xliff-tools
mailing list, a discussion initiated by Yves Savourel last week. I believe
this is an issue that needs a recommended approach from the XLIFF TC. Let me
know what you think :)

Handling Escaped Characters in Translation Units

In source code, it is very common to use escape sequences for characters
like newline (\u000A) and horizontal tab (\u0009).

For example:

printf("Please Enter the following Data:\n\
\t- First Name\n\
\t- Last Name\n");

Here we've used the escape sequences '\n' and '\t', representing newlines
and tabs.

This fragment would be represented in PO as follows:

msgid ""
"Please Enter the following Data:\n"
"\t- First Name\n"
"\t- Last Name\n"


This could be mapped to XLIFF using two different approaches:

Approach A:

We could preserve the escaped characters:

<source>Please Enter the following Data:\n\
\t- First Name\n\t- Last Name\n</source>

We could further enhance this by abstracting the escaped characters to <ph>
 elements:

<source>Please Enter the following Data:<ph id='1' ctype='lb'>\n</ph>\
<ph id='2' ctype='x-ht'>\t</ph>- First Name<ph id='3' ctype='lb'>\n</ph>\
<ph id='4' ctype='x-ht'>\t</ph>- Last Name<ph id='5' ctype='lb'>\n</ph>\
</source>
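
As a rough sketch of how a filter might produce the <ph> form of Approach A,
the Python below wraps each \n and \t escape in a <ph> element and numbers
the ids in order. The function name wrap_escapes and the CTYPES table are
made up for this example; "lb" and "x-ht" are the ctype values used in the
fragment above.

```python
import re
from xml.sax.saxutils import escape

# Maps the literal two-character escapes to the ctype values used in the
# message: "lb" is a standard value, "x-ht" is a custom one.
CTYPES = {r"\n": "lb", r"\t": "x-ht"}
_TOKEN = re.compile(r"\\[nt]")

def wrap_escapes(text):
    """Wrap \\n and \\t escapes in <ph> elements, keeping the escape text."""
    out = []
    pos = 0
    for i, m in enumerate(_TOKEN.finditer(text), start=1):
        out.append(escape(text[pos:m.start()]))  # plain text between escapes
        out.append("<ph id='%d' ctype='%s'>%s</ph>"
                   % (i, CTYPES[m.group()], m.group()))
        pos = m.end()
    out.append(escape(text[pos:]))
    return "<source>" + "".join(out) + "</source>"

print(wrap_escapes(r"Please Enter the following Data:\n\t- First Name\n"))
```

Because the original escape text is kept inside each <ph>, back-conversion
can copy it out unchanged.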

Issue A-1: If using this approach, would filters have to discard real
newline characters (\u000A) in translation units? How would this affect TM
lookups?

Issue A-2: How would editors handle this approach? For software messages,
they would have to disable entering newlines, and in some way format the
message according to the value of the ctype attributes? (Not having visual
indicators for e.g. newlines would not be very usable for translators.)

Issue A-3: Where do we stop? In Java .properties files we usually add a
"\u0020" to indicate a leading space. For example:

my_message = \u0020Some Text

Should this be represented as:

<source>\u0020Some Text</source>
or
<source> Some Text</source>
?

Approach B:

Many of the escaped characters have native Unicode values we could use in
XLIFF. We could replace '\t' with a real TAB (\u0009) character, and
similarly for other escape sequences, giving us the following XLIFF fragment:

<source>Please Enter the following Data:
	- First Name
	- Last Name
</source>
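
A minimal Python sketch of Approach B might look like the following. The
names SIMPLE_ESCAPES, decode_escapes and to_source are invented for the
example, and only a handful of escapes are covered; a real filter would need
the full escape table of its source format.

```python
from xml.sax.saxutils import escape

# Small, deliberately incomplete table of escapes and their real characters.
SIMPLE_ESCAPES = {"n": "\n", "t": "\t", "\\": "\\", '"': '"'}

def decode_escapes(text):
    """Replace backslash escapes with the characters they stand for."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text) and text[i + 1] in SIMPLE_ESCAPES:
            out.append(SIMPLE_ESCAPES[text[i + 1]])
            i += 2  # skip the backslash and the escape letter
        else:
            out.append(ch)  # unknown escapes are left untouched
            i += 1
    return "".join(out)

def to_source(text):
    """Build a <source> element holding the decoded, XML-escaped text."""
    return "<source>" + escape(decode_escapes(text)) + "</source>"

print(to_source(r"Please Enter the following Data:\n\t- First Name\n"))
```

Scanning left to right means an escaped backslash followed by "n" stays a
backslash and an "n" rather than turning into a newline.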

Issue B-1: DOS/Windows use "\r\n", while UNIX (and most programming
languages) use "\n" as line endings. How would we know on back-conversion
whether to write "\n" or "\r\n" in the translated source file?

Issue B-2: There are some escape sequences used in PO (and probably other
source formats?) that XML does not allow. For example the "\a" (\u0007, the
Alert or Bell control character). How should these be handled? (Yes, asking
the developer what that character is doing in a localised message is a good
start.)
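
To make the restriction concrete, here is a small Python sketch that reports
characters falling outside the Char production of XML 1.0 (tab, line feed,
carriage return, U+0020 to U+D7FF, U+E000 to U+FFFD, U+10000 to U+10FFFF).
The function names are hypothetical, and what a filter should do with the
reported characters is left open, as in the question above.

```python
def is_xml10_char(ch):
    """True if ch is allowed by the XML 1.0 Char production."""
    cp = ord(ch)
    return (cp in (0x9, 0xA, 0xD)
            or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF)

def illegal_chars(text):
    """List (index, code point) pairs for characters XML 1.0 does not allow."""
    return [(i, "U+%04X" % ord(ch))
            for i, ch in enumerate(text)
            if not is_xml10_char(ch)]

# The Bell character (U+0007) is reported; the tab and newline are fine.
print(illegal_chars("Done\a\t\n"))
```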

Conclusion

It would be good to have a recommended approach for handling this, which
all representation guides could share.

The full archived discussion on this is available at:
http://lists.freedesktop.org/archives/xliff-tools/2005-May/000169.html

cheers,
asgeir

-------------------------------------------------------
