Subject: Re: [xliff] Fwd: Handling escaped characters in Translation Units
Hi Doug,
(cc'ing xliff-comment as I can't post to the xliff list)
On Tue, 24 May 2005 05:14, Doug Domeny wrote:
> Paul,
>
> You've raised a good question. As you've discovered, XLIFF is flexible in
> its representation of escaped characters. I don't believe there's one
> "right way" or even necessarily a "best way" to convert source to XLIFF. To
> let you know my perspective, most of the XLIFF work I've done is with XHTML
> and JavaScript rather than MS C++ or Java. The committee's HTML profile
> document (currently in draft), in fact, describes two different approaches
> to process HTML to XLIFF. Nevertheless, please consider the following
> principles:
>
> (1) XLIFF text should not include computer language-specific encoding.
>
> Ideally, in my opinion, XLIFF source text should be independent of the
> computer language from which it was extracted. For example, "&mdash;" and
> "&copy;" are HTML entities that should be converted to their equivalent
> binary value, U+2014 and U+00A9 respectively, or XML character references,
> "&#x2014;" and "&#xA9;" respectively. Similarly, \uNNNN and \xNN escaped
> characters in C-like languages should be converted to their binary value or
> XML character reference, with the binary value being preferred.
> Line-breaks, \n and \r\n, however, should be converted to an XLIFF tag with
> ctype="lb", as in <x id="1" ctype="lb" />.
>
> (2) Use standard XLIFF types, when available, rather than custom "x-"
> types. For example, use ctype="lb" for a line-break.
>
> <x id="1" ctype="lb" />
I agree with your approach here, except for the handling of escaped white space
characters. I would rather generalize: when 'xml:space' is set to 'preserve' in
a translation unit, it is less useful to use ctype='lb' for representing line
breaks, except for abstracting <br/> style elements. Similarly, in the HTML
representation guide, we don't convert real newlines to <x ctype='lb'/> in
<pre>, <textarea> and xml:space='preserve' elements, even though we have an
XLIFF element suitable for this.
There is however a difference if the source format has a concept of newlines
in addition to the escaped newlines, but most software formats do not have
this.
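
A rough sketch of that rule of thumb, in Python. The function name `newlines_to_xliff` and its `preserve_space` flag are made up for illustration and do not come from any XLIFF tool. It assumes escaped newlines have already been decoded to real newline characters, and it only illustrates the choice between keeping them and emitting <x ctype="lb"/>.

```python
from xml.sax.saxutils import escape


def newlines_to_xliff(text, preserve_space):
    """Render decoded text as <source> content (hypothetical helper).

    With xml:space="preserve", real newlines are kept untouched.
    Otherwise each newline becomes an <x ctype="lb"/> placeholder.
    """
    if preserve_space:
        return escape(text)
    # Treat CRLF and LF alike before splitting on line breaks.
    parts = text.replace("\r\n", "\n").split("\n")
    out = [escape(parts[0])]
    for i, part in enumerate(parts[1:], start=1):
        out.append('<x id="%d" ctype="lb"/>' % i)
        out.append(escape(part))
    return "".join(out)
```

With `preserve_space=True` the translator sees the line breaks directly in the text; with `False` they show up as inline placeholders instead.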
But as you say, maybe there is no right or wrong way here - I'm just trying to
find the solution which would be easiest and most user friendly for
translators, and at the same time feasible for XML processing...
Well, just my two cents :)
cheers,
asgeir
> Regards,
>
> Doug Domeny
>
> -----Original Message-----
> From: Paul Gampe [mailto:pgampe@redhat.com]
> Sent: Sunday, May 22, 2005 10:29 PM
> To: xliff@lists.oasis-open.org
> Subject: [xliff] Fwd: Handling escaped characters in Translation Units
>
> Dear TC, the xliff-tools project would greatly appreciate your insight on
> the following problem they have been discussing:
>
> ---------- Forwarded Message ----------
>
> Subject: Handling escaped characters in Translation Units
> Date: Tuesday 10 May 2005 16:52
> From: Asgeir Frimannsson <asgeirf@redhat.com>
> To: Paul Gampe <pgampe@redhat.com>
> Cc: Jim Hogan <j.hogan@qut.edu.au>
>
> Hi Paul,
>
> Here's an issue we've been discussing up and down on the xliff-tools
> mailing list, a discussion initiated by Yves Savourel last week. I believe
> this is an issue that needs a recommended approach by the XLIFF TC. Let
> me know what you think :)
>
> Handling Escaped Characters in Translation Units
>
> In source code, it is very common to use escape sequences for characters
> like newline (\u000A) and horizontal tab (\u0009).
>
> For example:
>
> printf("Please Enter the following Data:\n\
> \t- First Name\n\
> \t- Last Name\n");
>
> Here we've used the escape sequences '\n' and '\t' representing newlines
> and tabs.
>
> This fragment would be represented in PO as follows:
>
> msgid ""
> "Please Enter the following Data:\n"
> "\t- First Name\n"
> "\t- Last Name\n"
>
>
> This could be mapped to XLIFF using two different approaches:
>
> Approach A:
>
> We could preserve the escaped characters:
>
> <source>Please Enter the following Data:\n\t- First Name\n\
> \t- Last Name\n</source>
>
> We could further enhance this by abstracting the escaped characters to <ph>
> elements:
>
> <source>Please Enter the following Data:<ph id='1' ctype='lb'>\n</ph>\
> <ph id='2' ctype='x-ht'>\t</ph>- First Name<ph id='3' ctype='lb'>\n</ph>\
> <ph id='4' ctype='x-ht'>\t</ph>- Last Name<ph id='5' ctype='lb'>\n</ph>\
> </source>
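
As an illustration of the <ph> abstraction, here is a minimal Python sketch. The helper name `wrap_escapes` and the `CTYPES` table are hypothetical. It only knows about `\n` and `\t`, and leaves every other escape (such as `\\` or `\"`) as plain text.

```python
import re
from xml.sax.saxutils import escape

# Escaped characters to abstract, mapped to the ctype used in the fragment.
# 'lb' is a standard XLIFF ctype; 'x-ht' is a custom one.
CTYPES = {"n": "lb", "t": "x-ht"}

# Match any backslash pair so that "\\n" is read as an escaped backslash
# followed by a plain "n", not as a newline escape.
ESCAPE_RE = re.compile(r"\\(.)", re.DOTALL)


def wrap_escapes(raw):
    """Wrap each literal \\n or \\t in a <ph> element (hypothetical helper)."""
    out, pos, ph_id = [], 0, 0
    for m in ESCAPE_RE.finditer(raw):
        ctype = CTYPES.get(m.group(1))
        if ctype is None:
            continue  # not an escape we abstract; keep it as text
        ph_id += 1
        out.append(escape(raw[pos:m.start()]))
        out.append("<ph id='%d' ctype='%s'>%s</ph>" % (ph_id, ctype, m.group()))
        pos = m.end()
    out.append(escape(raw[pos:]))
    return "".join(out)
```

The original escape text is kept inside each <ph>, so back-conversion can simply strip the tags.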
>
> Issue A-1: If using this approach, would filters have to discard real
> newline characters (\u000A) in translation units? How would this affect TM
> lookups?
>
> Issue A-2: How would editors handle this approach? For software messages,
> they would have to disable entering newlines, and in some way format the
> message according to the values of the ctype attributes? (Not having visual
> indicators for e.g. newlines would not be very translator-friendly.)
>
> Issue A-3: Where do we stop? In Java .properties files we usually add a
> "\u0020" to indicate a leading space. For example:
>
> my_message = \u0020Some Text
>
> Should this be represented as:
>
> <source>\u0020Some Text</source>
> or
> <source> Some Text</source>
> ?
>
> Approach B:
>
> Many of the escaped characters have native Unicode values we could use in
> XLIFF. We could replace '\t' with a real TAB (\u0009) character, and
> similarly for other escape sequences, giving us the following XLIFF fragment:
>
> <source>Please Enter the following Data:
> - First Name
> - Last Name
> </source>
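
A minimal Python sketch of the decoding step behind Approach B. The name `decode_escapes` and the `SIMPLE` table are assumptions for illustration. It handles a handful of C-like escapes plus fixed-width `\uNNNN` and `\xNN`, which is a simplification of what real source formats allow, and it leaves unknown escapes untouched.

```python
import re

# Single-character escapes this sketch understands (not an exhaustive list).
SIMPLE = {"n": "\n", "t": "\t", "r": "\r", "\\": "\\", '"': '"'}

# \uNNNN, \xNN, or a backslash followed by any single character.
ESC_RE = re.compile(r"\\(u[0-9A-Fa-f]{4}|x[0-9A-Fa-f]{2}|.)", re.DOTALL)


def decode_escapes(raw):
    """Replace escape sequences with the characters they stand for."""
    def repl(m):
        body = m.group(1)
        if body[0] in "ux" and len(body) > 1:
            return chr(int(body[1:], 16))
        # Unknown escapes are returned unchanged.
        return SIMPLE.get(body, m.group(0))
    return ESC_RE.sub(repl, raw)
```

For the .properties example, `decode_escapes(r"\u0020Some Text")` yields the text with a real leading space.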
>
> Issue B-1: DOS/Windows use "\r\n", while UNIX (and most programming
> languages) use "\n" as line endings. How would we know on back-conversion
> whether to write "\n" or "\r\n" in the translated source file?
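
One possible answer, sketched in Python: remember the line-ending style at extraction time and reapply it on back-conversion. The function names are hypothetical, and where the remembered value would be stored (the skeleton file, a custom attribute, or somewhere else) is left open here.

```python
def detect_line_ending(text):
    """Return CRLF if the text contains any CRLF pair, otherwise LF."""
    return "\r\n" if "\r\n" in text else "\n"


def extract(text):
    """Normalize line breaks to LF for XLIFF and return the original style."""
    ending = detect_line_ending(text)
    return text.replace("\r\n", "\n"), ending


def merge(translated, ending):
    """Write the translated text back using the remembered line ending."""
    # Normalize first in case an editor introduced CRLF on its own.
    normalized = translated.replace("\r\n", "\n")
    return normalized.replace("\n", ending)
```

This assumes one consistent style per source file; a file that mixes both styles would need per-string bookkeeping.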
>
> Issue B-2: There are some escape sequences used in PO (and probably other
> source formats?) that XML does not allow. For example "\a" (\u0007,
> the Alert or Bell control character). How should these be handled? (Yes,
> asking the developer what that character is doing in a localised message is
> a good start.)
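
A filter could at least detect such characters before writing XLIFF. The Python sketch below checks text against the character ranges XML 1.0 allows. The helper names are made up, and what to do with a hit (for example, keeping the escape inside a <ph>) is a separate decision this sketch does not make.

```python
def is_xml10_char(ch):
    """True if ch is allowed by the XML 1.0 Char production."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def find_invalid_xml_chars(text):
    """List (index, code point) for every character XML 1.0 forbids."""
    return [
        (i, "U+%04X" % ord(ch))
        for i, ch in enumerate(text)
        if not is_xml10_char(ch)
    ]
```

Tab, line feed and carriage return pass the check; other control characters such as U+0007 are reported.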
>
> Conclusion
>
> It would be good to have a recommended approach for handling this, which
> all representation guides could share.
>
> The full archived discussion on this is available at:
> http://lists.freedesktop.org/archives/xliff-tools/2005-May/000169.html
>
> cheers,
> asgeir
>
> -------------------------------------------------------
>