Subject: Re: [xliff] Fwd: Handling escaped characters in Translation Units
Hi Doug,

(cc'ing xliff-comment as I can't post to the xliff list)

On Tue, 24 May 2005 05:14, Doug Domeny wrote:
> Paul,
>
> You've raised a good question. As you've discovered, XLIFF is flexible in
> its representation of escaped characters. I don't believe there's one
> "right way" or even necessarily a "best way" to convert source to XLIFF.
> To let you know my perspective, most of the XLIFF work I've done is with
> XHTML and JavaScript rather than MS C++ or Java. The committee's HTML
> profile document (currently in draft), in fact, describes two different
> approaches to process HTML to XLIFF. Nevertheless, please consider the
> following principles:
>
> (1) XLIFF text should not include computer language-specific encoding.
>
> Ideally, in my opinion, XLIFF source text should be independent of the
> computer language from which it was extracted. For example, "&mdash;" and
> "&copy;" are HTML entities that should be converted to their equivalent
> binary value, U+2014 and U+00A9 respectively, or XML character references,
> "&#x2014;" and "&#xA9;" respectively. Similarly, \uNNNN and \xNN escaped
> characters in C-like languages should be converted to their binary value
> or XML character reference, with the binary value being preferred.
> Line-breaks, \n and \r\n, however, should be converted to an XLIFF tag
> with ctype="lb", as in <x id="1" ctype="lb" />.
>
> (2) Use standard XLIFF types, when available, rather than custom "x-"
> types. For example, use ctype="lb" for a line-break.
>
> <x id="1" ctype="lb" />

I agree with your approach here, except for the handling of escaped
whitespace characters. I would rather generalize that when 'xml:space' is
set to 'preserve' in a translation unit, it is less useful to use
ctype='lb' for representing line breaks, except for abstracting <br/>
style elements. Similarly, in the HTML Representation Guide, we don't
convert real newlines to <x ctype='lb'/> in <pre>, <textarea> and
xml:space='preserve' elements, even though we have an XLIFF element
suitable for this.

There is however a difference if the source format has a concept of
newlines in addition to the escaped newlines, but most software formats do
not have this.

But as you say, maybe there is no right or wrong way here. I'm just trying
to find the solution which would be easiest and most user-friendly for
translators, and at the same time feasible for XML processing...

Well, just my two cents :)

cheers,
asgeir

> Regards,
>
> Doug Domeny
>
> -----Original Message-----
> From: Paul Gampe [mailto:pgampe@redhat.com]
> Sent: Sunday, May 22, 2005 10:29 PM
> To: xliff@lists.oasis-open.org
> Subject: [xliff] Fwd: Handling escaped characters in Translation Units
>
> Dear TC, the xliff-tools project would greatly appreciate your insight on
> the following problem they have been discussing:
>
> ---------- Forwarded Message ----------
>
> Subject: Handling escaped characters in Translation Units
> Date: Tuesday 10 May 2005 16:52
> From: Asgeir Frimannsson <asgeirf@redhat.com>
> To: Paul Gampe <pgampe@redhat.com>
> Cc: Jim Hogan <j.hogan@qut.edu.au>
>
> Hi Paul,
>
> Here's an issue we've been discussing up and down on the xliff-tools
> mailing list, a discussion initiated by Yves Savourel last week. I believe
> this is an issue that needs a recommended approach by the XLIFF TC. Let
> me know what you think :)
>
> Handling Escaped Characters in Translation Units
>
> In source code, it is very common to use escape sequences for characters
> like newline (\u000A) and horizontal tab (\u0009).
>
> For example:
>
> printf("Please Enter the following Data:\n\
> \t- First Name\n\
> \t- Last Name\n");
>
> Here we've used the escape sequences '\n' and '\t' to represent newlines
> and tabs.
>
> This fragment would be represented in PO as follows:
>
> msgid ""
> "Please Enter the following Data:\n"
> "\t- First Name\n"
> "\t- Last Name\n"
>
> This could be mapped to XLIFF using two different approaches:
>
> Approach A:
>
> We could preserve the escaped characters:
>
> <source>Please Enter the following Data:\n\t- First Name\n\
> \t- Last Name\n</source>
>
> We could further enhance this by abstracting the escaped characters to
> <ph> elements:
>
> <source>Please Enter the following Data:<ph id='1' ctype='lb'>\n</ph>\
> <ph id='2' ctype='x-ht'>\t</ph>- First Name<ph id='3' ctype='lb'>\n</ph>\
> <ph id='4' ctype='x-ht'>\t</ph>- Last Name<ph id='5' ctype='lb'>\n</ph>\
> </source>
>
> Issue A-1: If using this approach, would filters have to discard real
> newline characters (\u000A) in translation units? How would this affect
> TM lookups?
>
> Issue A-2: How would editors handle this approach? For software messages,
> they would have to disable entering newlines, and in some way format the
> message according to the value of the ctype attributes? (Not having visual
> indicators for e.g. newlines would not be very translator-friendly.)
>
> Issue A-3: Where do we stop? In Java .properties files we usually add a
> "\u0020" to indicate a leading space. For example:
>
> my_message = \u0020Some Text
>
> Should this be represented as:
>
> <source>\u0020Some Text</source>
> or
> <source> Some Text</source>
> ?
>
> Approach B:
>
> Many of the escaped characters have native Unicode values we could use in
> XLIFF. We could replace '\t' with a real TAB (\u0009) character, and
> similarly for other escape sequences, giving us the following XLIFF
> fragment:
>
> <source>Please Enter the following Data:
> 	- First Name
> 	- Last Name
> </source>
>
> Issue B-1: DOS/Windows use "\r\n", while UNIX (and most programming
> languages) use "\n" as line endings. How would we know on back-conversion
> whether to write "\n" or "\r\n" in the translated source file?
>
> Issue B-2: There are some escape sequences used in PO (and probably other
> source formats?) for characters that XML does not allow. For example
> "\a" (\u0007, the Alert or Bell control character). How should these be
> handled? (Yes, asking the developer what that character is doing in a
> localised message is a good start.)
>
> Conclusion
>
> It would be good to have a recommended approach for handling this, which
> all representation guides could share.
>
> The full archived discussion on this is available at:
> http://lists.freedesktop.org/archives/xliff-tools/2005-May/000169.html
>
> cheers,
> asgeir
>
> -------------------------------------------------------
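
For anyone who wants to experiment with the two mappings discussed in this thread, here is a minimal Python sketch, not a reference implementation. With native=False it follows Approach A (each escape sequence is kept and wrapped in a <ph>); with native=True it follows Approach B (each escape is replaced by the character it denotes). The function name to_source, the native flag, the small escape table, and the fallback of wrapping XML-disallowed characters such as \u0007 in an untyped <ph> are all assumptions made for this illustration; none of them come from the XLIFF specification or from the xliff-tools code. The ctype values 'lb' and 'x-ht' are the ones used in the examples above.

```python
import re
from xml.sax.saxutils import escape

# Escapes covered by this sketch (assumption: a real filter would cover
# the full escape set of its source format).
SIMPLE_ESCAPES = {"n": "\n", "t": "\t", "r": "\r", "a": "\a",
                  "\\": "\\", '"': '"'}

# ctype values taken from the examples in this thread.
CTYPES = {"\n": "lb", "\t": "x-ht"}

# A backslash followed by uNNNN, xNN, or any single character.
ESCAPE_RE = re.compile(r"\\(u[0-9A-Fa-f]{4}|x[0-9A-Fa-f]{2}|.)", re.DOTALL)


def is_xml10_char(ch):
    """True if ch is allowed by the XML 1.0 Char production."""
    cp = ord(ch)
    return (cp in (0x9, 0xA, 0xD) or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD or 0x10000 <= cp <= 0x10FFFF)


def decode_escape(body):
    """Return the character for an escape body, or None if unknown."""
    if body[0] in "ux" and len(body) > 1:
        return chr(int(body[1:], 16))
    return SIMPLE_ESCAPES.get(body)


def to_source(escaped, native):
    """Map a string containing C-style escapes to an XLIFF <source>.

    native=False: Approach A, keep each escape inside a <ph>.
    native=True:  Approach B, emit the real character; characters that
                  XML 1.0 forbids fall back to a <ph> (an assumption).
    """
    parts, pos, next_id = [], 0, 1
    for m in ESCAPE_RE.finditer(escaped):
        parts.append(escape(escaped[pos:m.start()]))
        pos = m.end()
        ch = decode_escape(m.group(1))
        if ch is None:
            # Unknown escape: keep it literally.
            parts.append(escape(m.group(0)))
        elif native and is_xml10_char(ch):
            parts.append(escape(ch))
        else:
            ctype = CTYPES.get(ch)
            attr = f" ctype='{ctype}'" if ctype else ""
            parts.append(f"<ph id='{next_id}'{attr}>{escape(m.group(0))}</ph>")
            next_id += 1
    parts.append(escape(escaped[pos:]))
    return "<source>" + "".join(parts) + "</source>"


if __name__ == "__main__":
    msg = r"Please Enter the following Data:\n\t- First Name\n\t- Last Name\n"
    print(to_source(msg, native=False))  # Approach A
    print(to_source(msg, native=True))   # Approach B
```

The sketch does nothing about Issue B-1: once "\r\n" and "\n" have been turned into real characters, the original line-ending convention would have to be recorded somewhere outside the <source> text for back-conversion to restore it.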