Subject: RE: Preferred method of representing invalid XML chars in <source>?
Kristian,

Thanks for the quote from the XML spec. I learned something new today! :-)

Regarding representing the invalid XML chars, I recommend using only attributes and not text (PCDATA) on the principle that the XML document should be readable as text only. That is, if all the tags were removed, the remaining text should make sense. It's a philosophy of choosing between using attributes or elements.

A possible exception to this would be if the character had an alternate representation in the native computer language (e.g., C or Java). For example, <ph id="a920d0">\x10</ph>. But this may not adequately hide the character code from the translator.

The choice between <g> and <ph> is in the XLIFF 1.1 spec (and is the same for XLIFF 1.0).

"This element [<source>] may contain inline elements that either remove the codes from the source (<g>, <x/>, <bx/>, <ex/>) or that mask off codes left inline (<bpt>, <ept>, <sub>, <it>, <ph>)."
http://www.oasis-open.org/committees/xliff/documents/cs-xliff-core-1.1-20031031.htm#Struct_Body

User-defined 'ctype' values must start with "x-".

Additionally, I recommend using a placeholder tag for each individual character rather than a comma-separated list of values.

<trans-unit id="a920cf">
  <source xml:lang="en">Three char 8's follow<x id="a920d0" ctype="x-chr8" clone="yes"/><x id="a920d1" ctype="x-chr8" clone="yes"/><x id="a920d2" ctype="x-chr8" clone="yes"/> then the text continues</source>
</trans-unit>

Unless, that is, the code sequence is critical, in which case, you may wish to represent it as a separate custom 'ctype'.

<trans-unit id="a920cf">
  <source xml:lang="en">Special sequence follows<x id="a920d0" ctype="x-spcseq" clone="yes"/> then the text continues</source>
</trans-unit>

Regards,

Doug Domeny
Software Analyst

Ektron, Inc.
+1 603 594-0249 x212
http://www.ektron.com

-----Original Message-----
From: Kristian Walsh [mailto:listreader@byteform.com]
Sent: Monday, July 26, 2004 10:14 AM
To: xliff-comment@lists.oasis-open.org
Cc: doug@ektron.com
Subject: Re: Preferred method of representing invalid XML chars in <source>?

Hi Doug,

I just realised I made two mistakes in my mail.

The first mistake was forgetting that U+000D, U+000A and U+0009 are the only valid XML characters with codepoints below U+0020. If you supply any other codepoint (e.g. ) your XML parser should report an error.

{ from the XML spec, §2.2 "Characters" }

[2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */

-- end quote --

(that's their comment in there, and it is technically incorrect, as Unicode recognises codepoints below U+0020 to be "characters")

The second mistake was getting the codepoint for TAB wrong (a perennial "misspelling" of mine) :-)

My problem is that I have source data with "unusual" codepoints. Particularly U+0000, and U+0010 to U+001F (valid Unicode, but not valid in XML).

--
Kristian

On 26 Jul 2004, at 14:08, Doug wrote:

> Kristian Walsh,
>
> I've not heard of invalid characters in an XML document. Usually,
> non-printable characters are represented by their character reference.
> I frequently use ' ' to represent CR/LF.
>
> For example,
>
> <trans-unit id="a920cf">
> <source xml:lang="en">Three tabs follow then the text
> continues</source>
> </trans-unit>
>
> Regards,
>
> Doug Domeny
> Software Analyst
>
> Ektron, Inc.
> +1 603 594-0249 x212
> http://www.ektron.com
>
>
>
> -----Original Message-----
> From: Kristian Walsh [mailto:listreader@byteform.com]
> Sent: Monday, July 26, 2004 6:28 AM
> To: xliff-comment@lists.oasis-open.org
> Subject: Preferred method of representing invalid XML chars in
> <source>?
>
>
> Hi,
>
> I am developing an application which creates XLIFF 1.0 documents from
> source data. Unfortunately, sometimes this source data contains
> character codes below U+0020, which are invalid in an XML document.
>
> I am unsure of the "canonical" way to deal with this in XLIFF 1.0
> (version 1.1 is not an option for this application); as far as I can
> see, <x/>, <g> and <ph> can all be used for this purpose, as below:
>
>
> Form 1: <x>
>
> <trans-unit id="a920cf">
> <source xml:lang="en">Three tabs follow<x id="a920d0"
> ctype="character" clone="yes" ts="MyTool:chars=0008,0008,0008"/> then
> the text continues</source>
> </trans-unit>
>
>
> Form 2: <g>
>
> <trans-unit id="a920cf">
> <source xml:lang="en">Three tabs follow<g id="a920d0"
> ctype="character" clone="yes" ts="MyTool:chars">0008,0008,0008</g> then
> the text continues</source>
> </trans-unit>
>
>
> Form 3: <ph>
>
> <trans-unit id="a920cf">
> <source xml:lang="en">Three tabs follow<ph id="a920d0"
> ctype="character" ts="MyTool:chars">0008,0008,0008</ph> then the text
> continues</source>
> </trans-unit>
>
>
> So my two questions are:
>
> 1. Which of the above forms is preferred in XLIFF 1.0 for
> representing non-XML characters inside source (and/or target) data?
>
> 2. Is there a standard ctype attribute value for "raw character
> codes"?
>
> Any ideas would be greatly appreciated,
> --
> Kristian
>
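
As an illustration of the one-placeholder-per-character form recommended in the reply, a minimal Python sketch might look like the following. The character class is taken from the Char production of the XML spec quoted in the thread. Everything else is an assumption made for the example: the function name to_xliff_source, the "c1", "c2", ... id scheme, and the decimal "x-chr<N>" ctype naming (modelled on the "x-chr8" value in the reply) are hypothetical and not defined by XLIFF.

```python
import re
from xml.sax.saxutils import escape

# Anything NOT allowed by the XML 1.0 Char production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML_CHAR = re.compile(
    r"[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def to_xliff_source(text, id_prefix="c"):
    """Build <source> content for `text`, one <x/> per invalid XML character.

    Hypothetical helper: the id scheme and the "x-chr<N>" ctype values are
    illustrative choices, not something the XLIFF spec prescribes.
    """
    parts = []
    last = 0
    count = 0
    for match in _INVALID_XML_CHAR.finditer(text):
        # Ordinary text before the invalid character: escape &, < and >.
        parts.append(escape(text[last:match.start()]))
        count += 1
        code = ord(match.group())
        # User-defined ctype values must start with "x-".
        parts.append(
            '<x id="%s%d" ctype="x-chr%d" clone="yes"/>' % (id_prefix, count, code)
        )
        last = match.end()
    # Remaining text after the last invalid character.
    parts.append(escape(text[last:]))
    return "".join(parts)


if __name__ == "__main__":
    print(to_xliff_source("Three char 8's follow\x08\x08\x08 then the text continues"))
```

Keeping one placeholder per character means a tool reading the file back only needs the ctype of each <x/> to restore the original codepoint; a critical multi-character sequence would instead need its own custom ctype, as described in the reply.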