Subject: Re: Preferred method of representing invalid XML chars in <source>?
Hi Doug,

I just realised I made two mistakes in my mail.

The first mistake was forgetting that U+000D, U+000A and U+0009 are the only valid XML characters with codepoints below U+0020. If you supply any other codepoint (e.g. ) your XML parser should report an error.

{ from the XML spec, §2.2 "Characters" }

[2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */

-- end quote --

(that's their comment in there, and it is technically incorrect, as Unicode recognises codepoints below U+0020 to be "characters")

The second mistake was getting the codepoint for TAB wrong (a perennial "misspelling" of mine) :-)

My problem is that I have source data with "unusual" codepoints. Particularly U+0000, and U+0010 to U+001F (valid Unicode, but not valid in XML).

--
Kristian

On 26 Jul 2004, at 14:08, Doug wrote:

> Kristian Walsh,
>
> I've not heard of invalid characters in an XML document. Usually,
> non-printable characters are represented by their character reference.
> I frequently use ' ' to represent CR/LF.
>
> For example,
>
> <trans-unit id="a920cf">
> <source xml:lang="en">Three tabs follow then the text
> continues</source>
> </trans-unit>
>
> Regards,
>
> Doug Domeny
> Software Analyst
>
> Ektron, Inc.
> +1 603 594-0249 x212
> http://www.ektron.com
>
>
>
> -----Original Message-----
> From: Kristian Walsh [mailto:listreader@byteform.com]
> Sent: Monday, July 26, 2004 6:28 AM
> To: xliff-comment@lists.oasis-open.org
> Subject: Preferred method of representing invalid XML chars in
> <source>?
>
>
> Hi,
>
> I am developing an application which creates XLIFF 1.0 documents from
> source data. Unfortunately, sometimes this source data contains
> character codes below U+0020, which are invalid in an XML document.
>
> I am unsure of the "canonical" way to deal with this in XLIFF 1.0
> (version 1.1 is not an option for this application); as far as I can
> see, <x/>, <g/> and <ph> can all be used for this purpose, as below:
>
>
> Form 1: <x>
>
> <trans-unit id="a920cf">
> <source xml:lang="en">Three tabs follow<x id="a920d0"
> ctype="character" clone="yes" ts="MyTool:chars=0008,0008,0008"> then
> the text continues</source>
> </trans-unit>
>
>
> Form 2: <g>
>
> <trans-unit id="a920cf">
> <source xml:lang="en">Three tabs follow<g id="a920d0"
> ctype="character" clone="yes" ts="MyTool:chars">0008,0008,0008</g> then
> the text continues</source>
> </trans-unit>
>
>
> Form 3: <ph>
>
> <trans-unit id="a920cf">
> <source xml:lang="en">Three tabs follow<ph id="a920d0"
> ctype="character" ts="MyTool:chars">0008,0008,0008</ph> then the text
> continues</source>
> </trans-unit>
>
>
> So my two questions are:
>
> 1. Which of the above forms is preferred in XLIFF 1.0 for representing
> non-XML characters inside source (and/or target) data?
>
> 2. Is there a standard ctype attribute value for "raw character
> codes"?
>
> Any ideas would be greatly appreciated,
> --
> Kristian
>
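
For anyone who wants to check their own source data against production [2] from the XML spec quoted above, here is a minimal Python sketch. The function names is_xml_char and invalid_xml_codepoints are made up for illustration and are not part of XLIFF or of any library. The sketch only reports the offending codepoints; it does not settle which of the three forms should be used to represent them.

```python
def is_xml_char(cp: int) -> bool:
    """Return True if code point cp matches the XML 1.0 Char production [2]."""
    return (
        cp in (0x9, 0xA, 0xD)          # TAB, LF, CR: the only valid ones below U+0020
        or 0x20 <= cp <= 0xD7FF        # everything up to the surrogate blocks
        or 0xE000 <= cp <= 0xFFFD      # after the surrogates, excluding FFFE and FFFF
        or 0x10000 <= cp <= 0x10FFFF   # supplementary planes
    )


def invalid_xml_codepoints(text: str) -> list[tuple[int, str]]:
    """List (index, 'U+XXXX') for every character that XML 1.0 does not allow."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if not is_xml_char(ord(ch))
    ]


if __name__ == "__main__":
    # Hypothetical sample: three U+0008 characters, as in the ts attribute values above.
    sample = "Three tabs follow\x08\x08\x08 then the text continues"
    for index, codepoint in invalid_xml_codepoints(sample):
        print(index, codepoint)
```

Running the check before the XLIFF document is written means the application decides how to encode such characters, rather than discovering them later as a parser error.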