Subject: Re: Preferred method of representing invalid XML chars in <source>?
Hi Doug,
I just realised I made two mistakes in my mail.
The first mistake was forgetting that U+000D, U+000A and U+0009 are the
only valid XML characters with codepoints below U+0020.
If you supply any other codepoint below U+0020 (e.g. as the character
reference &#x8;), your XML parser should report an error.
{ from the XML spec, §2.2 "Characters" }
[2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD]
| [#x10000-#x10FFFF]
/* any Unicode character, excluding the surrogate blocks, FFFE, and
FFFF. */
-- end quote --
(that's their comment in there, and it is technically incorrect, as
Unicode recognises codepoints below U+0020 as "characters")
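
To make the production above concrete, here is a minimal sketch in Python (my choice of language for illustration; nothing in XLIFF requires it). The helper names is_xml_char and parser_accepts are hypothetical. The first transcribes production [2] directly; the second asks Python's standard-library parser whether it accepts a character reference to the same codepoint, which is only a cross-check against one parser, not a statement about every XML tool.

```python
import xml.etree.ElementTree as ET


def is_xml_char(cp: int) -> bool:
    """Return True if codepoint cp matches production [2] Char of XML 1.0."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def parser_accepts(cp: int) -> bool:
    """Check whether the stdlib parser accepts a character reference to cp."""
    try:
        ET.fromstring("<source>&#x{:X};</source>".format(cp))
        return True
    except ET.ParseError:
        return False


# Compare the grammar with the parser for a few codepoints around U+0020.
for cp in (0x0, 0x8, 0x9, 0xA, 0xD, 0x1F, 0x20):
    print("U+{:04X}".format(cp), is_xml_char(cp), parser_accepts(cp))
```

For the low codepoints shown, the two columns should agree: only TAB, LF, CR and U+0020 upwards are accepted.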
The second mistake was getting the codepoint for TAB wrong (a perennial
"misspelling" of mine) :-) I wrote 0008, which is BACKSPACE; TAB is
U+0009.
My problem is that I have source data with "unusual" codepoints,
particularly U+0000 and U+0010 to U+001F (valid Unicode, but not valid
in XML).
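
As a rough illustration of the first step any workaround needs, the sketch below (Python, with hypothetical helper names is_xml_char and split_invalid_runs) splits source text into runs that are safe to write as XML text and runs that are not. What to do with the unsafe runs (an inline element, a tool-specific attribute, or something else) is left open here, since that is exactly the question being asked.

```python
from typing import List, Tuple


def is_xml_char(cp: int) -> bool:
    """Return True if codepoint cp matches production [2] Char of XML 1.0."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def split_invalid_runs(text: str) -> List[Tuple[bool, str]]:
    """Split text into (is_valid, run) pieces, preserving the original order.

    Consecutive characters with the same validity are merged into one run,
    so each invalid run could later be replaced by a single placeholder.
    """
    pieces: List[Tuple[bool, str]] = []
    for ch in text:
        ok = is_xml_char(ord(ch))
        if pieces and pieces[-1][0] == ok:
            pieces[-1] = (ok, pieces[-1][1] + ch)
        else:
            pieces.append((ok, ch))
    return pieces


# Example: U+0000 and U+0010 are split out; the TAB stays with the valid text.
for ok, run in split_invalid_runs("ab\x00\x10cd\tz"):
    print(ok, [format(ord(c), "04X") for c in run])
```

Joining the runs back together in order reproduces the input exactly, so no source data is lost by the split.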
--
Kristian
On 26 Jul 2004, at 14:08, Doug wrote:
> Kristian Walsh,
>
> I've not heard of invalid characters in an XML document. Usually,
> non-printable characters are represented by their character reference.
> I frequently use '&#13;&#10;' to represent CR/LF.
>
> For example,
>
> <trans-unit id="a920cf">
> <source xml:lang="en">Three tabs follow&#9;&#9;&#9; then the text
> continues</source>
> </trans-unit>
>
> Regards,
>
> Doug Domeny
> Software Analyst
>
> Ektron, Inc.
> +1 603 594-0249 x212
> http://www.ektron.com
>
>
>
> -----Original Message-----
> From: Kristian Walsh [mailto:listreader@byteform.com]
> Sent: Monday, July 26, 2004 6:28 AM
> To: xliff-comment@lists.oasis-open.org
> Subject: Preferred method of representing invalid XML chars in
> <source>?
>
>
> Hi,
>
> I am developing an application which creates XLIFF 1.0 documents from
> source data. Unfortunately, sometimes this source data contains
> character codes below U+0020, which are invalid in an XML document.
>
> I am unsure of the "canonical" way to deal with this in XLIFF 1.0
> (version 1.1 is not an option for this application); as far as I can
> see, <x/>, <g/> and <ph> can all be used for this purpose, as below:
>
>
> Form 1: <x>
>
> <trans-unit id="a920cf">
> <source xml:lang="en">Three tabs follow<x id="a920d0"
> ctype="character" clone="yes" ts="MyTool:chars=0008,0008,0008"/> then
> the text continues</source>
> </trans-unit>
>
>
> Form 2: <g>
>
> <trans-unit id="a920cf">
> <source xml:lang="en">Three tabs follow<g id="a920d0"
> ctype="character" clone="yes" ts="MyTool:chars">0008,0008,0008</g> then
> the text continues</source>
> </trans-unit>
>
>
> Form 3: <ph>
>
> <trans-unit id="a920cf">
> <source xml:lang="en">Three tabs follow<ph id="a920d0"
> ctype="character" ts="MyTool:chars">0008,0008,0008</ph> then the text
> continues</source>
> </trans-unit>
>
>
> So my two questions are:
>
> 1. Which of the above forms is preferred in XLIFF 1.0 for representing
> non-XML characters inside source (and/or target) data?
>
> 2. Is there a standard ctype attribute value for "raw character codes"?
>
> Any ideas would be greatly appreciated,
> --
> Kristian
>