xliff-comment message

Subject: Re: Preferred method of representing invalid XML chars in <source>?

From: Kristian Walsh <listreader@byteform.com>
To: xliff-comment@lists.oasis-open.org
Date: Mon, 26 Jul 2004 15:13:32 +0100

Hi Doug,

I just realised I made two mistakes in my mail.

The first mistake was forgetting that U+000D, U+000A and U+0009 are the 
only valid XML characters with codepoints below U+0020.
If you supply any other codepoint in that range, even as a character 
reference (e.g. &#07;), a conforming XML parser must report a 
well-formedness error.
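
A quick way to check this behaviour is sketched below in Python, using the 
standard library's expat-based parser. The element name "s" and the helper 
name parses() are placeholders of mine, not anything from XLIFF.

```python
import xml.etree.ElementTree as ET

def parses(doc: str) -> bool:
    """Return True if the standard-library XML parser accepts `doc`."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False

# TAB, LF and CR are allowed when written as character references...
print(parses("<s>a&#9;b&#10;c&#13;d</s>"))
# ...but BEL (U+0007) is rejected, even though it is only a reference.
print(parses("<s>a&#7;b</s>"))
```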

{ from the XML spec, §2.2 "Characters" }
[2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD]
             | [#x10000-#x10FFFF]
    /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
-- end quote --
(That comment is theirs, and it is technically incorrect: Unicode 
recognises the codepoints below U+0020 as "characters" too, yet the 
production excludes all but three of them.)
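
For anyone who wants to test codepoints against that production, here is a 
minimal Python sketch. The function name is_xml_char() is my own; the ranges 
are copied straight from production [2] above.

```python
def is_xml_char(cp: int) -> bool:
    """True if codepoint `cp` matches production [2] Char of XML 1.0."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )

# Codepoints below U+0020 that the production accepts.
allowed_controls = [f"U+{cp:04X}" for cp in range(0x20) if is_xml_char(cp)]
print(allowed_controls)  # ['U+0009', 'U+000A', 'U+000D']
```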

The second mistake was getting the codepoint for TAB wrong (a perennial 
"misspelling" of mine) :-) TAB is U+0009; the 0008 I used in my examples 
is BACKSPACE.

My problem is that I have source data with "unusual" codepoints, 
particularly U+0000 and U+0010 to U+001F (valid Unicode, but not valid 
in XML).
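
Whichever inline element turns out to be preferred, the first step is the 
same: separate the XML-safe text from the runs of codepoints that cannot 
appear in the document. A rough Python sketch of that step follows. The 
names split_runs() and is_xml_char() are hypothetical, and the 
comma-separated four-digit hex format simply mirrors the "0008,0008,0008" 
payload from my earlier examples; it is not anything XLIFF defines.

```python
from itertools import groupby

def is_xml_char(ch: str) -> bool:
    """True if `ch` matches production [2] Char of XML 1.0."""
    cp = ord(ch)
    return (cp in (0x9, 0xA, 0xD) or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD or 0x10000 <= cp <= 0x10FFFF)

def split_runs(text: str):
    """Split `text` into ("text", str) and ("codes", "hhhh,hhhh") runs."""
    runs = []
    for valid, group in groupby(text, key=is_xml_char):
        chars = "".join(group)
        if valid:
            runs.append(("text", chars))
        else:
            # Hex-encode each codepoint that XML cannot carry directly.
            runs.append(("codes", ",".join(f"{ord(c):04X}" for c in chars)))
    return runs

print(split_runs("Three backspaces follow\x08\x08\x08 then the text continues"))
```

Each ("codes", ...) run would then be written out as the payload of 
whichever placeholder element is chosen, and each ("text", ...) run as 
ordinary escaped character data.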
--
Kristian

On 26 Jul 2004, at 14:08, Doug wrote:

> Kristian Walsh,
>
> I've not heard of invalid characters in an XML document. Usually,
> non-printable characters are represented by their character reference. I
> frequently use '&#13;&#10;' to represent CR/LF.
>
> For example,
>
> <trans-unit id="a920cf">
> 	<source xml:lang="en">Three tabs follow&#08;&#08;&#08; then the text
> continues</source>
> </trans-unit>
>
> Regards,
>
> Doug Domeny
> Software Analyst
>
> Ektron, Inc.
> +1 603 594-0249 x212
> http://www.ektron.com
>
>
>
> -----Original Message-----
> From: Kristian Walsh [mailto:listreader@byteform.com]
> Sent: Monday, July 26, 2004 6:28 AM
> To: xliff-comment@lists.oasis-open.org
> Subject: Preferred method of representing invalid XML chars in <source>?
>
>
> Hi,
>
> I am developing an application which creates XLIFF 1.0 documents from
> source data. Unfortunately, sometimes this source data contains
> character codes below U+0020, which are invalid in an XML document.
>
> I am unsure of the "canonical" way to deal with this in XLIFF 1.0
> (version 1.1 is not an option for this application); as far as I can
> see, <x/>, <g/> and <ph> can all be used for this purpose, as below:
>
>
> Form 1: <x>
>
> <trans-unit id="a920cf">
> 	<source xml:lang="en">Three tabs follow<x id="a920d0"
> ctype="character" clone="yes" ts="MyTool:chars=0008,0008,0008"/> then
> the text continues</source>
> </trans-unit>
>
>
> Form 2: <g>
>
> <trans-unit id="a920cf">
> 	<source xml:lang="en">Three tabs follow<g id="a920d0"
> ctype="character" clone="yes" ts="MyTool:chars">0008,0008,0008</g> then
> the text continues</source>
> </trans-unit>
>
>
> Form 3: <ph>
>
> <trans-unit id="a920cf">
> 	<source xml:lang="en">Three tabs follow<ph id="a920d0"
> ctype="character" ts="MyTool:chars">0008,0008,0008</ph> then the text
> continues</source>
> </trans-unit>
>
>
> So my two questions are:
>
>   1. Which of the above forms is preferred in XLIFF 1.0 for representing
> non-XML characters inside source (and/or target) data?
>
>   2. Is there a standard ctype attribute value for "raw character codes"?
>
> Any ideas would be greatly appreciated,
> --
> Kristian
>
