xliff-comment message

Subject: RE: Preferred method of representing invalid XML chars in <source>?
From: "Doug" <doug@ektron.com>
To: "Kristian Walsh" <listreader@byteform.com>,<xliff-comment@lists.oasis-open.org>
Date: Mon, 26 Jul 2004 11:03:04 -0400

Kristian,

Thanks for the quote from the XML spec. I learned something new today! :-)

Regarding representing the invalid XML chars, I recommend putting the
character codes in attributes only, not in element text (PCDATA), on the
principle that an XML document should still read sensibly as plain text:
if all the tags were removed, the remaining text should make sense. This is
a common rule of thumb for choosing between attributes and elements. A
possible exception would be a character that has an alternate
representation in the native computer language (e.g., C or Java). For
example, <ph id="a920d0">\x10</ph>. But this may not adequately hide the
character code from the translator.

The choice between <g> and <ph> is covered in the XLIFF 1.1 spec (and is the
same for XLIFF 1.0):

"This element [<source>] may contain inline elements that either remove the
codes from the source (<g>, <x/>, <bx/>, <ex/>) or that mask off codes left
inline (<bpt>, <ept>, <sub>, <it>, <ph>)."
http://www.oasis-open.org/committees/xliff/documents/cs-xliff-core-1.1-20031031.htm#Struct_Body

User-defined 'ctype' values must start with "x-". Additionally, I recommend
using a placeholder tag for each individual character rather than a
comma-separated list of values.

<trans-unit id="a920cf">
	<source xml:lang="en">Three char 8's follow<x id="a920d0" ctype="x-chr8"
clone="yes"/><x id="a920d1" ctype="x-chr8" clone="yes"/><x id="a920d2"
ctype="x-chr8" clone="yes"/> then the text continues</source>
</trans-unit>
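
One way to generate markup like the example above is sketched here in
Python with the standard xml.etree.ElementTree module. The function name
build_source, the "c1", "c2", ... id scheme and the "x-chr<N>" ctype naming
are assumptions made for illustration; they are not defined by XLIFF.

```python
import xml.etree.ElementTree as ET

XML_NS = "http://www.w3.org/XML/1998/namespace"


def is_xml10_char(cp):
    """True if code point cp is allowed by the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def build_source(text, lang="en"):
    """Build a <source> element, emitting one <x/> per non-XML character.

    Hypothetical helper: ids and ctype names are illustrative only.
    """
    source = ET.Element("source", {"{%s}lang" % XML_NS: lang})
    source.text = ""
    last = None   # most recent placeholder; later text goes into its tail
    count = 0
    for ch in text:
        if is_xml10_char(ord(ch)):
            if last is None:
                source.text += ch
            else:
                last.tail = (last.tail or "") + ch
        else:
            count += 1
            last = ET.SubElement(source, "x", {
                "id": "c%d" % count,
                "ctype": "x-chr%d" % ord(ch),  # user-defined, so "x-" prefix
                "clone": "yes",
            })
    return source


src = build_source("Three char 8's follow\x08\x08\x08 then the text continues")
```

Serialising the result with ET.tostring(src, encoding="unicode") should give
a <source> element with three <x/> placeholders, one per character.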

If, however, the sequence of codes is significant as a unit, you may wish to
represent the whole sequence with a single placeholder that has its own
custom 'ctype'.

<trans-unit id="a920cf">
	<source xml:lang="en">Special sequence follows<x id="a920d0"
ctype="x-spcseq" clone="yes"/> then the text continues</source>
</trans-unit>
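
A rough Python sketch of that idea: scan the text for known sequences and
emit one placeholder token per match. The KNOWN_SEQUENCES table, the byte
values in it and the tokenize function are hypothetical, chosen only to
illustrate the approach.

```python
# Hypothetical table: raw code sequence -> user-defined ctype value.
KNOWN_SEQUENCES = {"\x1b\x10\x11": "x-spcseq"}


def tokenize(text, sequences=KNOWN_SEQUENCES):
    """Split text into ("text", str) and ("x", ctype) tokens.

    The first sequence in the table that matches at a position wins;
    a real tool might prefer longest-match.
    """
    tokens = []
    i = 0
    while i < len(text):
        for seq, ctype in sequences.items():
            if text.startswith(seq, i):
                tokens.append(("x", ctype))
                i += len(seq)
                break
        else:
            # No sequence matched here: append to the current text run.
            ch = text[i]
            if tokens and tokens[-1][0] == "text":
                tokens[-1] = ("text", tokens[-1][1] + ch)
            else:
                tokens.append(("text", ch))
            i += 1
    return tokens


tokens = tokenize("Special sequence follows\x1b\x10\x11 then the text continues")
```

Each ("x", ctype) token would then become a single <x/> element carrying
that ctype.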

Regards,

Doug Domeny
Software Analyst

Ektron, Inc.
+1 603 594-0249 x212
http://www.ektron.com



-----Original Message-----
From: Kristian Walsh [mailto:listreader@byteform.com]
Sent: Monday, July 26, 2004 10:14 AM
To: xliff-comment@lists.oasis-open.org
Cc: doug@ektron.com
Subject: Re: Preferred method of representing invalid XML chars in <source>?


Hi Doug,

I just realised I made two mistakes in my mail.

The first mistake was forgetting that U+000D, U+000A and U+0009 are the
only valid XML characters with codepoints below U+0020.
If you supply any other codepoint (e.g. &#07;) your XML parser should
report an error.
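
This is easy to check. A minimal sketch, assuming Python and its standard
xml.etree.ElementTree parser (the helper name parses is made up for the
example):

```python
import xml.etree.ElementTree as ET


def parses(xml_text):
    """Return True if the parser accepts xml_text, False on a parse error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


tab_ok = parses("<source>a&#09;b</source>")   # U+0009 is an allowed Char
bell_ok = parses("<source>a&#07;b</source>")  # U+0007 is not
```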

{ from the XML spec, §2.2 "Characters" }
[2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
  /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
-- end quote --
(that's their comment in there, and it is technically incorrect, as
Unicode recognises codepoints below U+0020 to be "characters")
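
The Char production translates directly into a range check. A small sketch
in Python; the function names are illustrative only:

```python
def is_xml10_char(cp):
    """True if code point cp matches production [2] Char of XML 1.0."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def invalid_positions(text):
    """List (index, code point) for every character XML 1.0 does not allow."""
    return [
        (i, ord(ch)) for i, ch in enumerate(text)
        if not is_xml10_char(ord(ch))
    ]


bad = invalid_positions("a\x10b\tc\x00")
```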

The second mistake was getting the codepoint for TAB wrong (a perennial
"misspelling" of mine) :-)

My problem is that I have source data with "unusual" codepoints.
Particularly U+0000, and U+0010 to U+001F (valid Unicode, but not valid
in XML).
--
Kristian

On 26 Jul 2004, at 14:08, Doug wrote:

> Kristian Walsh,
>
> I've not heard of invalid characters in an XML document. Usually,
> non-printable characters are represented by their character reference. I
> frequently use '&#13;&#10;' to represent CR/LF.
>
> For example,
>
> <trans-unit id="a920cf">
> 	<source xml:lang="en">Three tabs follow&#08;&#08;&#08; then the text
> continues</source>
> </trans-unit>
>
> Regards,
>
> Doug Domeny
> Software Analyst
>
> Ektron, Inc.
> +1 603 594-0249 x212
> http://www.ektron.com
>
>
>
> -----Original Message-----
> From: Kristian Walsh [mailto:listreader@byteform.com]
> Sent: Monday, July 26, 2004 6:28 AM
> To: xliff-comment@lists.oasis-open.org
> Subject: Preferred method of representing invalid XML chars in <source>?
>
>
> Hi,
>
> I am developing an application which creates XLIFF 1.0 documents from
> source data. Unfortunately, sometimes this source data contains
> character codes below U+0020, which are invalid in an XML document.
>
> I am unsure of the "canonical" way to deal with this in XLIFF 1.0
> (version 1.1 is not an option for this application); as far as I can
> see, <x/>, <g/> and <ph> can all be used for this purpose, as below:
>
>
> Form 1: <x>
>
> <trans-unit id="a920cf">
> 	<source xml:lang="en">Three tabs follow<x id="a920d0"
> ctype="character" clone="yes" ts="MyTool:chars=0008,0008,0008"/> then
> the text continues</source>
> </trans-unit>
>
>
> Form 2: <g>
>
> <trans-unit id="a920cf">
> 	<source xml:lang="en">Three tabs follow<g id="a920d0"
> ctype="character" clone="yes" ts="MyTool:chars">0008,0008,0008</g> then
> the text continues</source>
> </trans-unit>
>
>
> Form 3: <ph>
>
> <trans-unit id="a920cf">
> 	<source xml:lang="en">Three tabs follow<ph id="a920d0"
> ctype="character" ts="MyTool:chars">0008,0008,0008</ph> then the text
> continues</source>
> </trans-unit>
>
>
> So my two questions are:
>
>   1. Which of the above forms is preferred in XLIFF 1.0 for representing
> non-XML characters inside source (and/or target) data?
>
>   2. Is there a standard ctype attribute value for "raw character
> codes"?
>
> Any ideas would be greatly appreciated,
> --
> Kristian
>