xliff message

Subject: PUA Characters

From: Yves Savourel <ysavourel@translate.com>
To: 'Arle Lommel - LISA standards and publications' <arle@lisa.org>
Date: Thu, 12 Mar 2009 20:42:18 -0600

Feedback for TMX 2.0 proposal, with relation to XLIFF:

In section 1.2 "Character Encoding" of the TMX 2.0 draft there is the following paragraph:

"In addition, if the source database or application generating a TMX file uses character codes in the Private Use Area of Unicode
(code points U+E000-U+F8FF) it must convert those code points to their corresponding character entities in TMX files. For example,
if a source document uses the "fft" ligature found in certain Adobe OpenType fonts at code point U+E097 in the Private Use Area, the
corresponding TMX document would represent this character as &xE097;. This process is required since many text-processing tools do
not support the PUA. Inclusion of such character entities in TMX files may necessitate additional negotiation between the creator
and receiver of the file if such code points are to be properly interpreted. Such negotiations are outside the scope of the TMX
standard and use of the PUA is discouraged when possible."

I believe the proper terminology is:
The construct &xE097; is intended as a 'numeric character reference', not a 'character entity':
A character entity = <!ENTITY fft "&#xE097;">
A character entity reference = &fft;
A numeric character reference = &#xE097; (or &#57495;)
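
To illustrate the distinction, here is a minimal Python sketch using the standard library's ElementTree parser. The <seg> wrapper and the helper names parse_seg and is_well_formed are only for illustration and are not taken from the TMX draft. It shows that the hexadecimal and decimal numeric character references resolve to the same code point as the literal character, while the form printed in the draft is treated as a reference to an undeclared entity.

```python
import xml.etree.ElementTree as ET


def parse_seg(xml_text):
    """Parse a tiny XML fragment and return the text of its root element."""
    return ET.fromstring(xml_text).text


def is_well_formed(xml_text):
    """Return True if the fragment parses, False if the parser rejects it."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Hexadecimal NCR, decimal NCR, and the literal character are equivalent
# once the document has been parsed.
hex_ref = parse_seg("<seg>&#xE097;</seg>")
dec_ref = parse_seg("<seg>&#57495;</seg>")
literal = parse_seg("<seg>\ue097</seg>")

# Without the '#', the parser sees a reference to an entity named "xE097",
# which is not declared anywhere, so the fragment is rejected.
draft_form_ok = is_well_formed("<seg>&xE097;</seg>")
```

After parsing, an application cannot tell which of the three valid forms was used in the file, which is the usual XML behavior.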

So there are the following errors:

1) "character as &xE097;." is incorrect: It should be either "character as &#xE097;." or "character as &#57495;."

2) "those code points to their corresponding character entities in TMX" is incorrect: It should be "those code points to their
corresponding numeric character references in TMX".

3) "Inclusion of such character entities in" should be "Inclusion of such numeric character references in".

Overall Comment on this requirement:

I disagree strongly with this requirement. The only stated reason for it is "many text-processing tools do not support the PUA".
That is the problem of the text-processing tool, not TMX.

The burden of having to check TMX content character by character (to know whether each one needs to be escaped as an NCR or not) is
disproportionate to the problem being solved. More importantly, TMX should not try to solve that problem. It should not
have burdensome requirements based on the shortcomings of some tools (which are not even TMX-specific tools).
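
As a rough sketch of what the draft's requirement would mean for a tool writing TMX, the Python below scans every character and replaces those in the range U+E000-U+F8FF with a hexadecimal NCR. The function names is_bmp_pua and escape_pua are hypothetical, and the sketch assumes the text has already had its XML-special characters (&, <, >) escaped by the writer. It covers only the range quoted in the draft, not the supplementary private-use planes.

```python
def is_bmp_pua(ch):
    """True if ch lies in the Private Use Area range quoted in the draft."""
    return 0xE000 <= ord(ch) <= 0xF8FF


def escape_pua(text):
    """Replace each PUA character with a hexadecimal numeric character reference.

    Assumes text is already XML-escaped, so the '&' introduced here is not
    escaped a second time.
    """
    return "".join(
        f"&#x{ord(ch):04X};" if is_bmp_pua(ch) else ch
        for ch in text
    )


# Example: a word using the PUA code point from the draft's own example.
escaped = escape_pua("di\ue097icult")
```

Every writer would have to run such a pass over all segment content, in addition to normal XML escaping, even though a conforming XML parser reads the literal character and the NCR identically.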

TMX should just be a normal XML citizen. For example, many text-processing tools do not support the use of a BOM in a UTF-8 file, but
TMX, correctly following the Unicode standard, allows it.

I think PUA characters also need to be allowed without specific distinction in XLIFF 2.0, as they currently are.

-ys