

Subject: Re: [docbook] invalid characters for ISO-8859-1 response

----- Original Message ----- 
From: "Anthony Ettinger" <anthony@chovy.com>
To: "Bob Stayton" <bobs@sagehill.net>
Cc: "Dave Pawson" <davep@dpawson.co.uk>; <docbook@lists.oasis-open.org>
Sent: Wednesday, October 31, 2007 1:09 PM
Subject: Re: [docbook] invalid characters for ISO-8859-1 response

> Sure, unicode makes sense...I could be missing something but I
> would've left entity references alone...I still don't see what is
> gained by converting &#140; vs. just leaving it as &#140; in the
> output...or simply leaving it as a space.

Ah, now I think I see what you are getting at.  If you type &#160; for a 
non-breaking space, why doesn't it preserve that character as the string 
"&#160;" in the output?  The answer is that the input representation has no 
direct connection to the output representation.

When an input XML document is parsed into memory, all characters are 
converted to Unicode internally, regardless of their initial 
representation.  There is no record in memory that the input was 
"&#160;"; it is all Unicode at that point.  After processing, the XML 
is output by a serializer whose job is to convert the Unicode strings 
into an output string in some encoding.  An encoding has to be chosen, and 
it is not selected based on the input encoding (which is no longer known to 
the processor).  The default output encoding is UTF-8, but you can specify 
any of several different encodings for the serializer to use.
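
To make the parse-then-serialize point concrete, here is a minimal sketch.  
It assumes Python's standard-library ElementTree as a stand-in for the 
parser and serializer; it is not the DocBook XSL toolchain, and the 
variable names are only illustrative.  The same principle applies: the 
spelling of the input is lost at parse time, and the output bytes depend 
only on the encoding the serializer is asked to use.

```python
import xml.etree.ElementTree as ET

# Two inputs that differ only in how the non-breaking space is written:
# one as a numeric character reference, one as the literal character.
with_char_ref = ET.fromstring("<p>a&#160;b</p>")
with_literal = ET.fromstring("<p>a\u00a0b</p>")

# After parsing, both hold the same Unicode string; nothing records
# that the first input was spelled "&#160;".
same_in_memory = with_char_ref.text == with_literal.text == "a\u00a0b"

# The serializer chooses the byte representation from the OUTPUT
# encoding alone.
as_utf8 = ET.tostring(with_char_ref, encoding="utf-8")        # U+00A0 as bytes C2 A0
as_latin1 = ET.tostring(with_char_ref, encoding="ISO-8859-1")  # U+00A0 as byte A0
as_ascii = ET.tostring(with_char_ref, encoding="us-ascii")     # not encodable, so a character reference

print(same_in_memory)
print(as_utf8)
print(as_latin1)
print(as_ascii)
```

In this sketch the character reference reappears only when the chosen 
output encoding cannot represent the character directly, which is a 
property of the serializer and has nothing to do with what was typed in 
the source file.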

That said, one option you might look at is using Saxon instead of libxml2, 
and using a Saxon extension to control how characters are represented in the 
output.  After all, even if your output encoding is UTF-8, you could still 
output the six-character string "&#160;" for a non-breaking space instead 
of the character itself (the two bytes 0xC2 0xA0 in UTF-8), and it would 
still be interpreted as a non-breaking space.  Saxon provides that choice.  
See:


Bob Stayton
Sagehill Enterprises
DocBook Consulting
