Subject: Re: [docbook] invalid characters for ISO-8859-1 response
----- Original Message -----
From: "Anthony Ettinger" <anthony@chovy.com>
To: "Bob Stayton" <bobs@sagehill.net>
Cc: "Dave Pawson" <davep@dpawson.co.uk>; <docbook@lists.oasis-open.org>
Sent: Wednesday, October 31, 2007 1:09 PM
Subject: Re: [docbook] invalid characters for ISO-8859-1 response

> Sure, unicode makes sense...I could be missing something but I
> would've left entity references alone...I still don't see what is
> gained by converting Œ vs. just leaving it as &OElig; in the
> output...or simply leaving it as a space.

Ah, now I think I see what you are getting at. If you type &nbsp; for a non-breaking space, why doesn't it preserve that character as the string "&nbsp;" in the output?

The answer is that the input representation has no direct connection to the output representation. When an input XML document is parsed into memory, all characters are converted to Unicode internally, regardless of their initial representation. There is no record in memory that the input was "&nbsp;"; it is all Unicode at that point.

After processing in memory, the XML is output using a serializer whose job is to convert the Unicode strings into an output string in some encoding. An encoding has to be chosen, and it is not selected based on the input encoding (which is no longer known to the processor). The default output encoding is UTF-8, but you can specify any of several different encodings for the serializer to use.

That said, one option you might look at is using Saxon instead of libxml2, with a Saxon extension to control how characters are represented in the output. After all, even if your output encoding is UTF-8, you could still output the six-character string "&nbsp;" for a non-breaking space instead of the UTF-8 encoded character itself, and it would still be interpreted as a non-breaking space. Saxon provides that choice. See:

http://www.sagehill.net/docbookxsl/OutputEncoding.html#SaxonCharacter

Bob Stayton
Sagehill Enterprises
DocBook Consulting
bobs@sagehill.net
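
The parse-then-serialize behavior described above can be illustrated with a small sketch. This is not the DocBook XSL toolchain (libxml2/xsltproc or Saxon); it uses Python's standard xml.etree.ElementTree purely as a stand-in parser and serializer, and the sample document and variable names are made up for illustration. Numeric character references are used in place of &nbsp; and &OElig; because this parser is given no DTD that would define those names.

```python
import xml.etree.ElementTree as ET

# Two inputs that differ only in how the characters are written:
# numeric character references vs. the literal Unicode characters.
as_char_ref = ET.fromstring("<p>a&#160;b &#338;</p>")
as_literal = ET.fromstring("<p>a\u00a0b \u0152</p>")

# After parsing, both are the same Unicode string in memory; nothing
# records which input spelling was used.
same_in_memory = as_char_ref.text == as_literal.text

# The serializer decides the output representation from the requested
# encoding alone. Characters the encoding cannot represent are written
# as numeric character references.
utf8_out = ET.tostring(as_char_ref, encoding="utf-8")
latin1_out = ET.tostring(as_char_ref, encoding="iso-8859-1")
ascii_out = ET.tostring(as_char_ref, encoding="us-ascii")

print(same_in_memory)
print(utf8_out)
print(latin1_out)
print(ascii_out)
```

In this sketch the non-breaking space (U+00A0) exists in ISO-8859-1 and is written as a raw byte, while Œ (U+0152) does not and falls back to a character reference. Choosing a reference form for characters the encoding could represent directly is the kind of control the Saxon extension mentioned above is for.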