OASIS Mailing List Archives

docbook-apps message


Subject: Re: [docbook-apps] docbook-xsl UTF-8


Hi Tony,
Actually, the HTML file from xsltproc is valid, but it is being misinterpreted by the 
other tools.

When I view it with two different hex dump tools (cygwin "dump" and Oxygen 12's Hex 
Viewer), the character is revealed as a single hex byte A0, which is what it should be 
for encoding in ISO-8859-1, which is the default encoding for HTML output produced by 
xsltproc.  I think that is the correct default encoding for HTML output, so I don't 
think xsltproc is doing anything wrong.

If the tool reading the file thinks the content is encoded as ASCII, then it will fail 
on that character as it is out of bounds of the first 128 characters.  If the tool 
thinks the content is encoded as UTF-8, then it will also fail, because the UTF-8 
encoding of nbsp is a two-byte sequence of hex C2 A0.  If the tool thinks the content 
is encoded as ISO-8859-1, then it works ok.  To get it right, the tool reading the 
file would have to understand the <meta> element in the file that specifies the 
encoding.
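
The three cases above can be sketched in a few lines of Python. This is only an 
illustration using the standard library; the helper name try_decode is made up for 
the example, and it stands in for whatever tool is reading the file:

```python
# The single byte that appears in the ISO-8859-1 output for a non-breaking space.
raw = b"\xa0"

def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are invalid in that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None

as_ascii = try_decode(raw, "ascii")        # None: A0 is beyond the first 128 characters
as_utf8 = try_decode(raw, "utf-8")         # None: a lone A0 is not a valid UTF-8 sequence
as_latin1 = try_decode(raw, "iso-8859-1")  # the nbsp character, U+00A0

# For comparison, the UTF-8 encoding of nbsp is the two-byte sequence C2 A0.
nbsp_utf8 = "\u00a0".encode("utf-8")
```

Only the reader that assumes ISO-8859-1 gets the nbsp character back; the other two 
reject the byte.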

When I open the file in browsers, it displays ok because it has the meta element that 
defines its encoding as ISO-8859-1.  Even when that meta element is removed from the 
file it displays ok because HTML browsers default to ISO-8859-1 encoding, and A0 is in 
that encoding range.

So I think the xsltproc output file is correct, but the tools reading it are 
misinterpreting its encoding.  Saxon output does not fail because Saxon handles 
HTML output differently.  Saxon's extension attribute 
saxon:character-representation for HTML output defaults to "entity;decimal" and the 
default output encoding for method="html" is ISO-8859-1.  The designation before the 
semicolon in saxon:character-representation refers to characters within the encoding, 
and the designation after it refers to characters outside the encoding.  Since A0 is 
within the encoding, and there is an HTML entity declared for it, Saxon outputs an 
entity reference as the string of characters "&nbsp;" instead of the native 
byte A0.  If you change that attribute value to "native;decimal" in a customization 
layer, then you would get A0 output from Saxon too.

You are wondering why xsltproc does not output the string "&#160;" instead of hex A0. 
It outputs A0 because that character is within the range of the ISO-8859-1 output 
encoding.  It only outputs numerical character references when a character is outside 
the range of the specified output encoding.  If you were to specify an output encoding 
of ASCII, using a customization layer with this:

<xsl:output method="html" encoding="us-ascii"/>

then those characters would indeed be rendered in the output as the string "&#160;" as 
you want.  The file's encoding would be ASCII, and it should be readable by any tool. 
It should also still be valid and readable by any HTML browser.
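
The same rule can be imitated in Python as a rough analogy.  This is not how 
xsltproc's serializer is implemented; it just uses Python's built-in 
"xmlcharrefreplace" error handler, and the sample string is made up for the 
example:

```python
# Sample text with a non-breaking space (character 160) between label and number.
text = "Chapter\u00a01"

# Within range of ISO-8859-1, so the character is written as the native byte A0.
latin1_bytes = text.encode("iso-8859-1")

# Outside the range of ASCII, so it is written as a numeric character reference.
ascii_bytes = text.encode("ascii", errors="xmlcharrefreplace")
```

Here latin1_bytes contains the single byte A0, while ascii_bytes contains the 
six ASCII characters "&#160;" in its place.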

I hope that somewhere in this long discourse I actually answered your question.  8^)

Bob Stayton
Sagehill Enterprises
bobs@sagehill.net


----- Original Message ----- 
From: "Tony Morris" <tmorris@tmorris.net>
To: <docbook-apps@lists.oasis-open.org>
Sent: Wednesday, June 15, 2011 5:10 AM
Subject: Re: [docbook-apps] docbook-xsl UTF-8


> Thanks for the response David.
>
> I can see in common/en.xml that the character code for those spaces is
> 160, however, the result is a file that is not a valid multi-byte
> sequence. This causes other tools to fail to read the file. Should it
> not render as &#160; in the HTML rather than a single byte with the
> value 0xa0?
>
> Thanks again for any tips.
>
>
> On 15/06/11 21:54, David Cramer wrote:
>> Hi Tony,
>> Those are non-breaking space characters added to prevent the line from
>> breaking between the label and the number. You're probably seeing them
>> in the output due to the way you have apache configured. For details on
>> how to avoid that, see the section titled "Odd characters in HTML
>> output" in Bob's book:
>> http://www.sagehill.net/docbookxsl/SpecialChars.html
>>
>> Btw., the strings those characters appear in come from the xslts (e.g.
>> common/en.xml in the stylesheet distribution). This system is described
>> in Bob's book at:
>> http://www.sagehill.net/docbookxsl/CustomGentext.html
>>
>> Regards,
>> David
>>
>> On 06/15/2011 04:06 AM, Tony Morris wrote:
>> > Hi,
>> > I am trying to convert a trivial XML file to HTML using
>> > docbook-xsl-1.76.1 but I end up with some strange characters (0x0a0a) in
>> > my resulting HTML file. I have also tried with a recent
>> > docbook-xsl-snapshot and had the same result.
>>
>> > I have pasted my entire problem here
>> > http://pastebin.com/AFrhet0D
>>
>> > You can see the undesirable characters in the "Part" and "Chapter" of
>> > the output. I'd appreciate any suggestions to get this working. Cheers.
>>
>>
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: docbook-apps-unsubscribe@lists.oasis-open.org
> For additional commands, e-mail: docbook-apps-help@lists.oasis-open.org
>
>
>
> -- 
> Tony Morris
> http://tmorris.net/
>
>
>


