Subject: Re: [docbook-apps] docbook-xsl UTF-8
Hi Tony,

Actually, the HTML file from xsltproc is valid, but it is being misinterpreted by the other tools. When I view it with two different hex dump tools (cygwin "dump" and Oxygen 12's Hex Viewer), the character is revealed as a single hex byte A0, which is what it should be in ISO-8859-1, the default encoding for HTML output produced by xsltproc. I think that is the correct default encoding for HTML output, so I don't think xsltproc is doing anything wrong.

If the tool reading the file thinks the content is encoded as ASCII, then it will fail on that character, because it lies outside the first 128 characters. If the tool thinks the content is encoded as UTF-8, then it will also fail, because the UTF-8 encoding of nbsp is the two-byte sequence hex C2 A0, and a lone A0 byte is not valid UTF-8. If the tool thinks the content is encoded as ISO-8859-1, then it works ok. To get it right, the tool reading the file would have to understand the <meta> element in the file that specifies the encoding.

When I open the file in browsers, it displays ok because it has the meta element that defines its encoding as ISO-8859-1. Even when that meta element is removed from the file, it displays ok because HTML browsers default to ISO-8859-1 encoding, and A0 is in that encoding range. So I think the xsltproc output file is correct, but the tools reading it are misinterpreting its encoding.

The reason Saxon output does not fail is that Saxon handles HTML output differently. Saxon's extension attribute saxon:character-representation for HTML output defaults to "entity;decimal", and the default output encoding for method="html" is ISO-8859-1. The designation before the semicolon in saxon:character-representation applies to characters within the encoding. Since A0 is within the encoding, and there is an HTML entity declared for it, Saxon outputs an entity reference as the string of characters "&nbsp;" instead of the native byte A0.
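
To make the byte-level picture concrete, here is a minimal Python sketch. It is not part of xsltproc or Saxon, and the names NBSP, decodes_as and representations are made up for illustration. It only shows how the same character looks as a native ISO-8859-1 byte, as a UTF-8 sequence, as a named entity, and as a numeric character reference, and which decoders accept the lone A0 byte.

```python
from html.entities import codepoint2name

NBSP = "\u00a0"  # non-breaking space, character code 160


def decodes_as(data: bytes, encoding: str) -> bool:
    """Return True if the bytes are valid in the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


def representations(ch: str) -> dict:
    """Show the serializations of one character discussed in this message."""
    name = codepoint2name.get(ord(ch))  # HTML entity name, if one is declared
    return {
        # Native byte in ISO-8859-1 (what xsltproc writes by default).
        "iso-8859-1": ch.encode("iso-8859-1").hex(),
        # Native bytes in UTF-8.
        "utf-8": ch.encode("utf-8").hex(),
        # Named entity reference (what Saxon writes with "entity;decimal").
        "entity": "&" + name + ";" if name else None,
        # Numeric character reference, the fallback when the output
        # encoding cannot represent the character natively.
        "charref": ch.encode("ascii", errors="xmlcharrefreplace").decode("ascii"),
    }


if __name__ == "__main__":
    print(representations(NBSP))
    lone_byte = NBSP.encode("iso-8859-1")
    for enc in ("ascii", "utf-8", "iso-8859-1"):
        print(enc, decodes_as(lone_byte, enc))
```

Only the ISO-8859-1 decoder accepts the lone byte. A reader that assumes ASCII or UTF-8 rejects it, which is the failure mode described above.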
If you change the saxon:character-representation value to "native;decimal" in a customization layer, then you would get A0 output from Saxon too.

You are wondering why xsltproc does not output the string "&#160;" instead of hex A0. It outputs A0 because that character is within the range of the ISO-8859-1 output encoding. It only outputs numerical character references when a character is outside the range of the specified output encoding. If you were to specify an output encoding of ASCII, using a customization layer with this:

  <xsl:output method="html" encoding="us-ascii"/>

then those characters would indeed be rendered in the output as the string "&#160;" as you want. The file's encoding would be ASCII, and it should be readable by any tool. It should also still be valid and readable by any HTML browser.

I hope that somewhere in this long discourse I actually answered your question. 8^)

Bob Stayton
Sagehill Enterprises
bobs@sagehill.net

----- Original Message -----
From: "Tony Morris" <tmorris@tmorris.net>
To: <docbook-apps@lists.oasis-open.org>
Sent: Wednesday, June 15, 2011 5:10 AM
Subject: Re: [docbook-apps] docbook-xsl UTF-8

> Thanks for the response David.
>
> I can see in common/en.xml that the character code for those spaces is
> 160, however, the result is a file that is not a valid multi-byte
> sequence. This causes other tools to fail to read the file. Should it
> not render as &#160; in the HTML rather than a single byte with the
> value 0xa0?
>
> Thanks again for any tips.
>
>
> On 15/06/11 21:54, David Cramer wrote:
>> Hi Tony,
>> Those are non-breaking space characters added to prevent the line from
>> breaking between the label and the number. You're probably seeing them
>> in the output due to the way you have apache configured. For details on
>> how to avoid that, see the section titled "Odd characters in HTML
>> output" in Bob's book:
>> http://www.sagehill.net/docbookxsl/SpecialChars.html
>>
>> Btw., the strings those characters appear in come from the xslts (e.g.
>> common/en.xml in the stylesheet distribution). This system is described
>> in Bob's book at:
>> http://www.sagehill.net/docbookxsl/CustomGentext.html
>>
>> Regards,
>> David
>>
>> On 06/15/2011 04:06 AM, Tony Morris wrote:
>> > Hi,
>> > I am trying to convert a trivial XML file to HTML using
>> > docbook-xsl-1.76.1 but I end up with some strange characters (0x0a0a) in
>> > my resulting HTML file. I have also tried with a recent
>> > docbook-xsl-snapshot and had the same result.
>> >
>> > I have pasted my entire problem here
>> > http://pastebin.com/AFrhet0D
>> >
>> > You can see the undesirable characters in the "Part" and "Chapter" of
>> > the output. I'd appreciate any suggestions to get this working. Cheers.
>
> --
> Tony Morris
> http://tmorris.net/
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: docbook-apps-unsubscribe@lists.oasis-open.org
> For additional commands, e-mail: docbook-apps-help@lists.oasis-open.org