docbook-apps message

Subject: Re: [docbook-apps] change default HTML encoding to UTF-8

Hi Bob. Do the stylesheets output all of HTML 4, HTML 5, XHTML and XHTML5? Or did you conflate HTML 4 and HTML 5? See more below.

On 14 Aug 2017, at 18:48, Bob Stayton wrote:

We have a bug report suggesting that the default output encoding for the DocBook html stylesheet be changed from ISO-8859-1 to UTF-8.

I agree with this bug report. Why? Well, for one thing, you - here - talk about "html", and "html" today means "html 5". HTML 5.x recommends that documents are authored using UTF-8.

Also, when I look at the link in the forwarded message (https://www.oxygenxml.com/forum/viewtopic.php?f=6&t=14812&p=43711#p43711), I note that the discussion thread talks about HTML 5. I am not able to see that HTML 4 is mentioned at all in that thread.

Note this only applies to the original HTML 4 output from the "html" directory.

Are you saying that the stylesheet also outputs HTML 5? (Note that I ask about "HTML 5" and not about xhtml or xhtml5.)

The "xhtml" and "xhtml5" outputs already output UTF-8.

The justification for that ought to be that XML defaults to UTF-8. Xhtml and xhtml5 are not 'html'.

The original HTML 4 standard said ISO-8859-1 was the default encoding, but that UTF-8 would be acceptable.

I am not able to find such a statement in the HTML 4 specification. I looked at the one-page version: https://www.w3.org/TR/html401/html40.txt

UTF-8 "took over" as the dominant encoding on the Web long before HTML 5 became the official version of HTML.

Technically speaking ISO-8859-1 is STILL the default HTML encoding, from user agents’ perspective. It is only from an authoring perspective that HTML 5 recommends UTF-8.

The DocBook stylesheets are an authoring tool. There is only one processing model for HTML, and that model is defined by the latest HTML spec. Thus the stylesheets should use UTF-8.

At the very least, the DocBook stylesheet should not use the HTML 4 specification as a justification for failing to output HTML 5 as UTF-8.

It isn't difficult for a user to change the output to UTF-8, but it does require a customization. The question here is whether to change the default output encoding to UTF-8.

If the user has to change the output to UTF-8 in order to produce HTML 5 output, then the stylesheet does not follow HTML5’s recommendations.

The fact that the user can produce XHTML - and thus automatically get UTF-8 - does not alter the picture.

This would change the HTML output to replace character references like &#xXXXX; to actual UTF-8 encoded characters, and change the encoding information in the header to reflect that.

This would be nice. But just for the record: HTML 5.x does not recommend against using character references. Thus, if need be, you CAN pick a compromise: you can continue to output the character references and yet label the document with <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">. This would then meet HTML 5’s recommendation.
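To illustrate why that compromise is safe, here is a minimal Python sketch. The sample markup is made up for the illustration and is not actual stylesheet output. It shows that a file containing only ASCII plus numeric character references has the same bytes whether it is labelled UTF-8 or ISO-8859-1, and that resolving the references gives the literal characters that a real UTF-8 serialization would write.

```python
import html

# Hypothetical fragment: non-ASCII text written as numeric character
# references, so the serialized file contains ASCII characters only.
with_refs = "<p>&#x41F;&#x440;&#x438;&#x432;&#x435;&#x442;, caf&#xE9;</p>"

# ASCII-only text encodes to identical bytes under both encodings, so
# changing the charset label alone does not change the file content.
utf8_bytes = with_refs.encode("utf-8")
latin1_bytes = with_refs.encode("iso-8859-1")
same_bytes = utf8_bytes == latin1_bytes

# Resolving the references yields the literal characters that a UTF-8
# serialization would write directly into the file.
literal = html.unescape(with_refs)
literal_utf8 = literal.encode("utf-8")

print(same_bytes)
print(literal)
```

Either form is read the same way by a browser; the difference is only in how readable the source file is and how large it gets.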

I'm reluctant to change something that will break the builds that DocBook people depend on. Would this impact you if the change was made?

One thing to perhaps consider is whether the interaction between external CSS stylesheets (that DocBook may produce) and the HTML output is affected. I do not think so, but perhaps there are some edge cases. If you need, I can look into it.


Bob Stayton

-------- Forwarded Message --------

[bugs:#1400] Default encoding for HTML-based outputs
Status: open
Group: output: HTML
Created: Thu Aug 10, 2017 11:41 AM UTC by Radu Coravu
Last Updated: Thu Aug 10, 2017 11:41 AM UTC
Owner: nobody

One of our clients reported that the default output encoding for Docbook to HTML is ISO 8859-1 which is not suitable at all for other languages with extended char sets like Russian:


Maybe the default encoding for HTML (and also for HTML chunk) should be changed to UTF-8, as UTF-8 is already used as the default encoding for XHTML.
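As a rough illustration of the problem reported here (a sketch only; the sample word is an assumption, not taken from the client's document): Russian text cannot be represented in ISO-8859-1 at all, so a serializer targeting that encoding has to fall back to one character reference per letter. Python's `xmlcharrefreplace` error handler mimics that fallback.

```python
# Sample Russian word, chosen only for illustration.
text = "Привет"

# Strict ISO-8859-1 encoding fails: Cyrillic is outside its repertoire.
try:
    text.encode("iso-8859-1")
    fits_latin1 = True
except UnicodeEncodeError:
    fits_latin1 = False

# Fallback comparable to what an HTML serializer does for characters it
# cannot encode: emit a numeric character reference for each of them.
as_refs = text.encode("iso-8859-1", errors="xmlcharrefreplace")

# UTF-8 stores the same text directly, two bytes per Cyrillic letter.
as_utf8 = text.encode("utf-8")

print(fits_latin1)
print(as_refs)
print(len(as_refs), len(as_utf8))
```

The output is still valid HTML in both cases, but the ISO-8859-1 version is unreadable in source form and several times larger for text that is mostly non-Latin.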
