OASIS Mailing List Archives

docbook-apps message



Subject: Re: [docbook-apps] change default HTML encoding to UTF-8


Hi Leif,
Thanks for taking the time to look into this in more detail. I have some responses below that I think will clarify the situation.

Bob Stayton
Sagehill Enterprises
bobs@sagehill.net

On 8/15/2017 6:44 AM, Leif Halvard Silli wrote:
Hi Bob. Do the stylesheets output all of html 4, html 5, xhtml and xhtml5? Or did you conflate html 4 and html 5? See more below.

The DocBook distribution has these stylesheets:

html - outputs HTML 4
xhtml - outputs XHTML 1.0
xhtml-1_1 - outputs XHTML 1.1 (mainly used for EPUB 2)
xhtml5 - outputs polyglot HTML 5

There is no stylesheet that outputs HTML 5 that is not serialized as XML. Here is the description of polyglot HTML 5 from Wikipedia:

"Polyglot HTML is HTML that has been written to conform to both the HTML and XHTML specifications.[1] A polyglot document can therefore be parsed as either HTML (which is SGML-compatible) or XML, and will produce the same DOM structure either way. For example, in order for an HTML5 document to meet these criteria, the two requirements are that it must have an HTML5 doctype, and be written in well-formed XHTML.[2] The same document can then be served as either HTML or XHTML, depending on browser support and MIME type."

I named the directory "xhtml5" to indicate that the output is parsable as XML. Those stylesheets output the DOCTYPE declaration expected of HTML 5 and the XHTML namespace declaration expected of XHTML.
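
To illustrate the two requirements named in the Wikipedia description (an HTML5 doctype plus well-formed XHTML), here is a minimal hand-written sketch of a polyglot document. It is an assumed example for illustration only, not actual output from the "xhtml5" stylesheets; the title and paragraph text are placeholders.

```xml
<!DOCTYPE html>
<!-- Hand-written illustration, not stylesheet output. -->
<!-- The HTML5 doctype above satisfies an HTML parser; the XHTML
     namespace and well-formed markup below satisfy an XML parser. -->
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
  <head>
    <!-- Empty elements are self-closed so the file stays well-formed XML. -->
    <meta charset="UTF-8"/>
    <title>Placeholder title</title>
  </head>
  <body>
    <p>Placeholder paragraph.</p>
  </body>
</html>
```

Served as text/html this is read by the HTML parser, and served as application/xhtml+xml it is read by the XML parser.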

On 14 Aug 2017, at 18:48, Bob Stayton wrote:

We have a bug report suggesting that the default output encoding for the DocBook html stylesheet be changed from ISO-8859-1 to UTF-8.

I agree with this bug report. Why? Well, for one thing, you - here - talk about "html", and "html" today means "html 5". HTML 5.x recommends that documents are authored using UTF-8.

In the DocBook stylesheet directory name, "html" means HTML 4. The XHTML 5 stylesheet outputs UTF-8.

Also, when I look at the link in the forwarded message (https://www.oxygenxml.com/forum/viewtopic.php?f=6&t=14812&p=43711#p43711), I note that the discussion thread talks about HTML 5. I am not able to see that HTML 4 is mentioned at all in that thread.


I think this is the source of the confusion. I missed the subject line that said "HTML 5". Since they mentioned iso-8859-1, I assumed they were talking about the "html" stylesheets, which are the original HTML 4 output. So they were trying to get HTML 5 output but were using the "html" stylesheet.

Note this only applies to the original HTML 4 output from the "html" directory.

Right.


Are you saying that the stylesheet also outputs HTML 5? (Note that I ask about "HTML 5" and not about xhtml or xhtml5.)

The "xhtml5" directory outputs polyglot HTML 5.


The "xhtml" and "xhtml5" stylesheets already output UTF-8.

Right.


The justification for that ought to be that XML defaults to UTF-8. Xhtml and xhtml5 are not 'html'.

Well, I would say the W3C muddied that pond when they created polyglot HTML 5.


The original HTML 4 standard said ISO-8859-1 was the default encoding, but that UTF-8 would be acceptable.

I am not able to find such a statement in the HTML 4 specification. I looked at the one-page version: https://www.w3.org/TR/html401/html40.txt

I found that statement on the W3Schools website (not a W3C site):

https://www.w3schools.com/html/html_charset.asp

UTF-8 "took over" as the dominant encoding on the Web long before HTML 5 became the official version of HTML.

Yes, no argument there.

Technically speaking ISO-8859-1 is STILL the default HTML encoding, from user agents’ perspective. It is only from an authoring perspective that HTML 5 recommends UTF-8.

The DocBook stylesheets are an authoring tool. There is only one processing model for HTML, and that model is defined by the latest HTML spec. Thus they should use UTF-8.

At the very least, the DocBook stylesheet should not use the HTML 4 specification as a justification for failing to output HTML 5 as UTF-8.

It does not. If a user wants HTML 5 they will need to use the "xhtml5" stylesheets in the distribution, and they will get UTF-8.

It isn't difficult for a user to change the output to UTF-8, but it does require a customization. The question here is whether to change the default output encoding to UTF-8.
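
For readers unfamiliar with what such a customization looks like, the following is a minimal sketch of a customization layer for the non-chunking "html" stylesheet. It assumes the stylesheets are reachable at the SourceForge "current" URL shown in the import (a local path to html/docbook.xsl would work the same way), and the file name my-html.xsl is only a placeholder.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- my-html.xsl: hypothetical customization layer (file name is a placeholder) -->
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">

  <!-- Import the stock HTML 4 stylesheet; adjust the href to your installation. -->
  <xsl:import href="http://docbook.sourceforge.net/release/xsl/current/html/docbook.xsl"/>

  <!-- This xsl:output has higher import precedence than the one in the
       imported stylesheet, so its encoding replaces ISO-8859-1. -->
  <xsl:output method="html" encoding="UTF-8" indent="no"/>

</xsl:stylesheet>
```

The chunking stylesheet writes its files through its own mechanism, so there the encoding is normally set with the chunker.output.encoding parameter rather than with xsl:output. Either way the output is still HTML 4; only the encoding changes.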

If the user has to change the output to UTF-8 in order to produce HTML 5 output, then the stylesheet does not follow HTML5’s recommendations.

No, this user should have selected the "xhtml5" stylesheet if they want HTML 5 output. No amount of customization will get the "html" stylesheet to output HTML 5.

The DocBook XSL development process takes great pains to maintain backwards compatibility with its installed base. The reason the "html" directory still outputs HTML 4 is for backwards compatibility. Users that have built systems that use those stylesheets won't be surprised by suddenly getting HTML 5 output. If they want HTML 5 output, they should use the "xhtml5" directory.

I hope this clarified things.

Bob Stayton

