
docbook-apps message


Subject: Re: [docbook-apps] Generating separate closing tags in XHTML webhelp output

On Fri, Mar 4, 2011 at 11:50 PM, Peter Desjardins <peter.desjardins.us@gmail.com> wrote:
On Fri, Mar 4, 2011 at 11:02 AM, Kasun Gajasinghe
<kasun.gajasinghe@gmail.com> wrote:

> The main issue with HTML is with the html-search feature. To properly
> retrieve the content text excluding the html-tags, the html files should be
> in a proper format. Strict XML is the standard way for this. That's the
> concern here. I haven't encountered any other major issue in switching to
> html!
> Looking at your mail, I assume you are switching from html to xhtml,
> right? If so, have you encountered any concerns that need some major
> effort? If so, tell us about it, and we'll look into the possibility of
> supporting the html format too.

I switched from your default XHTML to HTML. I didn't see any problems
and I tried searching for a few terms. The search feature seemed to
work properly. Maybe XHTML isn't required for the webhelp format at all.

An HTML tree is not necessarily a well-formed XML tree, so traversal can fail when the HTML is parsed with an XML parser. The search will still work, but it will be incomplete: some content may not get indexed, and that content won't appear in the search results. It's like what you said in the third paragraph of your first post about looking for the </a> tag for an <a/> tag! You can't test this by searching for just a *few* queries.

But from what I have seen, the amount of un-indexed content for HTML is fairly low, so you can depend on it with a small amount of error. XHTML, on the other hand, is completely based on XML, so the SAX XML parser has no issue parsing the content.
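
To make the difference concrete, here is a minimal sketch in Python using the standard library's SAX interface. It is not the webhelp indexer itself; the `TextCollector` and `extract_text` names and the sample pages are made up for illustration. In this sketch a well-formedness error discards the whole page, whereas the actual indexer may lose only part of the content.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class TextCollector(ContentHandler):
    """Hypothetical handler: collects character data and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)


def extract_text(markup):
    """Return the text content, or None if the markup is not well-formed XML."""
    handler = TextCollector()
    try:
        xml.sax.parseString(markup.encode("utf-8"), handler)
    except xml.sax.SAXParseException:
        # A strict XML parser stops at the first well-formedness error.
        return None
    return "".join(handler.chunks)


# XHTML output: the empty element is closed, so the tree is well-formed.
xhtml_page = "<html><body><p>Install the server,<br/> then restart.</p></body></html>"
# HTML output: <br> is never closed, so </p> looks like a mismatched tag.
html_page = "<html><body><p>Install the server,<br> then restart.</p></body></html>"

print(extract_text(xhtml_page))
print(extract_text(html_page))
```

The two sample pages differ only in how the line break is written, which is the kind of difference that separates the XHTML and HTML outputs here.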

There are some tools out there that parse dirty HTML and retrieve its whole content, but a lot of the good ones don't have a license compatible with DocBook. HtmlCleaner looks like a good solution for adding support for indexing/searching HTML files, though. So full support for HTML should come!
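
As a rough illustration of what a tolerant parser buys you, here is a small Python sketch that uses the standard library's `html.parser` as a stand-in. It does not use HtmlCleaner or its API, and the `LenientTextCollector` and `extract_text_leniently` names are hypothetical.

```python
from html.parser import HTMLParser


class LenientTextCollector(HTMLParser):
    """Hypothetical collector: gathers text even when tags are left unclosed."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of text between tags, well-formed or not.
        self.chunks.append(data)


def extract_text_leniently(markup):
    """Return all text content without requiring well-formed XML."""
    collector = LenientTextCollector()
    collector.feed(markup)
    collector.close()  # flush any text still buffered at the end of input
    return "".join(collector.chunks)


# Plain HTML with an unclosed <br>, which a strict XML parser would reject.
dirty_page = "<html><body><p>Install the server,<br> then restart.</p></body></html>"
print(extract_text_leniently(dirty_page))
```

A parser of this kind never rejects the page, so all of the text is available for indexing regardless of whether the tags are balanced.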


Kasun Gajasinghe,
University of Moratuwa,
Sri Lanka.
Blog: http://kasunbg.blogspot.com
Twitter: http://twitter.com/kasunbg
