
xslt-conformance message



Subject: [xslt-conformance] This could be our answer for HTML comparison


Output-comparison subcommittee take note!
.................David Marston

---------------------- Forwarded by David Marston/Cambridge/IBM on
02/12/2002 12:27 AM ---------------------------


Andy Clark <andyc@apache.org> on 02/08/2002 10:15:36 PM

Please respond to general@xml.apache.org

To:    general@xml.apache.org
Subject:    [ANNOUNCE] Xerces HTML Parser

For a long time users have asked if Xerces can parse HTML files.
But since most HTML documents are not well-formed XML documents,
it is generally not possible to use a conforming XML parser to
read HTML documents.

However, the Xerces Native Interface (XNI) that is the foundation
of the Xerces2 implementation defines a framework that allows
different kinds of parsers to be constructed by connecting a
pipeline of parser components. Therefore, any component that
generates the appropriate XNI "events" can be used to emit SAX
events, build DOM trees, or do anything else that you can think
of.
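
To make the pipeline idea concrete, here is a minimal sketch in Java.
It does not use the real XNI interfaces: EventHandler, UppercaseNames,
EventLog, and PipelineSketch are hypothetical names invented for this
illustration. It only shows the general shape, in which each stage
receives events and forwards them to the next stage, so the last stage
can build whatever output it likes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical event interface; NOT the actual XNI API.
interface EventHandler {
    void startElement(String name);
    void characters(String text);
    void endElement(String name);
}

// A mid-pipeline filter: normalizes element names, then forwards each event.
class UppercaseNames implements EventHandler {
    private final EventHandler next;

    UppercaseNames(EventHandler next) {
        this.next = next;
    }

    public void startElement(String name) {
        next.startElement(name.toUpperCase(Locale.ROOT));
    }

    public void characters(String text) {
        next.characters(text);
    }

    public void endElement(String name) {
        next.endElement(name.toUpperCase(Locale.ROOT));
    }
}

// A terminal stage: records events as strings. A real terminal stage
// might emit SAX events or build a DOM tree instead.
class EventLog implements EventHandler {
    final List<String> events = new ArrayList<>();

    public void startElement(String name) {
        events.add("<" + name + ">");
    }

    public void characters(String text) {
        events.add(text);
    }

    public void endElement(String name) {
        events.add("</" + name + ">");
    }
}

class PipelineSketch {
    // Plays the role of a scanner by pushing three events into the pipeline.
    static List<String> run() {
        EventLog log = new EventLog();
        EventHandler pipeline = new UppercaseNames(log);
        pipeline.startElement("p");
        pipeline.characters("hello");
        pipeline.endElement("p");
        return log.events;
    }
}
```

Swapping EventLog for a different terminal stage changes what is built
without touching the stages upstream of it.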

So, as a fun little exercise, I have written a basic HTML parser
using XNI. It consists of an HTML scanner component that can scan
HTML files and generate XNI events and a tag balancing component.
The tag balancer cleans up the events produced by the scanner,
balancing mismatched tags and adding tags where necessary. And
it does all of this in a streaming manner to minimize the amount
of memory required.
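
The balancing step can be sketched as follows. This is a toy version
and not NekoHTML's actual algorithm: the class TagBalancerSketch and
its rules (drop an end tag that was never opened, close any still-open
descendants before closing an ancestor, close everything at end of
document) are assumptions made for illustration. The only state it
keeps is the stack of currently open elements; the output list exists
just so the demo can return its result.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical toy balancer; not the NekoHTML implementation.
class TagBalancerSketch {
    private final Deque<String> open = new ArrayDeque<>(); // open elements, innermost first
    private final List<String> out = new ArrayList<>();    // demo-only output sink

    void startElement(String name) {
        open.push(name);
        out.add("<" + name + ">");
    }

    void characters(String text) {
        out.add(text);
    }

    void endElement(String name) {
        if (!open.contains(name)) {
            return; // stray end tag that was never opened: drop it
        }
        // Close any still-open descendants first, then the element itself.
        String top;
        do {
            top = open.pop();
            out.add("</" + top + ">");
        } while (!top.equals(name));
    }

    // Closes whatever is still open and returns the balanced event list.
    List<String> endDocument() {
        while (!open.isEmpty()) {
            out.add("</" + open.pop() + ">");
        }
        return out;
    }

    static List<String> demo() {
        TagBalancerSketch b = new TagBalancerSketch();
        b.startElement("p");
        b.startElement("b");
        b.characters("bold");
        b.endElement("p");  // the input never closed <b>
        b.endElement("i");  // the input never opened <i>
        b.startElement("ul");
        b.startElement("li");
        b.characters("item");
        return b.endDocument(); // the input ends with <li> and <ul> still open
    }
}
```

For the mismatched input in demo(), the events come out as
<p> <b> bold </b> </p> <ul> <li> item </li> </ul>.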

Since I wrote the HTML parser as an example of using XNI and
because the code is considered alpha quality (but it seems to
work quite well, actually!), I am posting the code with a very
limited license. Even though it contains the complete source
code for the HTML parser, the license only allows the user to
experiment but gives no right to actually use the code in a
product.

If the source isn't "free" or "open", why release it at all?
I want to get an idea of what people think of the code first.
Then, if there's enough interest, I would like to either donate
the code to the Xerces-J project or make it available elsewhere
under a true open source license.

So, if you've been looking for a way to parse HTML documents,
please try out the HTML parser and let me know what you think.
There should be enough information in the documentation to get
you started. Check out the "NekoHTML" project listed on my
Apache web site: http://www.apache.org/~andyc/

Have fun!
--
Andy Clark * andyc@apache.org



