What Does XML Smell Like?
Michael Day, XML.com
This article introduces a set of heuristic rules for sniffing the content
of a file in order to determine whether it is an XML document or an HTML
document. An implementation is provided using the xmlReader interface of
libxml2. This implementation is used in Prince, a formatter for creating
PDF files from web documents.

Say a user
agent wants to load a web document and display it, format it, process it,
or whatever. It might be an XML document, containing XHTML, SVG, MathML, or
a nutritious mix of these vocabularies. Or it might be an HTML document,
ideally valid HTML4, but more likely an unappetizing bowl of tag soup. The
problem is, how does the user agent know whether to parse the document as
XML or HTML?

If the document is being retrieved over the Web, then there is no problem,
as the HTTP response will come with a Content-Type header that gives the
MIME type of the document. This may be text/html for HTML, application/xml
for XML, or application/xhtml+xml for XHTML. The user agent can check the
MIME type before trying to parse the document, and all is well.

However,
if the document is being loaded from a local file, there is no obvious way
to determine if it is XML or HTML. The user agent might try checking the
file extension, but what if it is .html? It is common for XHTML files to be
given an extension of .html or .htm, as .xhtml is rather long and .xht is
rather obscure. This means that a file with an extension of .html may
actually be an XML document and require an XML parser. If such a document
is parsed as HTML instead, it will probably load in some cases, but the
user may not get what he expects, as style sheets and scripts may behave
differently, embedded SVG or MathML content will be garbled, and external
entities and inclusions will not be resolved. Web user agents like Prince
need a way to determine whether a .html file should be parsed as XML or
HTML.
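
The two checks described above, MIME type first and file extension as a fallback, might be sketched roughly as follows. This is a minimal illustration in plain C, not code from Prince: the type doc_kind and the functions classify_by_mime_type, classify_by_extension, and equals_nocase are hypothetical names, and treating .xhtml and .xht as XML is an assumption based on the extensions mentioned above.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical result type; DOC_UNKNOWN means "the content must be sniffed". */
typedef enum { DOC_HTML, DOC_XML, DOC_UNKNOWN } doc_kind;

/* Case-insensitive comparison of the first n characters of a with all of b. */
static int equals_nocase(const char *a, size_t n, const char *b)
{
    if (strlen(b) != n)
        return 0;
    for (size_t i = 0; i < n; i++)
        if (tolower((unsigned char)a[i]) != tolower((unsigned char)b[i]))
            return 0;
    return 1;
}

/* Classify by Content-Type value, ignoring parameters such as
   "; charset=utf-8". Only the three MIME types named in the text
   are recognized. */
doc_kind classify_by_mime_type(const char *content_type)
{
    assert(content_type != NULL);
    size_t n = strcspn(content_type, "; \t");
    if (equals_nocase(content_type, n, "text/html"))
        return DOC_HTML;
    if (equals_nocase(content_type, n, "application/xml"))
        return DOC_XML;
    if (equals_nocase(content_type, n, "application/xhtml+xml"))
        return DOC_XML;
    return DOC_UNKNOWN;
}

/* Classify a local file by extension. Assumption: .xhtml and .xht imply
   XML. Everything else, including .html and .htm, is left undecided,
   because XHTML files are commonly saved under those names. */
doc_kind classify_by_extension(const char *path)
{
    assert(path != NULL);
    const char *dot = strrchr(path, '.');
    if (dot == NULL)
        return DOC_UNKNOWN;
    const char *ext = dot + 1;
    size_t n = strlen(ext);
    if (equals_nocase(ext, n, "xhtml") || equals_nocase(ext, n, "xht"))
        return DOC_XML;
    return DOC_UNKNOWN;
}
```

A caller would use classify_by_mime_type when an HTTP response is available and classify_by_extension for local files; a DOC_UNKNOWN result is the case that calls for sniffing the content itself.
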
In the absence of telepathy, there is no perfect algorithm to determine the
intent of the author, so we will need to formulate some heuristics that can
sniff the content of the document and see if it smells like XML or HTML.

In Prince, the document-sniffing heuristics are implemented as a C function
that uses the xmlReader interface from libxml2 to parse the document until
either the first start tag is reached or one of the heuristics matches. A
copiously commented version of the code, as well as some sample documents
to test it on, is available for download in the "Code" section below; it
compiles to a small program that sniffs files and classifies them as being
XML or HTML.
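
As a rough illustration of the parse-until-the-first-start-tag idea, here is a minimal sketch in plain C. It does not use libxml2: a hand-written scan of the document prolog stands in for the xmlReader loop so that the example has no dependencies. The two rules it applies (an XML declaration, or an xmlns attribute on the first start tag, suggests XML) are assumptions chosen for illustration and are not necessarily the heuristics used in Prince. The names smell, sniff_document, and tag_declares_namespace are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

typedef enum { SMELLS_LIKE_HTML, SMELLS_LIKE_XML } smell;

/* Returns 1 if the start tag text (without the closing '>') contains
   whitespace followed by "xmlns". Simplified: attribute values are not
   parsed, so " xmlns" inside a value would also match. */
static int tag_declares_namespace(const char *tag, size_t len)
{
    for (size_t i = 1; i + 5 <= len; i++)
        if (isspace((unsigned char)tag[i - 1]) &&
            memcmp(tag + i, "xmlns", 5) == 0)
            return 1;
    return 0;
}

/* Sniff a NUL-terminated buffer holding the start of a document. */
smell sniff_document(const char *buf)
{
    assert(buf != NULL);
    const char *p = buf;

    /* Skip a UTF-8 byte order mark, if present. */
    if (strncmp(p, "\xEF\xBB\xBF", 3) == 0)
        p += 3;

    /* Assumed rule 1: an XML declaration means XML. */
    if (strncmp(p, "<?xml", 5) == 0 && isspace((unsigned char)p[5]))
        return SMELLS_LIKE_XML;

    /* Walk the prolog and stop at the first start tag. */
    while ((p = strchr(p, '<')) != NULL) {
        if (strncmp(p, "<!--", 4) == 0) {
            /* Comment: skip to its end. */
            const char *end = strstr(p + 4, "-->");
            if (end == NULL)
                break;
            p = end + 3;
        } else if (p[1] == '!' || p[1] == '?') {
            /* DOCTYPE or processing instruction: skip to the next '>'. */
            const char *end = strchr(p, '>');
            if (end == NULL)
                break;
            p = end + 1;
        } else if (isalpha((unsigned char)p[1])) {
            /* Assumed rule 2: xmlns on the first start tag means XML. */
            const char *end = strchr(p, '>');
            size_t len = end ? (size_t)(end - p) : strlen(p);
            return tag_declares_namespace(p, len) ? SMELLS_LIKE_XML
                                                  : SMELLS_LIKE_HTML;
        } else {
            p++; /* Stray '<': keep scanning. */
        }
    }

    /* No rule matched before the input ran out: fall back to HTML. */
    return SMELLS_LIKE_HTML;
}
```

Falling back to HTML when nothing matches is a design choice of this sketch: a tag-soup parser is the more forgiving option for input that gives no hint either way.
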
http://www.xml.com/pub/a/2007/02/28/what-does-xml-smell-like.html