Pre-scanning XML content to determine characteristics

What Does XML Smell Like?
Michael Day, XML.com

This article introduces a set of heuristic rules for sniffing the
content of a file in order to determine whether it is an XML document
or an HTML document. An implementation is provided using the xmlReader
interface of libxml2. This implementation is used in Prince, a
formatter for creating PDF files from web documents.

Say a user agent
wants to load a web document and display it, format it, process it,
or whatever. It might be an XML document, containing XHTML, SVG,
MathML, or a nutritious mix of these vocabularies. Or it might be an
HTML document, ideally valid HTML4, but more likely an unappetizing
bowl of tag soup. The problem is, how does the user agent know whether
to parse the document as XML or HTML? If the document is being
retrieved over the Web, then there is no problem, as the HTTP response
will come with a Content-Type header that gives the MIME type of the
document. This may be text/html for HTML, application/xml for XML, or
application/xhtml+xml for XHTML. The user agent can check the MIME
type before trying to parse the document, and all is well. However,
if the document is being loaded from a local file, there is no obvious
way to determine if it is XML or HTML. The user agent might try
checking the file extension, but what if it is .html? It is common for
XHTML files to be given an extension of .html or .htm, as .xhtml is
rather long and .xht is rather obscure. This means that a file with an
extension of .html may actually be an XML document and require an XML
parser. If such a file is parsed as HTML instead, it will probably load
in some cases, but the user may not get what he expects: style sheets
and scripts may behave differently, embedded SVG or MathML content will
be garbled, and external entities and inclusions will not be resolved.
Web user agents like Prince need
a way to determine whether a .html file should be parsed as XML or HTML.
In the absence of telepathy, there is no perfect algorithm to determine
the intent of the author, so we will need to formulate some heuristics
that can sniff the content of the document and see if it smells like
XML or HTML. In Prince, the document-sniffing heuristics are
implemented as a C function that uses the xmlReader interface from
libxml2 to parse the document until it reaches the first start tag or
one of the heuristics matches. A copiously commented version of the
code, along with some sample documents to test it on, is available for
download in the "Code" section of the article; it compiles to a small
program that sniffs files and classifies them as XML or HTML.
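
To make the two-step decision concrete, here is a rough sketch in Python
(not the C/libxml2 code used in Prince). It trusts a declared MIME type
when there is one and otherwise sniffs the start of the content. The
function names and the three sniffing rules are assumptions chosen for
illustration; this summary does not list the article's actual
heuristics, and the sketch uses regular expressions rather than a real
parser such as xmlReader.

```python
import re


def parser_for_mime_type(content_type):
    """Map a Content-Type header value to "html", "xml", or None if unsure."""
    mime = content_type.split(";", 1)[0].strip().lower()
    if mime == "text/html":
        return "html"
    # application/xml and application/xhtml+xml come from the text above;
    # treating every other "+xml" type as XML is an added assumption.
    if mime == "application/xml" or mime.endswith("+xml"):
        return "xml"
    return None


def sniff(data):
    """Guess "xml" or "html" from the first bytes of a document."""
    text = data[:4096].decode("utf-8", errors="replace")
    # Drop comments so that markup inside them is not mistaken for a tag.
    text = re.sub(r"<!--.*?-->", "", text, flags=re.DOTALL)
    text = text.lstrip("\ufeff \t\r\n")

    # Assumed rule 1: an XML declaration marks the document as XML.
    if text.startswith("<?xml"):
        return "xml"

    # Assumed rule 2: a DOCTYPE that mentions XHTML suggests XML, while
    # any other DOCTYPE for an "html" root suggests HTML.
    doctype = re.search(r"<!DOCTYPE\s+([^>]*)>", text, re.IGNORECASE)
    if doctype:
        decl = doctype.group(1).lower()
        if "xhtml" in decl:
            return "xml"
        if decl.split()[:1] == ["html"]:
            return "html"

    # Assumed rule 3: inspect only the first start tag; a namespace
    # declaration on it suggests XML.
    tag = re.search(r"<([A-Za-z_][\w:.-]*)([^>]*)>", text)
    if tag and re.search(r"\bxmlns(:\w+)?\s*=", tag.group(2)):
        return "xml"

    # Nothing matched: assume tag soup and fall back to HTML.
    return "html"


def choose_parser(data, content_type=None):
    """Prefer the declared MIME type; fall back to sniffing the content."""
    declared = parser_for_mime_type(content_type) if content_type else None
    return declared or sniff(data)
```

A local file has no Content-Type, so choose_parser(data) goes straight
to sniff(). Defaulting to HTML when no rule matches is a design choice
of this sketch: tag soup is the case an XML parser would reject
outright.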

http://www.xml.com/pub/a/2007/02/28/what-does-xml-smell-like.html
