docbook message

Subject: RE: [docbook] Why use DocBook when there is HTML?

From: maxwell <maxwell@umiacs.umd.edu>
To: <docbook@lists.oasis-open.org>
Date: Tue, 03 Nov 2009 17:17:51 -0500

Mauritz Jeanson wrote:
> Mark Pilgrim abandoned DocBook in favour of HTML...
> What are your thoughts about this? 

We're sort of off in a little corner of our own, so our mileage probably
differs from everyone else's.  But we find DocBook indispensable.

Our corner is doing literate programming of grammars of natural languages. 
This means embedding formal grammar fragments into a prose (descriptive)
grammar; the fragments can be extracted and turned into a complete
(morphological and phonological, not syntactic) grammar, which can in turn
be converted into a morphological parser.  The prose grammar acts as the
documentation of the formal grammar.
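
To make the extraction step concrete, here is a minimal sketch in Python. It is not our actual toolchain: the namespace urn:example:litprog, the fragment element, and the rule notation in the sample are all hypothetical stand-ins for whatever literate programming markup a project really uses. It only shows the idea of pulling the formal fragments out of the prose in document order and concatenating them.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace and element name, for illustration only.
LP = "urn:example:litprog"

# A tiny invented document: prose paragraphs with embedded grammar fragments.
SAMPLE = """<section xmlns="http://docbook.org/ns/docbook"
         xmlns:lp="urn:example:litprog">
  <para>Plural nouns take the suffix -s.</para>
  <lp:fragment id="plural">Plural = Noun + s</lp:fragment>
  <para>Past-tense verbs take the suffix -ed.</para>
  <lp:fragment id="past">Past = Verb + ed</lp:fragment>
</section>"""


def extract_fragments(xml_text):
    """Return {fragment id: fragment text}, in document order."""
    root = ET.fromstring(xml_text)
    return {
        frag.get("id"): "".join(frag.itertext()).strip()
        for frag in root.iter(f"{{{LP}}}fragment")
    }


def tangle(xml_text):
    """Concatenate all fragments into one formal grammar source."""
    return "\n".join(extract_fragments(xml_text).values()) + "\n"


if __name__ == "__main__":
    print(tangle(SAMPLE), end="")
```

The prose paragraphs are ignored by the extraction, which is the point: the same file serves as both the readable grammar and the source of the formal one.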

There are several factors that make DocBook seem like the right way to go. 
First, it's content markup, not formatting markup.  To be sure, we've had
to add some content markup tags (for interlinear text, but also the
literate programming tags--we used Norm Walsh's extensions).  But the use
of content markup allows us to do things like extracting all the words in
the target language (but not, for example, individual suffixes appearing in
the text) to run them through the parser for purposes of verification.  We
can also extend the DocBook XML to embed an entire lexicon for test
purposes (the lexicon would of course have its own internal tags).
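
As a rough illustration of why content markup matters here, the sketch below collects whole words in the target language while skipping bare suffixes. The markup is an assumption made for the example: it uses DocBook's foreignphrase element with xml:lang, plus hypothetical role values "word" and "suffix"; the sample forms are simplified romanizations chosen only for illustration. A real schema extension could mark this distinction quite differently.

```python
import xml.etree.ElementTree as ET

DB = "http://docbook.org/ns/docbook"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Invented sample; the role values "word" and "suffix" are hypothetical.
SAMPLE = """<para xmlns="http://docbook.org/ns/docbook">
  The noun <foreignphrase xml:lang="ur" role="word">kitab</foreignphrase>
  takes the plural suffix
  <foreignphrase xml:lang="ur" role="suffix">-en</foreignphrase>, giving
  <foreignphrase xml:lang="ur" role="word">kitaben</foreignphrase>.
</para>"""


def target_words(xml_text, lang="ur"):
    """Return the full words marked as being in `lang`, skipping suffixes."""
    root = ET.fromstring(xml_text)
    words = []
    for el in root.iter(f"{{{DB}}}foreignphrase"):
        if el.get(XML_LANG) == lang and el.get("role") == "word":
            words.append("".join(el.itertext()).strip())
    return words


if __name__ == "__main__":
    # Each word could then be handed to the morphological parser as a check.
    for word in target_words(SAMPLE):
        print(word)
```

With formatting markup alone (say, everything in italics), the word and the suffix would be indistinguishable to a script like this.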

Second, the work we're doing (unlike Mark Pilgrim's Python book) is
explicitly targeted at Forever.  Grammars never get superseded: there's an
entire industry of documenting and describing endangered languages, and of
course there are useful grammars of languages which have been extinct for
thousands of years.  (Linguists never throw anything away :-).)  So the
content markup tags of DocBook XML provide what I believe is a better basis
for documentation that will remain interpretable for the long term (hundreds
or maybe even thousands of years).

Third, some of the things we're doing are very messy to typeset.  Our last
grammar was of Urdu, which uses an almost calligraphic version of the
Arabic script called Nasta'liq.  Short of typesetting Mongolian vertically,
I guess this is as far as you could get from ASCII.  I don't think it would
render well in HTML.  To be honest, we didn't try to render it using the
standard XSL-FO path either; we could only get what we wanted using XeTeX
(a Unicode-aware TeX engine that can run LaTeX), for which our conversion
process relies on an open source program called dblatex.  The result is
output as a PDF.
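
For anyone wanting to try the same route, a hedged sketch of the invocation follows, wrapped in Python so it can be dropped into a build script. It assumes dblatex is installed with its XeTeX backend available; the file names grammar.xml and grammar.pdf are placeholders, and any project-specific stylesheets or fonts are left out. The function only builds the command and does not run it.

```python
import shlex


def dblatex_command(source, output=None):
    """Build a dblatex command line that typesets via the XeTeX backend."""
    cmd = ["dblatex", "--backend=xetex"]
    if output is not None:
        cmd += ["-o", output]  # explicit output file name
    cmd.append(source)
    return cmd


if __name__ == "__main__":
    # Placeholder file names; pass the list to subprocess.run(..., check=True)
    # to actually run the conversion.
    print(shlex.join(dblatex_command("grammar.xml", "grammar.pdf")))
```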

Maybe there is a way to do the above in HTML, but when we were figuring out
how to do it, we didn't run across such a method.

I'll take this opportunity to say that one of the things that seems odd to
me about DocBook is that it is targeted so explicitly at computer
documentation.  Many of its tags make no sense outside that context.  So we
have modified the schema not just by adding elements for linguistics and
literate programming, but by removing many of the tags that are blatantly
irrelevant.  Computer documentation is the sort of thing that will, in most
cases, go out of date soon; and for that purpose, maybe Pilgrim is right
that HTML makes sense.  But there are plenty of domains for which people
write books that don't go out of date (ranging from poetry to archaeology),
and for which DocBook might make more sense to people if it didn't seem so
much like a geek's view of the world.  My 2/100 of a dollar...

   Mike Maxwell
   CASL/ U MD

Follow-Ups:
- Re: [docbook] Why use DocBook when there is HTML?
  - From: Scott Hudson <scott.hudson@flatironssolutions.com>

References:
- Why use DocBook when there is HTML?
  - From: "Mauritz Jeanson" <mj@johanneberg.com>
- RE: [docbook] Why use DocBook when there is HTML?
  - From: "Barton Wright" <barton.wright@streambase.com>