docbook-apps message

Subject: Re: [docbook-apps] Capturing phrase books and dictionaries
From: Mike Maxwell <maxwell@umiacs.umd.edu>
To: Lech Rzedzicki <xchaotic@gmail.com>
Date: Thu, 18 Mar 2010 12:11:19 -0400

Lech Rzedzicki wrote:
> We're trying to keep our markup close to DB5 but we also want to
> tighten the schema a bit further.
> One area we're particularly struggling with is phrase books and
> dictionaries. This was originally modelled using TEI and reflects the
> actual structure quite well.
> The problem we have is that both in the original language portion
> (form) and in the target language explanation (sense) we need to
> allow many optional elements such as example, pronunciation, often
> multiple times (as there can be many forms or senses or many examples
> for each sense or form). Gradually this led us to a very complex and
> loose model which also doesn't maintain the relationship between the
> original and translation too well.
> 
> I was wondering if any of you have any experience dealing with similar
> content and whether you could share your experience and schemas?

We are working a lot with XML-based bilingual dictionaries (not phrase 
books, although they may be similar).  I think the bottom line is, don't 
use DocBook for dictionaries (at least not for the body of the 
dictionary, i.e. all the entries).  It just isn't the same kind of 
structure.

TEI-encoded dictionaries tend to reflect the structure of the print 
dictionary from which the electronic form was derived.  That has a 
couple of advantages:
1) It's easy(-er) to convert from the print form to the electronic form, 
and to go back later and check that you did it right.
2) It makes producing a new print copy of the dictionary that looks like 
the original print dictionary easy(-er).

It also has some disadvantages:
1) Unless you're working with a bunch of similar dictionaries from a 
single publisher, you're likely to wind up with a large number of 
schemas (or DTDs), one for each dictionary, and that can be hard to manage.
2) The large number of schemas in (1) also means that you probably have 
to write a different CSS (or whatever you use) for each one.
3) You're limited to a single presentation form, i.e. it is difficult to 
display a root-based dictionary as a stem-based dictionary.

What we (and probably most people who work with multiple electronic 
dictionaries) do instead is use a generic lexicon schema.  This 
flattens the overall structure of a typical print dictionary (e.g. 
subentries become entries on their own); the original structure is 
instead represented by xrefs (so a sub-entry and a minor entry both have 
pointers back to the main entry). One can then postpone until run-time 
decisions like root-based vs. stem-based presentation, or whether a 
given minor entry is displayed as a sub-entry or as an entry on its own 
(and perhaps alphabetized on its own, if that's relevant to the 
electronic display).  The run-time decisions are then implemented using 
one of two (or several) style sheets.
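
As a rough illustration of the flattened model described above, here is a 
minimal Python sketch.  It is only a sketch under stated assumptions: the 
Entry fields, the ids and the sample words are all made up for the example 
and are not taken from any actual lexicon schema.  Every entry is stored 
at the top level, a minor entry carries a pointer (xref) back to its main 
entry, and the two functions stand in for two style sheets.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    # All field names here are hypothetical, chosen for illustration only.
    id: str
    headword: str
    gloss: str
    main: Optional[str] = None  # xref: id of the main entry, if any


# Flat storage: sub-entries are entries in their own right.
LEXICON = [
    Entry("e1", "write", "to form letters"),
    Entry("e2", "writer", "one who writes", main="e1"),
    Entry("e3", "rewrite", "to write again", main="e1"),
    Entry("e4", "book", "bound pages"),
]


def flat_view(entries):
    """Every entry alphabetized on its own (stem-based style)."""
    return [e.headword for e in sorted(entries, key=lambda e: e.headword)]


def nested_view(entries):
    """Minor entries grouped under the main entry they point to
    (root-based style)."""
    subs = {}
    for e in entries:
        if e.main is not None:
            subs.setdefault(e.main, []).append(e)
    result = []
    mains = sorted((e for e in entries if e.main is None),
                   key=lambda e: e.headword)
    for e in mains:
        children = sorted(subs.get(e.id, []), key=lambda s: s.headword)
        result.append((e.headword, [s.headword for s in children]))
    return result


if __name__ == "__main__":
    print(flat_view(LEXICON))
    print(nested_view(LEXICON))
```

The same stored data feeds both views, so the choice between them can be 
left to display time rather than being baked into the markup.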

Going into more detail on this approach (as opposed to doing something with 
dictionaries inside DocBook) probably doesn't belong on this list. 
Fortunately there are lexicography mailing lists, e.g. the Lexicography 
list (see http://linguistlist.org/lists/get-lists.cfm).
-- 
    Mike Maxwell
    What good is a universe without somebody around to look at it?
    --Robert Dicke, Princeton physicist