

Subject: My comments about DITA applicability to requirements and a separate standard for specialization


There has been much good discussion the past two days stemming from Eliot Kimber's concerns about whether DITA meets his clients' requirements, and his arguments for making DITA's specialization mechanism a separate standard on its own merit, without being tied to the DITA document types.  Michael Priestley has made some excellent counterpoints, and Paul Grosso and Joanne Hackos have also made good comments.

Here's what I think about the issues, speaking as someone who is square in the middle of a DITA implementation trial for my company.  FWIW, I have past experience in the SGML world, having designed and implemented a full-scale, CMS-based SGML-authoring environment with custom doctypes at a previous employer.  The DITA trial at my current employer is my second foray into implementing an SGML/XML authoring environment.

I've been at my current employer for 6 years, and I'd like to note that until DITA came along, I had a single response each and every time that the question arose regarding whether or not my company should migrate to XML.  That response was:  "Don't do it. It's too much pain for not enough gain."  Technical writing professionals who know a *little* bit about SGML/XML tend to buy into the belief that merely by switching to XML, you magically gain all manner of "single-sourcing" benefits, content interchange either within a large organization or with business partners, and automated, query-based document assembly.  I'm sure I don't have to point out to the members of this list how it doesn't turn out to be quite so simple.

DITA was the first emerging XML standard/technology I'd seen that actually seemed to mitigate some of the long-standing problems with SGML/XML implementation.  It squarely addressed these problems on two fronts:

A -- DITA's mechanism of specialization and generalization, both for content source *and* for output processing, enables a company to effectively create its own "custom" doctypes and yet interchange them with other groups using different custom doctypes.  As long as each group, whether different organizations within your company or external business partners, is using doctypes built by valid specializations of DITA base classes, then all manner of content interchange and reuse is relatively painless.  Sure, you are somewhat constrained by the "topic-oriented architecture" of the base DITA doctype designs, but it's certainly much better than trying to standardize on a single rigid doctype like DocBook. And the "topic-oriented architecture" of DITA is still extremely generalized, which makes it relatively easy to specialize into extremely diverse structures.
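
For illustration (the element and module names here are made up, not part of any shipped DITA doctype): a specialized element carries a class attribute, normally supplied as a default value by its DTD, that records the element's ancestry back to a base class.  A DITA-aware processor or partner who has never seen our doctype can fall back to treating the element as its base:

   <!-- hypothetical specialization of <note>; the class value records its ancestry -->
   <winMessage class="- topic/note myTopic/winMessage ">Close all open
   windows before continuing.</winMessage>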

B -- DITA's topic/map structure and the extremely generalized structure of the base TOPIC doctype and MAP doctype.  You can specialize a *huge* variety of custom doctypes from the base classes in TOPIC, making it easy for your content authors to develop practically any kind of small, reusable chunk of content.  And the MAP doctype is itself able to be specialized, if necessary, to make it easier for your information product authors to create multiple types of online or hardcopy-based information products. The flexible nature of a DITA map enables information product authors to design any manner of hierarchical structure appropriate for "book" type outputs or "help" type outputs or "web presentation" outputs.
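
For example, a very small map (the topic file names are made up) that assembles a handful of topics into a hierarchy might look like this, and the same topics could be re-assembled differently in another map for another deliverable:

   <map title="Installing the Widget Server">
     <topicref href="install_overview.dita">
       <topicref href="install_prereqs.dita"/>
       <topicref href="install_steps.dita"/>
     </topicref>
   </map>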

My point here is that it's both A and B that make DITA a value proposition to a company like mine.  The mechanism of specialization-generalization *alone* means nothing to my company, nor would it make me recommend taking the plunge to XML-based authoring here.  With the specialization mechanism alone, we would still be faced with modeling and creating our own custom doctypes completely from scratch.  We would also be faced with coding all of our output transforms from scratch.  Even worse, we would be locked out of any possibility of content interchange with our business partners.

But when A and B are taken together, we have a winning value proposition.  We have an extremely flexible basic architecture that enables us to create doctypes that reflect our own particular semantics and information models, yet our custom doctypes can still be mixed and matched in any combination in the DITA maps that we use to create our information products.  And there is great flexibility in the structures we can create in each information product.  Even better, we can exchange content with any other organization, internal or external, as long as we're all based on DITA. Also, we have a robust set of basic processing code that already performs many useful transforms.  We don't have to reinvent the wheel nor build an entire XML-based system from scratch.  We only have to *add* little pieces here and there to give us the additional functionality that we need. Everything that comes with the DITA public package can be used as-is, or extended with small, modular tweaks.  We can even interchange our specializations to the transforms, if we develop them well.  In similar fashion, we might be able to take advantage of publicly offered transforms from the DITA community, saving us even more time in developing and maintaining our own XML implementation.

So I see no significant value in trying to create a separate standard, either now in 1.0 or in some future version of DITA, just for DITA's specialization-generalization concept and the one transform that is generic enough to be considered useful as a standalone transform for such a standard.  Instead, I think we could provide sufficient value to the XML community by publishing a simple white paper as part of the DITA package that:

* explains the concept and implementation of the specialization-generalization mechanism and the concept of "base classes"
* explains the code used in generalize.xsl
* explains how to adapt the techniques used in generalize.xsl to overlay a specialization-generalization mechanism onto *any* XML doctype (the core idea is sketched after this list).
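
For what it's worth, the core idea behind generalize.xsl is small enough to sketch here.  This fragment is illustrative only, not the shipped code: it matches any specialization of <section> by the base-class token in its class attribute and writes out the base element instead:

   <xsl:template match="*[contains(@class,' topic/section ')]">
     <section>
       <xsl:copy-of select="@*"/>
       <xsl:apply-templates/>
     </section>
   </xsl:template>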

---------

As for Eliot's concerns about the DITA doctypes not having sufficiently diverse generic base classes from which to specialize doctypes suitable to his clients' requirements, we had arrived at the same general principles that Michael Priestley has pointed out in some of his replies on this recent subject:

1. We consider the TOPIC doctype as generic enough and flexible enough to specialize into any structure we need.  We do this primarily by:

*  using <topic> as our base class for containerizing modular chunks of content.  The root <topic> is our doctype, and we use nested <topic> elements to give us as many heading levels as we might need for a given doctype. We consider these nested <topic> elements as "sub-topics" that would *generally* not be used on their own, outside the context of the doctype in which they reside, but they can be reused out of context via <topicref> in a DITA map or via a CONREF attribute into another doctype (a small sketch of both appears after this list).

*  using <body> as our base class for controlling the ordinality/cardinality of titled sub-divisions or block elements in a topic.  I cannot see a need to create peer elements of <body>, which Eliot raised in one of his emails in this "thread."  If you need peer elements of body, aren't you effectively creating multiple topics?  You can use a specialization of the DITABASE doctype to achieve this effect if you need to have multiple topics at the same peer level but containerize them together (inside the <dita> element). Or you can nest multiple topics at the same peer level after the body of a "root" topic.

*  using <section> as our base class for any kind of titled subdivision within a topic

*  using <p> as our base class for any kind of untitled block element

*  using <fig> as our base class for any kind of titled block element

*  using <ph> as our base class for any kind of inline element

*  using <sl> as our base class for any kind of simple list element

*  using <ul> as our base class for any kind of block-containing list element

*  using <dl> as our base class for any kind of term/def pair list element

*  using <simpletable> as our base class for any kind of untitled table element

*  using <table> as our base class for any kind of titled table element
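
Regarding the reuse of sub-topics mentioned in the first bullet above, here is a rough sketch of both flavors (the file and id names are made up):

   <!-- reuse a nested sub-topic in a different deliverable via a map -->
   <topicref href="installing.dita#install_prereqs"/>

   <!-- or pull a shared element into another topic via CONREF;
        the referencing element and the target must share a compatible type -->
   <section conref="installing.dita#install_prereqs/safety_notes"/>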

We have not encountered an internal model yet that we cannot accommodate with specializations from these base classes.  Sure, some of them may not be exact semantic "ancestors" to the content type we create, but as Michael pointed out, it's a "so what?" situation.  The new, specialized element itself is what denotes the semantics, not the name of the base class.  Also, when DITA evolves to contain new base classes in the TOPIC doctype, it's a trivial task to redefine the CLASS attributes in our custom doctypes to point to a new, more semantically equivalent base class.
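
To make that concrete (the element and module names are hypothetical, and the real DITA declarations carry more attributes than I show here), a specialized element in one of our custom modules looks roughly like this, and repointing it at a different base class later is a one-line change to the class default:

   <!ELEMENT errorDesc      ((%section.cnt;)*) >
   <!ATTLIST errorDesc      class CDATA "- topic/section myTopic/errorDesc " >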

2.  We know that it's a no-no to attempt to *add* a new base class of any sort to the TOPIC doctype.  Really, when we want to *add* anything, we've been able to do so by specializing one of the designated base types I mention above.  As for *changing* the content model of a base class to suit our needs, there's a relatively simple and safe way to "break" DITA without sacrificing the ability to exchange content with other organizations.

You can safely "break" DITA and redefine a base class as long as you make your redefinition more restrictive than the original.  For example, it's safe to redefine SECTION.CNT in topic.mod as follows:

From:  <!ENTITY % section.cnt          "#PCDATA | %basic.ph; | %basic.block; | %title; |  %txt.incl;">

To:      <!ENTITY % section.cnt          "%title;, (%basic.block; | %txt.incl;)*">
           <!--[ehixson] changed this fragment model to be more restrictive than the original--> 

           <!ELEMENT section         (%section.cnt;) >
           <!--[ehixson] removed * cardinality operator to accommodate new model fragment for section.cnt-->

The trick to breaking DITA in this fashion is to do so *only* in the DTDs used in your authoring tools.  For all your output processing, you still use the unmodified DTDs from base DITA.  In other words, for the DTDs you put together for your custom doctypes:

* for authoring tools, you embed yourVersionOfTopic.mod in your DTD.
* for output processing, you embed topic.mod in your DTD.

What this buys you is a large degree of restrictive flexibility for your authors, yet your document instances are all still valid against base DITA and will process just fine.  You get to use all the base element names as-is, which makes it easy to introduce contractors or new writers who are familiar with DITA from previous jobs (as opposed to retraining them on your different element names because you chose to do this the hard way and create specialized variants of practically every base element just because you wanted to remove <lines> and <lq> from the model fragment for BASIC.BLOCK).
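
As a rough sketch of the two driver DTDs (file names per my earlier example; any public identifiers and catalog plumbing are up to you), the only difference between them is which module gets pulled in:

   <!-- authoring driver: pulls in the restricted module -->
   <!ENTITY % topic-type  SYSTEM "yourVersionOfTopic.mod" >
   %topic-type;

   <!-- processing driver: pulls in the stock DITA module -->
   <!ENTITY % topic-type  SYSTEM "topic.mod" >
   %topic-type;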

The *only* danger with this approach is with regard to support for CONREF in vendor tools.  Currently, for example, Arbortext's Epic Editor handles CONREF inclusions in a manner that does not attempt to validate the CONREFed content against the current doctype.  In the context of my example with the <section> element, this means you can safely CONREF a <section> from some third-party source that contains nothing but #PCDATA.  In my "broken" doctype, a <section> with only #PCDATA would be invalid.  Fortunately, Epic doesn't care about that and displays the CONREFed section without errors.  When I go to run my document instance through output processing, I'm then using a version of my doctype that uses the unmodified version of <section>, so when the processing parser sees that <section> with #PCDATA, it has no problem because that content model is valid in the DTD I use for processing.

