Subject: attribute specialization as foundation

From: Erik Hennum <ehennum@us.ibm.com>
To: dita@lists.oasis-open.org
Date: Wed, 26 Apr 2006 21:05:41 -0700

Hi, DITA Committee Folk:

Since it came up, I'd like to summarize some ideas that have been brewing offline for a while now. Maybe the ambition for more significant attribute capabilities in the future can provide motivation for progress on attribute specialization now.

FWIW, a paper at last year's Extreme has more detail:

http://www.mulberrytech.com/Extreme/Proceedings/html/2005/Hennum01/EML2005Hennum01.html

Of course, the issues summarized herein require more thought and many perspectives to get right.

1. Specializing an attribute that takes a single value (not an enumeration)

If an element contains a value (that is, only text), a designer in DITA 1.0 can specialize that element by changing the name and restricting the value to specify a more precise semantic. For instance, we can specialize <apiname> as <javaClassName> or specialize <msgnum> to <httpErrorCode>.

In principle, the same kind of specialization should be possible for an attribute that takes a single value. For instance, a designer should be able to distinguish and enforce formats for the version, release, and modification attributes on <vrm>, for the id on <resourceid>, the content on <othermeta>, or the value on <state>.

In the same way that the specialized <parml> element can mandate a specialized <plentry> in its substructure, a specialized element should be able to mandate a specialized attribute. That ability to specialize an attribute as part of element substructure might be something to take on after DITA 1.1.
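To make the idea of enforcing a format on a single-value attribute concrete, here is a minimal Python sketch. The digits-only patterns for the attributes on <vrm> and the helper name invalid_attributes are assumptions made for illustration; nothing here is part of DITA itself.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical format rules for the attributes on <vrm>; the patterns are
# assumptions made for illustration, not part of any DITA specification.
VRM_FORMATS = {
    "version": re.compile(r"\d+"),
    "release": re.compile(r"\d+"),
    "modification": re.compile(r"\d+"),
}

def invalid_attributes(element, formats):
    """Return the sorted names of attributes whose values violate their format."""
    return sorted(
        name
        for name, pattern in formats.items()
        if name in element.attrib and not pattern.fullmatch(element.attrib[name])
    )

good = ET.fromstring('<vrm version="1" release="2" modification="0"/>')
bad = ET.fromstring('<vrm version="1" release="two" modification="0"/>')
print(invalid_attributes(good, VRM_FORMATS))  # []
print(invalid_attributes(bad, VRM_FORMATS))   # ['release']
```

A specialized element could carry such rules in the same way that it carries constraints on its child elements.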

2. Interoperability of a model over variant XML syntax

More fundamentally, could specialization treat a single-value attribute and a text-only subordinate element as interchangeable (a possibility that Bruce raised with respect to the <data> element)? For instance, could DITA recognize the following forms as identical?

<p owner="bjorn">It all began...</p>
<p><owner>bjorn</owner>It all began...</p>
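One way a process might treat the two forms as identical is to normalize both to the same (value, remaining text) pair. The following Python sketch does that; the function name normalize and the rule that the field appears at most once are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

def normalize(element, field):
    """Return (field value, remaining text) for either syntax.

    Assumes the field appears at most once, either as an attribute or as a
    text-only child element.
    """
    if field in element.attrib:
        return element.attrib[field], (element.text or "")
    child = element.find(field)
    if child is not None:
        # Text before the child plus the text that follows it.
        remaining = (element.text or "") + (child.tail or "")
        return (child.text or ""), remaining
    return None, (element.text or "")

as_attribute = ET.fromstring('<p owner="bjorn">It all began...</p>')
as_element = ET.fromstring('<p><owner>bjorn</owner>It all began...</p>')
print(normalize(as_attribute, "owner"))  # ('bjorn', 'It all began...')
print(normalize(as_attribute, "owner") == normalize(as_element, "owner"))  # True
```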

Building on that, could DITA recognize equivalence between the subdivision of a value into fields via a pattern and fields in the content delimited by subordinate elements? For instance, could a base instance of

<bookinfo publisher="Bjornsen, Bjorn"/>

be specialized via a field pattern of "'(\w+),\s+(\w+)', lastname, firstname" as

<bookinfo><publisherIndividual>
.... <lastname>Bjornsen</lastname>
.... <firstname>Bjorn</firstname>
</publisherIndividual></bookinfo>

Similarly, could a different base instance of

<bookinfo publisher="AMLW - Amalgamated Widgets"/>

be specialized via a different field pattern of "'(\w+) - (.+)', stock, company" (the second field must allow spaces to capture a multi-word company name) as

<bookinfo><corporatePublisher>
.... <stock>AMLW</stock>
.... <company>Amalgamated Widgets</company>
</corporatePublisher></bookinfo>

This account is only a sketch of a direction, but this capability would let designers specify text for general content and still allow specialized elements for precision.
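The field-pattern idea can be tried out with ordinary regular expressions. The Python sketch below expands the publisher attribute into the specialized element forms shown above. The RULES table and the function names are hypothetical, and the company field uses (.+) so that a multi-word name is captured whole.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical specialization rules: (wrapper element, pattern, field names).
RULES = [
    ("publisherIndividual", re.compile(r"(\w+),\s+(\w+)"), ("lastname", "firstname")),
    ("corporatePublisher", re.compile(r"(\w+) - (.+)"), ("stock", "company")),
]

def specialize(base):
    """Expand the publisher attribute of <bookinfo> into field elements."""
    value = base.attrib.get("publisher", "")
    for wrapper, pattern, fields in RULES:
        match = pattern.fullmatch(value)
        if match:
            result = ET.Element(base.tag)
            parent = ET.SubElement(result, wrapper)
            for name, text in zip(fields, match.groups()):
                ET.SubElement(parent, name).text = text
            return result
    return base  # no pattern applies; keep the general form

def to_xml(element):
    """Serialize an element to a string without an XML declaration."""
    return ET.tostring(element, encoding="unicode")

person = specialize(ET.fromstring('<bookinfo publisher="Bjornsen, Bjorn"/>'))
company = specialize(ET.fromstring('<bookinfo publisher="AMLW - Amalgamated Widgets"/>'))
print(to_xml(person))
print(to_xml(company))
```

Because each rule is just a pattern plus field names, the reverse direction (rebuilding the attribute value from the fields) would need a matching output template, which this sketch leaves out.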

3. Bridging between definitions of controlled values and citations of controlled values

How might adopters define the controlled values for an enumeration -- especially in a way that permits extensibility of those values?

One possibility would be to use the key feature proposed for DITA 1.2 (credit to Mr. Priestley for that lightbulb):

- Use a specialized DITA topic to define the meaning of the controlled value (a meta topic, if you will).
- Use a specialized DITA map both to combine these definitional topics in groups (like operating system platform, machine type platform, audience education, and audience job) and to indicate semantic hierarchies within each group ("RedHat" is a special kind of "Linux," "appdev" is a special kind of "programmer").
- Assign a key (effectively, a local name) to each definitional topic.
- Use the keys as values in metadata attributes.

Benefits: The enumeration can be maintained by content creators without having to modify a schema definition. A process can still validate the enumeration (that is, check that the controlled values in topics have corresponding definitions). Where a controlled value without any definition might be ambiguous, a defined controlled value can be clarified by drilling down into the definitional topic. The definitions of controlled values can be shared easily between adopters and allow adopters to use different local names for the same thing (for instance, "linux" and "LinuxOS" and "unices.linux"). The taxonomic relationships can be maintained without forcing classification changes in the content. Definitional topics can be reused as content topics where the user would benefit from a definition of an unfamiliar concept. Finally, adopters can scale the formality of their practice from single controlled values to formal taxonomies without any change in their authoring infrastructure. (In fact, the DITA taxonomy specialization provides an implementation of the first two bullets above.)
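As a rough sketch of how a process might validate controlled values against their definitions and follow the hierarchy, consider the Python below. The KEY_DEFINITIONS table stands in for what a specialized map would provide; its keys, topic file names, and the two function names are invented for illustration.

```python
# Hypothetical key map: each key names a definitional topic and, optionally,
# the broader key that it is a special kind of. File names are invented.
KEY_DEFINITIONS = {
    "linux": {"topic": "linux.dita", "broader": None},
    "redhat": {"topic": "redhat.dita", "broader": "linux"},
    "programmer": {"topic": "programmer.dita", "broader": None},
    "appdev": {"topic": "appdev.dita", "broader": "programmer"},
}

def undefined_values(values, definitions):
    """Return the controlled values that have no definitional topic."""
    return sorted(set(values) - set(definitions))

def is_kind_of(key, ancestor, definitions):
    """Walk up the hierarchy to test whether key is a special kind of ancestor."""
    while key is not None:
        if key == ancestor:
            return True
        key = definitions[key]["broader"]
    return False

print(undefined_values(["redhat", "solaris"], KEY_DEFINITIONS))  # ['solaris']
print(is_kind_of("redhat", "linux", KEY_DEFINITIONS))            # True
print(is_kind_of("appdev", "linux", KEY_DEFINITIONS))            # False
```

Adding a new controlled value then means adding a topic and a key, with no change to a schema definition.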

4. Specializing an attribute that takes an enumeration

The two sides debating attribute specialization seem to focus on different things.

One side has a focus on the semantics of the attributes, submitting that, if you analyze your audience by education, by job role, or by both, you are still analyzing your audience.

The other side has a focus on the values, submitting that any enumeration of audience education values requires additional information to merge with any other enumeration of audience values.

The second side has a point. Even where the base and specialized attributes have a clear semantic relationship, the base enumeration would include values that are compounds of the values from the specialized enumerations. As a result, even in the best case, the mapping is likely to be complex and partial.

For example, if operating system and machine are special kinds of platform, adopters might need mappings similar to the following:

adopter 1 (base) .......... platform = ( bigiron | openserver | wintel | handheld )

adopter 2 (specialized) ... os = ( linux | macosx | windows )
........................... machine = ( macintosh | mainframe | pc | server )

mapping ................... platform( bigiron ) MATCHES machine( mainframe )
........................... platform( openserver ) MATCHES os( linux ) OR machine( server )
........................... platform( wintel ) MATCHES os( windows ) OR machine( pc )
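Written as data, the mapping above shows how partial it is. In this Python sketch, the PLATFORM_MAPPING table restates the three MATCHES rules; the function name and the assumption that each specialized attribute holds a single value are mine.

```python
# The three MATCHES rules expressed as data: each base platform value lists
# the (attribute, value) pairs, any one of which it matches.
PLATFORM_MAPPING = {
    "bigiron": {("machine", "mainframe")},
    "openserver": {("os", "linux"), ("machine", "server")},
    "wintel": {("os", "windows"), ("machine", "pc")},
}

def matching_platforms(specialized):
    """Return the base platform values matched by specialized attribute values.

    Assumes each specialized attribute holds a single value.
    """
    pairs = set(specialized.items())
    return sorted(
        platform
        for platform, alternatives in PLATFORM_MAPPING.items()
        if alternatives & pairs
    )

print(matching_platforms({"os": "linux", "machine": "pc"}))
# ['openserver', 'wintel']
print(matching_platforms({"os": "macosx", "machine": "macintosh"}))
# []
```

The second call returns nothing because macosx and macintosh have no counterpart in the base enumeration, and handheld has no counterpart in the specialized ones.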

So far as I know, no one wants to address that mapping challenge now. Besides, the DITA practice thus far has been to enable vocabulary agreements within communities in advance rather than to try to reconcile arbitrary vocabularies after the fact. So, let's acknowledge that we won't automate the mapping of values from different enumerations and thus won't automate integration of enumerations for conditional processing.

All that said -- does the first side have a point, too? If I need to enumerate the audience by education and know that I am providing an analysis of the audience, why should I be forced to treat audience education as if it were completely unrelated to audience? If I can declare that audienceEducation specializes audience, processes other than conditional processes that operate on audience semantics can recognize the values of audienceEducation as indicating something about the audience. For instance, a process might build an index of content by audience or by platform:

.......... bigiron
.......... handheld
.......... linux
.......... macintosh
.......... macosx
.......... mainframe
.......... openserver
.......... pc
.......... server
.......... windows
.......... wintel

Otherwise, each attribute that has processing that is sensitive to semantics will require a custom process.
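A generic process of that kind could be as small as the Python sketch below. The SPECIALIZES table is a hypothetical stand-in for whatever declaration DITA might eventually provide, and the sample document and function name are invented; values are assumed to be space-delimited.

```python
import xml.etree.ElementTree as ET

# Hypothetical declaration of which attribute specializes which base attribute.
SPECIALIZES = {
    "os": "platform",
    "machine": "platform",
    "audienceEducation": "audience",
}

def index_values(root, base):
    """Collect the values of a base attribute and of all its specializations."""
    names = {base} | {attr for attr, parent in SPECIALIZES.items() if parent == base}
    values = set()
    for element in root.iter():
        for name in names:
            # Attribute values are assumed to be space-delimited lists.
            values.update(element.attrib.get(name, "").split())
    return sorted(values)

doc = ET.fromstring(
    "<topics>"
    '<topic platform="bigiron handheld"/>'
    '<topic os="linux" machine="server"/>'
    '<topic os="windows" machine="pc"/>'
    "</topics>"
)
print(index_values(doc, "platform"))
# ['bigiron', 'handheld', 'linux', 'pc', 'server', 'windows']
```

The same function serves audience and any other base attribute, since only the declaration table changes.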

Hoping that's useful,

Erik Hennum
ehennum@us.ibm.com
