RE: [dita] index terms

See comments below.

JoAnn

JoAnn T. Hackos, PhD
President
Comtech Services, Inc.
710 Kipling Street, Suite 400
Denver, CO 80215
303-232-7586
joann.hackos@comtech-serv.com
www.comtech-serv.com

From: Erik Hennum [mailto:ehennum@us.ibm.com]
Sent: Friday, September 30, 2005 10:52 PM
To: dita@lists.oasis-open.org
Subject: RE: [dita] index terms

Hi, Esteemed TC:

Before incorporating older publishing approaches in DITA, we should consider whether those approaches support the topic architecture and reuse.

Regarding index markers, I'd submit that most writers don't think they're indexing a point. They think they're indexing the content around the index marker.

If we want to be specific about our interpretation of a standalone index marker, a few alternatives are obvious:

Treat the index marker as occurring within a range of indexed content with an unspecified but nearby start and end point.
Treat the index marker as occurring at the start of the range of indexed content where the end of the range is unspecified but nearby.
Treat the container element for the index marker as delimiting the range of indexed content. [I think this is a good alternative. Typically index markers are vaguely associated with a paragraph, heading, or other container element. They may index a particular word in context but they’re not intended to point to that word, at least not in a professional index. Readers want to find information in a text, not words.]

None of the three approaches does violence to the fundamental assumption that the indexed content is around the index marker. Each of the three can lead to surprises for the writer.

Start and end markers, however, pose problems for reuse. Taking up the problem raised by JoAnn, let's say you want to index a range of three topics about web applications and put a start marker at the start of the first topic and an end marker at the end of the third topic:

<topichead "Creating a web storefront">
<topicref "Installing the application server" ... /> 
<topicref "Common security policies for eCommerce" .../>
<topicref "Developing web applications" ... /> 
</topichead>

In another information set, however, the start and end topics are organized in a different way:

<topichead "Developing server applications">
<topicref "Developing web applications" ... /> 
<topicref "Developing database applications" ... />
...
</topichead>
<topichead "Server administration">
<topicref "Configuring LDAP" ... />
<topicref "Installing the application server" .../> 
...
</topichead>

In the second information set, the end marker precedes the start marker. Worse, content completely unrelated to web applications is in the middle of the range. Worst, there's no way to fix the problem for the second deliverable without invalidating the first deliverable.
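
The ordering problem can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each information set is flattened to a reading order containing just the topics named above (the elided "..." entries are left out), and resolve_range and between are hypothetical helpers, not part of any DITA processor.

```python
# Flattened reading order of each information set, using only the topics
# named in the message (topics elided with "..." are left out).
STOREFRONT = [
    "Installing the application server",
    "Common security policies for eCommerce",
    "Developing web applications",
]
SERVER_DOCS = [
    "Developing web applications",
    "Developing database applications",
    "Configuring LDAP",
    "Installing the application server",
]

START = "Installing the application server"  # topic carrying the start marker
END = "Developing web applications"          # topic carrying the end marker


def resolve_range(order, start, end):
    """Return the topics from start to end inclusive, or None if end comes first."""
    i, j = order.index(start), order.index(end)
    if j < i:
        return None  # the end marker precedes the start marker
    return order[i:j + 1]


def between(order, a, b):
    """Return the topics strictly between a and b, whichever comes first."""
    i, j = sorted((order.index(a), order.index(b)))
    return order[i + 1:j]


print(resolve_range(STOREFRONT, START, END))   # all three storefront topics
print(resolve_range(SERVER_DOCS, START, END))  # None
print(between(SERVER_DOCS, START, END))        # the unrelated topics in the middle
```

The same pair of markers yields a clean three-topic range in the first ordering and no usable range in the second. Swapping the endpoints to force a range would pull in the database and LDAP topics.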

The problem is architectural: properties that span multiple topics should be specified in the map context and not in the topic content.

We could move the start and end markers into the map itself:

<topichead "Creating a web storefront">
<topicref "Installing the application server" ...>
<topicmeta>
<keywords>
<indexterm>Web applications
<index-range-start/>
</indexterm>
</keywords>
</topicmeta>
</topicref>
<topicref "Common security policies for eCommerce" .../>
<topicref "Developing web applications" ...>
<topicmeta>
<keywords>
<indexterm>Web applications
<index-range-end/>
</indexterm>
</keywords>
</topicmeta>
</topicref>
</topichead>

Let's say you add conditional metadata, however, and filter out the start topic, the end topic, or both. It's ambiguous whether to apply the index term to the middle topic. Maybe the middle topic belongs in the indexed range only as part of a sequence including the start and end topics.
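
A rough Python sketch of that ambiguity follows. It assumes one naive policy, chosen only for illustration: a topic is indexed when it falls after a surviving start marker and up to a surviving end marker. The function name and the (title, marker) representation are hypothetical.

```python
# Each topic is paired with the range marker it carries, if any.
TOPICS = [
    ("Installing the application server", "start"),
    ("Common security policies for eCommerce", None),
    ("Developing web applications", "end"),
]


def index_after_filtering(topics, excluded):
    """Apply conditional filtering, then resolve the range naively.

    Returns (indexed_titles, unmatched), where unmatched lists the
    markers that were lost because their topic was filtered out.
    """
    surviving = [(t, m) for t, m in topics if t not in excluded]
    kept_markers = {m for _, m in surviving if m}
    unmatched = sorted({"start", "end"} - kept_markers)
    indexed, inside = [], False
    for title, marker in surviving:
        if marker == "start":
            inside = True
        if inside:
            indexed.append(title)
        if marker == "end":
            inside = False
    return indexed, unmatched


print(index_after_filtering(TOPICS, set()))
print(index_after_filtering(TOPICS, {"Installing the application server"}))
print(index_after_filtering(TOPICS, {"Developing web applications"}))
```

Under this policy the middle topic loses its entry when the start topic is filtered out, but keeps it (in a range that never closes) when the end topic is filtered out. A different policy would give different answers, and the markers alone do not say which one the writer intended.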

More importantly, it is much more natural to leverage the grouping provided by the parent element:

<topichead "Creating a web storefront">
<topicmeta>
<keywords>
<indexterm>Web applications</indexterm>
</keywords>
</topicmeta>
<topicref "Installing the application server" ... />
<topicref "Common security policies for eCommerce" .../>
<topicref "Developing web applications" ... />
</topichead>
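
To make the grouping idea concrete, here is a small Python sketch that applies a container's index terms to every topic beneath it. It models the behavior argued for in this message and makes no claim about how any DITA processor treats topicmeta. The well-formed map is a stand-in for the shorthand above: the navtitle attribute and the omitted href values are assumptions, and own_terms and collect are hypothetical helpers.

```python
import xml.etree.ElementTree as ET

# Well-formed stand-in for the shorthand map in the message. The navtitle
# attribute and the omission of href are assumptions made for this sketch.
MAP = """
<topichead navtitle="Creating a web storefront">
  <topicmeta>
    <keywords>
      <indexterm>Web applications</indexterm>
    </keywords>
  </topicmeta>
  <topicref navtitle="Installing the application server"/>
  <topicref navtitle="Common security policies for eCommerce"/>
  <topicref navtitle="Developing web applications"/>
</topichead>
"""


def own_terms(node):
    """Index terms declared directly in this node's topicmeta."""
    return [t.text.strip() for t in node.findall("topicmeta/keywords/indexterm")]


def collect(node, inherited=(), excluded=frozenset()):
    """Map each surviving topicref title to its own plus inherited index terms."""
    terms = list(inherited) + own_terms(node)
    result = {}
    if node.tag == "topicref":
        result[node.get("navtitle")] = terms
    for child in node:
        is_map_node = child.tag in ("topichead", "topicref")
        if is_map_node and child.get("navtitle") not in excluded:
            result.update(collect(child, terms, excluded))
    return result


root = ET.fromstring(MAP)
print(collect(root))
print(collect(root, excluded={"Installing the application server"}))
```

Because the term hangs on the container rather than on a start and end pair, filtering out any one child leaves the remaining topics indexed without a dangling marker.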

Finally, one of the main reasons for tagging is to define semantic units. Why wouldn't we want to take advantage of those semantic units when indexing?

In summary, defining a range with start and end points works better for a single, static discourse flow than for topics that can be organized in many different ways. [I believe Erik has stated the issues correctly here. I wonder if we might define a best practice that does not include ranges, for all the reasons Erik has provided above. The purpose of a page range in an index is to indicate to the reader that the topic is covered more thoroughly there than in other references. A reader would select the page range first because that would indicate a longer discourse than a single page reference. Of course, none of this applies to the way indexes work in help systems; page ranges don’t apply. Perhaps we should not support page ranges at all in a topic architecture but rather provide another way to indicate the “preferred” reference for a topic. I suspect that no one ever looks at the last page of a range but always turns to the first page of the range. The hierarchical arrangement of topics in a map and in the rendering would also indicate a range if the referenced topic is at a higher level than several subsequent topics. If we can add an attribute that indicates a “preferred” or “primary” reference to a subject, that might take care of the reader’s requirement.]

Regarding synonyms, it should at least be possible to maintain associations between controlled vocabularies globally. I've known publications departments (for instance, at Informix) that maintained all of the see synonyms at the end of the introduction because (by definition) a see synonym isn't associated with any particular piece of content. In DITA, however, the map is a much more natural place to maintain definitions that aren't associated with specific content. [How would this work?]

Bruce has a good point about centralizing index labels through conref. Especially if keyrefs can be used in conref, that would seem to meet the requirement for being able to maintain index labels centrally. [I don’t exactly follow. How would the centralized index labels be maintained through conref and keyref?]

Because an index term is always about content, an about-href attribute on indexterm would likely be overkill. Topicref already gives you the ability to attach an index term to a topic. (In passing, a popup for associative index links might be more useful than a ring of links.)

A last consideration. The <term> and <keyword> elements delimit controlled vocabularies that are embedded in the discourse. Should the writer have to add an index marker to index such instances of controlled vocabularies? Or would we be better off indexing delimited vocabularies (possibly under the control of policies)? [If I’m following this correctly, it might lead to a concordance rather than an index. You do not want to index all instances of a term or keyword but only those that link to relevant information to which the term is a key.]

What do you think?

Erik Hennum
ehennum@us.ibm.com
