
Subject: Indexterm: page ranges

From: "Chris Wong" <cwong@idiominc.com>
To: "DITA-TC (E-mail)" <dita@lists.oasis-open.org>
Date: Mon, 3 Oct 2005 17:35:29 -0400

Thanks to all the TC for responding at length. I'd like to respond here at the very first topic, Erik's "GENERAL" heading on an indexterm covering a unit of content. My difficulty in seeing indexterms this way -- apart from the fact that this is not how readers and authors would see them -- is that XML must be well formed. There can be only one hierarchy active. On the other hand, index entries can reflect completely orthogonal organizations. You can have index entries that overlap/straddle each other or their parent nodes. There is no reason to assume that an index entry range can exist within well-formed XML.

Indeed, an index range that merits its own container may face an ontological problem: according to Microsoft's manual of style, it should not exist. Such a sustained discussion can merit its own topic or should otherwise belong only in the table of contents. If it is part of the overall document structure, it probably is a candidate for the TOC, not the index. Readers use the index for other information.

Here's a concrete example. Suppose I wrote a task on how to change my car's spark plugs. The sequence goes something like:

1. I talk about gapping and prepping the new spark plugs here. I describe how to use the anti-seize compound in loving detail.
2. I talk about removing the old spark plugs. I describe use of my socket extension and then my torque wrench.
3. I talk about inserting the new spark plugs here. I caution about getting anti-seize compound in the wrong places. I mention my socket extension again.
4. I describe tightening the new spark plugs using my torque wrench in excruciating detail.

Suppose I want my reader to be able to look up where auto tools are used in my new masterpiece "Auto Misrepair for Dummies" book that incorporates this task. Using pseudo-XML notation, the relevant index entry ranges go like:

<anti-seize compound><socket extension><torque wrench>

</anti-seize compound></socket extension></torque wrench>

Apart from the fact that these ranges completely overlap each other, they also cross the task <step> element boundaries and child elements of <step>: cmd, info, substeps, tutorialinfo etc. Human languages are such annoyingly undisciplined things. That is why I felt compelled to propose page range start/end markers outside of the XML structure.
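
To make the overlap concrete, here is a minimal Python sketch (not part of the proposal) that checks whether a set of ranges could be written as nested containers. The positions are hypothetical ordinals for the first and last mention of each tool, chosen only to mirror the order of the pseudo-XML markers above; the function name `crossing_pairs` is also an assumption.

```python
def crossing_pairs(ranges):
    """Return pairs of range names that overlap without one containing the other.

    Such pairs cannot both be expressed as containers in well-formed XML.
    """
    names = sorted(ranges)
    crossings = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            a_start, a_end = ranges[a]
            b_start, b_end = ranges[b]
            overlap = a_start <= b_end and b_start <= a_end
            a_in_b = b_start <= a_start and a_end <= b_end
            b_in_a = a_start <= b_start and b_end <= a_end
            if overlap and not (a_in_b or b_in_a):
                crossings.append((a, b))
    return crossings


# Hypothetical positions: (first mention, last mention) in reading order.
ranges = {
    "anti-seize compound": (1, 4),
    "socket extension": (2, 5),
    "torque wrench": (3, 6),
}

crossings = crossing_pairs(ranges)
for a, b in crossings:
    print(f"{a!r} and {b!r} cross; they cannot nest")
```

With these assumed positions every pair crosses, so no choice of nesting order would make the three ranges well formed as elements.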

There are a few other things I want to cover from the discussion on page ranges:

A page range does not imply that the entry is the primary entry. It only implies length. Otherwise, an entry that contains many page-range references cannot tell us which one is primary. People sometimes indicate primary entries by setting the page number reference in bold. My colleague uses an entry like "XYZ, About" to similarly indicate it is primary. I did not address the ability to indicate a primary entry in the original proposal: is this a desirable feature apart from the page range issue?
Page ranges do not merely mean multiple occurrences of the term. The Chicago Manual of Style distinguishes between a continued discussion (e.g., 34-36) and individual references on a sequence of pages (e.g., 34, 35, 36). The ability to combine index entry references is not a substitute for explicit page ranges.
I understand the concerns regarding topic-spanning indexterms. I would like to point out that the current proposal disallows page range markers from starting in one topic and ending in another. For topic spanning, it mentions using indexterms at the map level and coalescing adjacent topics' indexterms. Would people be comfortable with a proposal that only allows the map-level method of spanning topics (i.e., jettisoning the latter alternative)? I'm talking about Erik Hennum's description of using the start/end range markers in a map's topicref's <topicmeta> element.
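
The Chicago distinction mentioned above can be sketched in a few lines of Python. This is only an illustration under my own assumptions: a locator is modeled as either a single page number or a `(start, end)` tuple, and `format_locators` is a hypothetical helper, not anything defined by DITA or by the proposal.

```python
def format_locators(locators):
    """Format index locators, keeping explicit ranges distinct from single pages.

    An int is an individual reference; a (start, end) tuple is a continued
    discussion that the author marked explicitly as a range.
    """
    parts = []
    for loc in locators:
        if isinstance(loc, tuple):
            start, end = loc
            parts.append(f"{start}-{end}" if end > start else str(start))
        else:
            parts.append(str(loc))
    return ", ".join(parts)


continued = format_locators([(34, 36)])   # one sustained discussion
separate = format_locators([34, 35, 36])  # three unrelated mentions
print(continued)
print(separate)
```

The point of the sketch is that a formatter cannot derive the first form from the second: merging adjacent page numbers would claim a continued discussion that the author never asserted.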

Chris

-----Original Message-----
From: Erik Hennum [mailto:ehennum@us.ibm.com]
Sent: Friday, September 30, 2005 12:39 AM
To: Chris Wong
Cc: dita@lists.oasis-open.org
Subject: RE: [dita] Groups - DITA 1.1 Issue #45: Add See, See Also indexing elements (IssueNumber45.html) uploaded

Hi, Chris:

Interesting issues...

GENERAL. Fundamentally, what is an index term? In a topic architecture, I'd submit that we should regard an index term as a semantic label attached to a unit of content (such as a phrase, paragraph, list, table, section, topic, or collection of topics). We should not regard an index term as attached to a point within a discourse flow because a point doesn't have any meaning.

The following example

<p>...<indexterm>Application servers</indexterm>...</p>

declares

"This paragraph is about application servers."

That's true regardless of where the index term appears within the paragraph. To indicate that the index term applies only to a sentence, the writer could wrap a <ph> element around the indexed sentence. That is, the container of the index term defines the unit of content that's about application servers.

PAGE RANGES. From that perspective, we shouldn't need start and end markers for a range. By definition, the container specifies the range for the indexed unit of content. (For an index marker within a prolog, the effective container is the topic.)

A formatter might apply the rule that, if the container spans more than one or two pages (or some threshold controlled by a style policy), the generated index shows a page range. Otherwise, the formatter emits the start page for the container.

That way, the writer doesn't have to maintain page ranges depending on the output. If the writer starts with an index marker on a section but adds content to the section until it stretches to three pages (shudder), the writer doesn't have to change the index marker to start and end markers. If the section fits on a single page when output as 8 1/2 by 11 but flows over three pages when output as A5 (or whatever), the writer doesn't have to revise the topic depending on the output.
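
As a rough sketch of the formatter rule described above (assumptions: the formatter already knows the first and last page of the container after pagination, and the names `page_reference` and `threshold` are invented for illustration):

```python
def page_reference(start_page, end_page, threshold=2):
    """Emit a page range only when the container spans more than `threshold` pages.

    `threshold` stands in for the style-policy setting; 2 is an arbitrary default.
    """
    span = end_page - start_page + 1
    if span > threshold:
        return f"{start_page}-{end_page}"
    return str(start_page)


# The same section, paginated for two hypothetical page sizes.
letter_ref = page_reference(12, 12)  # fits on one page
a5_ref = page_reference(17, 19)      # flows over three pages
print(letter_ref, a5_ref)
```

The source markup is identical in both cases; only the pagination fed to the rule differs.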

In the implementation, during the topic merge phase, the preprocessor could insert processing instructions at the start and end of the container if convenient for easy processing of the range.

If you find yourself wanting to index a range of content that's a subset of a container, you should ask yourself whether the content merits a container. That is, requiring that semantic units have containers is consistent with the topic-oriented approach of assembling larger structures from small, granular, typed units of content.

In passing, the same ambiguity that came up for the <data> element rears its ugly head here. If I put an index marker within a topicmeta for a topicref, should the range be the referenced topic or the entire branch of the map? Do we need a systematic way to distinguish the properties of the referenced topic from the properties of the referencing collection?

SEE vs SEE ALSO. I'm wondering if we could produce both outputs correctly from a single element that expresses synonyms for index terms. As I understand the publishing convention for "see" and "see also," the correct tag depends on which terms have instances:

If both the source term and target term for the synonym have instances, the formatter should generate a "see also" on the source.
If only the target term for the synonym has instances, the formatter should generate a "see" on the source.
If the target term for the synonym doesn't have instances, the formatter should ignore the synonym (and potentially generate a warning).

In other words, the same synonym might be a "see" or "see also" or nothing, depending on whether the aggregating map has assembled topics that have instances of the source and target term.
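
The three rules above could be sketched as follows. This is a minimal illustration, assuming the formatter has already counted the instances of each term in the assembled map; `resolve_synonym` and `instance_counts` are hypothetical names.

```python
def resolve_synonym(source, target, instance_counts):
    """Decide how a source -> target synonym should appear in the generated index.

    Returns "see also", "see", or None (synonym ignored; a caller might warn).
    """
    source_has = instance_counts.get(source, 0) > 0
    target_has = instance_counts.get(target, 0) > 0
    if not target_has:
        return None
    return "see also" if source_has else "see"


# Hypothetical instance counts gathered from the assembled topics.
counts = {"Jabberwocky": 3, "Travel destinations": 2, "Bandersnatch": 0}

both = resolve_synonym("Jabberwocky", "Travel destinations", counts)
target_only = resolve_synonym("Bandersnatch", "Jabberwocky", counts)
no_target = resolve_synonym("Jabberwocky", "Bandersnatch", counts)
print(both, target_only, no_target)
```

Because the result depends only on the counts, the same synonym definition can yield a different outcome in each map that assembles a different set of topics.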

GLOBAL SORTS AND SYNONYMS. I'd submit that it should be possible to declare sort keys and see / see also synonyms as global definitions rather than definitions associated with specific instances.

After all, what if an index term has a sort key in one instance and either no sort key or a different sort key in another instance of the index term?

Also, when the output is generated, a see / see also synonym applies to every instance of the index term rather than to a specific instance. Finally, the most typical reason for defining synonyms is to identify related content. Because the map controls the assembly of content, synonyms would sensibly be an aspect of assembly.

Perhaps it would make sense to define sorts and synonyms within the <keywords> element. That way, the common case (global definitions of sorts and synonyms) is easy, the edge case (content that requires sorts or synonyms) is awkward but possible, and index terms embedded within content don't provide a bulky distraction from the discourse flow.

Maybe something like the following:

<map>
  <topicmeta>
    <keywords>
      ...
      <index-definitions>
        <sort-term>
          <term>The Jabberwocky</term>
          <sort-as>
            <term>Jabberwocky</term>
          </sort-as>
        </sort-term>
        ...
        <index-synonym>
          <indexterm>The Jabberwocky
            <indexterm>habitat of</indexterm>
          </indexterm>
          <index-related>
            <indexterm>Travel destinations</indexterm>
          </index-related>
        </index-synonym>
        ...
      </index-definitions>
    </keywords>
  </topicmeta>
  ...
</map>
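
To show how a processor might consume global sort definitions like these, here is a small Python sketch using the standard library's `xml.etree.ElementTree`. The element names (`index-definitions`, `sort-term`, `sort-as`) come from the tentative markup above and are not existing DITA elements; the sample string, the entry list, and the function name `global_sort_keys` are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# A trimmed copy of the tentative markup, without the elided parts.
SAMPLE = """
<index-definitions>
  <sort-term>
    <term>The Jabberwocky</term>
    <sort-as><term>Jabberwocky</term></sort-as>
  </sort-term>
</index-definitions>
"""


def global_sort_keys(xml_text):
    """Collect a mapping from index label to sort key from sort-term elements."""
    root = ET.fromstring(xml_text)
    keys = {}
    for sort_term in root.iter("sort-term"):
        label = sort_term.findtext("term")
        sort_as = sort_term.findtext("sort-as/term")
        if label and sort_as:
            keys[label.strip()] = sort_as.strip()
    return keys


sort_keys = global_sort_keys(SAMPLE)
entries = ["Travel destinations", "The Jabberwocky", "Application servers"]
# Entries without a global definition sort under their own label.
ordered = sorted(entries, key=lambda e: sort_keys.get(e, e).lower())
print(ordered)
```

Since the sort key is looked up once per label, every instance of a term sorts the same way, which avoids the conflicting-instance problem raised earlier.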

In passing, the keyref proposal (#40) should make it possible to index the topic content but assign the labels to those index terms in the map. Producing a good index often requires adjusting the labels based on the labels of the other indexed content. Having to go back into the content to align index terms is an enormous pain and an inhibitor for reuse -- especially if you'd like to freeze the content but perform final production on the index.

MISCELLANEOUS. I'd agree with Paul that, with <indexterm> (as with <section>), there's an implied structure on the content that can only be validated by an XML parser when the grammar can impose constraints on mixed content models. Regarding linking, a generated index in HTML or PDF output should have links to the instances of the index terms. I suppose the instances of a term could sensibly link to one another as a convenience (if the hotspot isn't too distracting).

What do you think?

Erik Hennum
ehennum@us.ibm.com

"Chris Wong" <cwong@idiominc.com> wrote on 09/28/2005 09:15:28 AM:

> I'm kind of surprised to see no questions or objections so far to
> this proposal. I hear that people can have strong opinions about
> this subject. I'd like to see any debate get underway so we will
> have time to move this issue forward. Anyone?
