Subject: RE: [dita] Groups - DITA 1.1 Issue #45: Add See, See Also indexing elements (IssueNumber45.html) uploaded

From: "Grosso, Paul" <pgrosso@ptc.com>
To: <dita@lists.oasis-open.org>
Date: Fri, 30 Sep 2005 10:25:18 -0400

While I understand most of what Erik is saying on a theoretical level, I think we need to keep in mind how most users have thought of and used indexterms in SGML and XML markup for the past 20 years. And that is, indexterms are points. Changing that paradigm is going to surprise a lot of users.

And users are more used to giving what Chris calls "sort-as" when and where they mark up the indexterm, not in some separate sort-map concept.

So while Erik's ideas might be ivory tower nice, I fear they are too different from what users and implementors are familiar with. DITA is already enough of a learning curve for people--too much, I fear, in many cases--so I'm hesitant to take such a different track in defining indexterms than is common in other existing XML markup vocabularies.

paul

From: Erik Hennum [mailto:ehennum@us.ibm.com]
Sent: Thursday, 2005 September 29 23:39
To: Chris Wong
Cc: dita@lists.oasis-open.org
Subject: RE: [dita] Groups - DITA 1.1 Issue #45: Add See, See Also indexing elements (IssueNumber45.html) uploaded

Hi, Chris:

Interesting issues...

GENERAL. Fundamentally, what is an index term? In a topic architecture, I'd submit that we should regard an index term as a semantic label attached to a unit of content (such as a phrase, paragraph, list, table, section, topic, or collection of topics). We should not regard an index term as attached to a point within a discourse flow because a point doesn't have any meaning.

The following example

<p>...<indexterm>Application servers</indexterm>...</p>

declares

"This paragraph is about application servers."

That's true regardless of where the index term appears within the paragraph. To indicate that the index term applies only to a sentence, the writer could wrap a <ph> element around the indexed sentence. That is, the container of the index term defines the unit of content that's about application servers.

PAGE RANGES. From that perspective, we shouldn't need start and end markers for a range. By definition, the container specifies the range for the indexed unit of content. (For an index marker within a prolog, the effective container is the topic.)

A formatter might apply the rule that, if the container spans more than one or two pages (or some threshold controlled by a style policy), the generated index shows a page range. Otherwise, the formatter emits the start page for the container.
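As a rough illustration of that formatter rule (a sketch only, not something any existing DITA processor is known to do), the decision could look like the following Python. The function name index_locator and the threshold parameter are hypothetical; the default of two pages is just one reading of "one or two pages."

```python
def index_locator(start_page: int, end_page: int, threshold: int = 2) -> str:
    """Return the page reference shown in the index for one indexed container.

    threshold is a hypothetical style-policy setting: a container covering
    more than this many pages is shown as a range, otherwise only the
    start page is shown.
    """
    if end_page < start_page:
        raise ValueError("end_page must not precede start_page")
    pages_spanned = end_page - start_page + 1
    if pages_spanned > threshold:
        return f"{start_page}-{end_page}"
    return str(start_page)
```

With this rule, a section that fits on page 12 in one layout yields "12," and the same section flowing over pages 12 through 14 in another layout yields "12-14," with no change to the markup.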

That way, the writer doesn't have to maintain page ranges depending on the output. If the writer starts with an index marker on a section but adds content to the section until it stretches to three pages (shudder), the writer doesn't have to change the index marker to start and end markers. If the section fits on a single page when output as 8 1/2 by 11 but flows over three pages when output as A5 (or whatever), the writer doesn't have to revise the topic depending on the output.

In the implementation, during the topic merge phase, the preprocessor could insert processing instructions at the start and end of the container if convenient for easy processing of the range.

If you find yourself wanting to index a range of content that's a subset of a container, you should ask yourself whether the content merits a container. That is, requiring that semantic units have containers is consistent with the topic-oriented approach of assembling larger structures from small, granular, typed units of content.

In passing, the same ambiguity that came up for the <data> element rears its ugly head here. If I put an index marker within a topicmeta for a topicref, should the range be the referenced topic or the entire branch of the map? Do we need a systematic way to distinguish the properties of the referenced topic from the properties of the referencing collection?

SEE vs SEE ALSO. I'm wondering if we could produce both outputs correctly from a single element that expresses synonyms for index terms. As I understand the publishing convention for "see" and "see also," the correct tag depends on which terms have instances:

If both the source term and target term for the synonym have instances, the formatter should generate a "see also" on the source.
If only the target term for the synonym has instances, the formatter should generate a "see" on the source.
If the target term for the synonym doesn't have instances, the formatter should ignore the synonym (and potentially generate a warning).

In other words, the same synonym might be a "see" or "see also" or nothing, depending on whether the aggregating map has assembled topics that have instances of the source and target term.
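To make the three cases concrete, here is a minimal Python sketch of the decision, assuming the formatter has already collected the set of terms that have instances in the assembled map. The function name resolve_synonym and its arguments are hypothetical, chosen only for illustration.

```python
from typing import Optional, Set


def resolve_synonym(source: str, target: str,
                    terms_with_instances: Set[str]) -> Optional[str]:
    """Decide what a synonym from source to target generates in the index.

    Returns "see also", "see", or None (synonym ignored; a real formatter
    might also log a warning in that case).
    """
    if target not in terms_with_instances:
        # Nothing to point at, so the synonym is dropped.
        return None
    if source in terms_with_instances:
        # Both terms have instances.
        return "see also"
    # Only the target term has instances.
    return "see"
```

The same synonym can therefore come out differently for two maps, because terms_with_instances depends on which topics each map assembles.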

GLOBAL SORTS AND SYNONYMS. I'd submit that it should be possible to declare sort keys and see / see also synonyms as global definitions rather than definitions associated with specific instances.

After all, what if an index term has a sort key in one instance and either no sort key or a different sort key in another instance of the index term?

Also, when the output is generated, a see / see also synonym applies to every instance of the index term rather than to a specific instance. Finally, the most typical reason for defining synonyms is to identify related content. Because the map controls the assembly of content, synonyms would sensibly be an aspect of assembly.

Perhaps it would make sense to define sorts and synonyms within the <keywords> element. That way, the common case (global definitions of sorts and synonyms) is easy, the edge case (content that requires sorts or synonyms) is awkward but possible, and index terms embedded within content don't provide a bulky distraction from the discourse flow.

Maybe something like the following:

<map>
  <topicmeta>
    <keywords>
      ...
      <index-definitions>

        <sort-term>
          <term>The Jabberwocky</term>
          <sort-as>
            <term>Jabberwocky</term>
          </sort-as>
        </sort-term>
        ...
        <index-synonym>
          <indexterm>The Jabberwocky
            <indexterm>habitat of</indexterm>
          </indexterm>

          <index-related>
            <indexterm>Travel destinations</indexterm>
          </index-related>
        </index-synonym>
        ...
      </index-definitions>
    </keywords>
  </topicmeta>
  ...
</map>
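For what it's worth, a processor could consume global sort definitions like these fairly simply. The Python below is a sketch under the assumption that the element names are exactly the ones proposed above (index-definitions, sort-term, term, sort-as), which are not part of any DITA DTD; the helper names load_sort_keys and sort_index are likewise hypothetical.

```python
import xml.etree.ElementTree as ET

# A cut-down copy of the proposed (hypothetical) markup.
DEFINITIONS = """
<index-definitions>
  <sort-term>
    <term>The Jabberwocky</term>
    <sort-as>
      <term>Jabberwocky</term>
    </sort-as>
  </sort-term>
</index-definitions>
"""


def load_sort_keys(xml_text):
    """Map each index label to its globally declared sort key."""
    root = ET.fromstring(xml_text)
    keys = {}
    for sort_term in root.iter("sort-term"):
        label = sort_term.findtext("term")            # direct child only
        sort_as = sort_term.findtext("sort-as/term")  # nested sort key
        if label and sort_as:
            keys[label.strip()] = sort_as.strip()
    return keys


def sort_index(labels, sort_keys):
    """Sort labels by their global sort key, falling back to the label itself."""
    return sorted(labels, key=lambda label: sort_keys.get(label, label).casefold())
```

Because the key is looked up by label rather than read from each instance, every instance of "The Jabberwocky" sorts the same way, which sidesteps the conflicting-instances question raised above.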

In passing, the keyref proposal (#40) should make it possible to index the topic content but assign the labels to those index terms in the map. Producing a good index often requires adjusting the labels based on the labels of the other indexed content. Having to go back into the content to align index terms is an enormous pain and an inhibitor for reuse -- especially if you'd like to freeze the content but perform final production on the index.

MISCELLANEOUS. I'd agree with Paul that, with <indexterm> (as with <section>), there's an implied structure on the content that can only be validated by an XML parser when the grammar can impose constraints on mixed content models. Regarding linking, a generated index in HTML or PDF output should have links to the instances of the index terms. I suppose the instances of a term could sensibly link to one another as a convenience (if the hotspot isn't too distracting).

What do you think?

Erik Hennum
ehennum@us.ibm.com

"Chris Wong" <cwong@idiominc.com> wrote on 09/28/2005 09:15:28 AM:

> I'm kind of surprised to see no questions or objections so far to
> this proposal. I hear that people can have strong opinions about
> this subject. I'd like to see any debate get underway so we will
> have time to move this issue forward. Anyone?