
Subject: RE: [dita] index term theory

This is a mixed response to the messages from JoAnn Hackos and Erik Hennum.

My apologies in advance in case I summarize their ideas incorrectly in the course of this response.

Bruce Esrig


We need to use clear terminology to distinguish:
 - an index term occurrence in the source
 - an index entry in a generated index

JoAnn's notion of a controlled vocabulary term is helpful. In JoAnn's treatment, whether a term is a controlled vocabulary term is determined by looking in the generated index: if other terms point to it using "See", then the target is a controlled vocabulary term. The network of "See" pointers is well-formed only if no target of a "See" also has a "See".
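
The well-formedness rule could be checked mechanically once the index has been generated. The following is only a sketch under assumptions: the `see` mapping (from each entry to the entry its "See" points at) and both function names are hypothetical and are not part of any DITA proposal.

```python
def find_chained_see(see):
    """Return sorted "See" sources whose target itself has a "See".

    `see` is a hypothetical mapping: index entry -> entry its "See" points to.
    An empty result means the network is well-formed under the rule above.
    """
    return sorted(src for src, target in see.items() if target in see)


def controlled_terms(see):
    """Entries that are targets of a "See" and have no "See" of their own."""
    return sorted({target for target in see.values() if target not in see})


well_formed = {"automobile": "car", "motorcar": "car"}
chained = {"automobile": "motorcar", "motorcar": "car"}

print(find_chained_see(well_formed))  # []
print(find_chained_see(chained))      # ['automobile']
print(controlled_terms(well_formed))  # ['car']
```

In the second case "automobile" is reported because its target, "motorcar", is itself redirected elsewhere.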

JoAnn also gives distinguished status to the source of a "See also". The source of a "See also" should be a controlled vocabulary term. To permit modeling of related controlled terms, the source of a "See also" may also be the target of a "See also".

On page ranges and index term linking (to the indexterm instance), it may be necessary to use the location of the indexterm instance to determine how to do the processing. A page range (or link to the beginning of a topic in topic-delivery contexts) could be generated for indexterms in <keywords>, but a specific page (or location on the page in topic-delivery contexts) could be generated for indexterms in the <body>. Indexterm linking could be to the topic for indexterms in <keywords>, but to a specific location for indexterms in the <body>.
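
One way a processor might apply this distinction is sketched below. It is an illustration, not a description of any existing toolkit: the function name and the "topic" and "location" labels are hypothetical, and the sample topic is invented.

```python
import xml.etree.ElementTree as ET


def link_targets(topic):
    """Return (term text, granularity) pairs for a parsed topic element.

    Granularity is "topic" for indexterms inside <keywords> and
    "location" for indexterms inside <body>; both labels are hypothetical.
    """
    results = []
    for container, granularity in (("keywords", "topic"), ("body", "location")):
        for parent in topic.iter(container):
            for term in parent.iter("indexterm"):
                results.append(((term.text or "").strip(), granularity))
    return results


topic = ET.fromstring(
    "<topic id='t1'><title>Feeding</title>"
    "<prolog><metadata><keywords><indexterm>goldfish</indexterm>"
    "</keywords></metadata></prolog>"
    "<body><p>Feed twice daily.<indexterm>feeding schedule</indexterm></p></body>"
    "</topic>"
)
print(link_targets(topic))
# [('goldfish', 'topic'), ('feeding schedule', 'location')]
```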

Index sort information is required if translated indexes are to be generated automatically. It is possible that index sorting should be done by entering data in an attribute rather than entering it in an element. Can we assume that each index entry needs only one index sort entry? In support of this assumption, consider the argument that in a multilingual document, each index entry would be repeated once for each language.
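
If each entry does carry at most one sort key, the sorting step is simple. The sketch below assumes a hypothetical representation of entries as (term, sort key) pairs, and it compares keys by plain case-folded code points, which is not a substitute for real language-specific collation.

```python
def sort_entries(entries):
    """Sort (term, sort_as) pairs; sort_as may be None.

    Assumption: one optional sort key per entry, as discussed above.
    Entries without a key fall back to the term itself.
    """
    return sorted(entries, key=lambda e: (e[1] or e[0]).casefold())


entries = [
    ("Zebra", None),
    ("Éclair", "eclair"),  # the sort key files this entry under "e"
    ("apple", None),
]
for term, _ in sort_entries(entries):
    print(term)
# apple
# Éclair
# Zebra
```

Without the sort key, "Éclair" would sort after "Zebra" in this naive comparison, which is the misplacement JoAnn describes.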

Erik Hennum was interested in a way to specify that an indexterm actually applies to an enclosing element. Perhaps this could be done through an optional attribute that gives a conref-like reference to the element whose scope is covered by the indexterm. This would be necessary if indexterms could have scope other than a point or an entire topic. The reason that it would be necessary is the lack of a <keywords> or equivalent container for any element other than <topic>. Erik anticipated a similar difficulty for the <data> element. For the <data> element there is a conref-like attribute that serves to identify the scope to which the <data> element applies. The idea of defaulting to the next larger scope is an expedient in case the conref-like attribute is considered cumbersome.

In environments that deliver a compilation such as a book, there could be two conref-like attributes to mark the limits of a range. The range would be from the beginning of the first element referenced to the end of the last element referenced. It is hard to make ranges robust, however. If either end of the range is missing, the range becomes invalid and would not be reported in the generated index. The same would be true for Chris Wong's current method, which uses an element to flag the beginning or end of a range. Similarly, if the processing could not resolve the beginning and end of the range to page numbers, the page range would have to be dropped.
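
The failure cases above can be made concrete with a small sketch. Everything named here is hypothetical: `page_of` stands for whatever mapping from element id to page number the composition step produces, with the start id mapped to the page where that element begins and the end id mapped to the page where that element ends.

```python
def resolve_range(start_id, end_id, page_of):
    """Return (first_page, last_page), or None if the range must be dropped.

    `page_of` is a hypothetical mapping from element id to page number;
    ids that could not be resolved are simply absent from it.
    """
    if start_id is None or end_id is None:
        return None  # one end of the range is missing in the source
    first = page_of.get(start_id)
    last = page_of.get(end_id)
    if first is None or last is None:
        return None  # an end could not be resolved to a page number
    return (first, last)


pages = {"intro": 46, "summary": 49}
print(resolve_range("intro", "summary", pages))  # (46, 49)
print(resolve_range("intro", None, pages))       # None
print(resolve_range("intro", "missing", pages))  # None
```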

Erik's suggestion of centralizing the sort keys is a good practice that might be achieved by conref-ing the indexterms from a master file. This would be a good practice anyway in case terminology changes.

The idea of linking the instances of a term is very interesting ... although a ring of index instances of the same term might be difficult to establish (and especially difficult to maintain in case of dynamic topic-level updates to a collection). Otherwise, a back-link from the index instance to the index entry in the generated index would permit a star-topology traversal of all the possibilities, although in a context that supports history, that could be simulated using a Back button.

-----Original Message-----
From: JoAnn Hackos [mailto:joann.hackos@comtech-serv.com]
Sent: Friday, September 30, 2005 11:22 AM
To: Grosso, Paul; dita@lists.oasis-open.org
Subject: RE: [dita] Groups - DITA 1.1 Issue #45: Add See, See Also
indexing elements (IssueNumber45.html) uploaded

Hi Paul and Chris,
I have been writing this the last few days. I was going to add more but
I'll send it out now.


Indexing use case:

Basic index structure:

Primary term
	Secondary term
		Tertiary term

A basic index requires three levels of terms: Primary, secondary,
tertiary. Some indexes may consist of more than three levels but that is
not recommended as a best practice.

"see" index structure:

A "see" index reference is designed to refer the reader to the
controlled vocabulary term used in the text. Typically the index term is
a synonym or is otherwise equivalent to the controlled vocabulary term.
The index typically does not list a page number for the synonym but
refers the reader to the controlled term for the correct page number.

"see also" index structure:

A "see also" index reference is designed to suggest an additional
controlled term in relationship to the target controlled term. The
target controlled term does include a page reference. Typically you
don't mix see and see also structures. The see also reference should
occur with the target index term rather than with the synonym.

"page range" index structure:

Indexers use page ranges to indicate that an important, high-level topic
is covered over a number of pages. Page ranges are applicable to books
rather than HTML or help systems that refer to topics rather than
sections of books. This requirements may be difficult to implement
through a range of topics in a map or a bookmap.

"index sort":
Index sort sequences may vary and cause problems with translations. Many
indexes in languages other than English tend to be incorrectly sorted
because of characters that do not occur in English. The tendency is to
misplace these characters at the end of the sort rather than where they
belong in the minds of the readers of the target language.

"index term linking"

We can look at this in two ways, which reflect best practices in some of
the more sophisticated index tools. First, as an index is being edited,
a best practice is to link the index term and a single page number back
to the actual index term embedded in the text. The reason is to find and
correct the index term (spelling, change level, etc). Second, for an
automated index, you want to be able to go from the index term in the
final rendering to the page in which the indexed content occurs. In help
indexes, the index items go to the topic level but in PDF indexes, the
link should go to the paragraph level or as close to the actual index
term placement as possible.

JoAnn T. Hackos, PhD
Comtech Services, Inc.
710 Kipling Street, Suite 400
Denver CO 80215

-----Original Message-----
From: Grosso, Paul [mailto:pgrosso@ptc.com] 
Sent: Wednesday, September 28, 2005 12:08 PM
To: dita@lists.oasis-open.org
Subject: RE: [dita] Groups - DITA 1.1 Issue #45: Add See, See Also
indexing elements (IssueNumber45.html) uploaded

> -----Original Message-----
> From: Chris Wong [mailto:cwong@idiominc.com] 
> Sent: Wednesday, 2005 September 28 11:15
> To: dita@lists.oasis-open.org
> Subject: RE: [dita] Groups - DITA 1.1 Issue #45: Add See, See 
> Also indexing elements (IssueNumber45.html) uploaded
> I'm kind of surprised to see no questions or objections so 
> far to this proposal. I hear that people can have strong 
> opinions about this subject. I'd like to see any debate get 
> underway so we will have time to move this issue forward. Anyone?
> Download Document:  

There is something about indexterm (irrespective of
this current proposal) that has always concerned me:
its mixed content model.  Is something like:

<indexterm>Top level
  index term content.

allowed (the DTD allows it)?  If so, what are the 
processing expectations?

Also, what are the processing expectations of

<indexterm>Top level
  <indexterm>Nested 1</indexterm>
  <indexterm>Nested 2</indexterm>
</indexterm>

(the DTD allows this too)?

More on this particular proposal

What is the suggested content model now for indexterm?
Indexterm already had a mixed content model, but now it
seems even "more mixed" (if such is possible).  Can one
have #PCDATA following <index-sort-as>...</index-sort-as>? 
If there is going to be an index-sort-as, will it always
be the first child element of the indexterm element?

Is one limited to at most one index-see or index-see-also?
If one has an index-see, can one have an index-see-also?
Is the semantic that if one has an index-see, one doesn't
show the page number on the parent indexterm, but otherwise
one does?

We currently have the following content model:

<!ELEMENT indexterm     (%words.cnt;|%indexterm;)*    >

I'm guessing we might want a content model something like:

<!ELEMENT indexterm     ((%words.cnt;)?,
         (index-see | index-see-also+)? , indexterm?) >

except you can't do that in XML, so we're probably going
to have to allow just a big mash of text and tags, and
write "application semantics" that say it's only dita-valid
if it matches the above non-XML content model.  Regardless,
the proposal needs to describe what is valid input and how
to handle all possible input.
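
Such "application semantics" could be enforced by a check that runs after DTD validation. The sketch below is one possible reading of the guessed model, not a definitive rule set: the function name is hypothetical, and as a simplifying assumption %words.cnt; is treated as plain text, so any child element other than index-see, index-see-also, or indexterm is rejected.

```python
import xml.etree.ElementTree as ET


def matches_model(term):
    """Check one <indexterm> against the guessed model, recursively.

    Simplification (an assumption): %words.cnt; is treated as plain text,
    so any other child element is rejected.
    """
    children = list(term)
    # Text may appear only before the first child element.
    if any((child.tail or "").strip() for child in children):
        return False
    tags = [child.tag for child in children]
    nested = None
    if tags and tags[-1] == "indexterm":
        nested = children[-1]
        tags = tags[:-1]
    # What remains must be empty, one index-see, or only index-see-also.
    ok = (
        tags == []
        or tags == ["index-see"]
        or all(tag == "index-see-also" for tag in tags)
    )
    return ok and (nested is None or matches_model(nested))


good = ET.fromstring(
    "<indexterm>Goldfish<index-see-also>Carp</index-see-also>"
    "<indexterm>feeding</indexterm></indexterm>"
)
bad = ET.fromstring(
    "<indexterm>Top level<indexterm>Nested 1</indexterm>"
    "<indexterm>Nested 2</indexterm></indexterm>"
)
print(matches_model(good))  # True
print(matches_model(bad))   # False
```

The second example is the two-sibling case asked about earlier; under this reading it fails because the model permits at most one nested indexterm.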

The entire discussion of "linking to other indexterms"
confuses me.  I don't see any linking to indexterms.
There are just indexterms scattered throughout the content,
and when the index is automatically generated, entries
therein pick up the appropriate page numbers and possibly
link to the point in the result where the indexterm element
was found, but there are no links to the indexterms.
Perhaps it's just the wording that confuses me, but it
makes no sense to me to say, for example:

  ...the reference to "Goldfish feeding" points to a
  nested indexterm.  We need to define an identifier
  that a redirection element such as index-see can use
  to point to something yet to be generated. //I don't understand this
either, JoAnn//

Page ranges make me nervous.  They are difficult
to implement correctly, and they are easy to use
incorrectly.  Especially given that <index-range-start/>
and <index-range-end/> are unpaired singleton tags,
it's easy for a user to use them in ways that aren't
going to be valid.

I'm not sure what user requirement is being addressed
by ranges.  Is it just to be able to get something like
46-49 in the index, or is it to allow a user to just
indicate a startpoint and endpoint in the source without
having to insert individual indexterm elements on each page?
The former is just an implementation issue and shouldn't
drive our markup, but I can see the point of the latter.
But we do have to ask, then, if the benefit of this is
enough to offset the problems.
//this second case is not one I've ever seen in any markup for indexes,
