Re: [dita] Question about how to define equivalent index entries

Graydon Saunders | Publishing Solutions Developer | Precision Content
Direct: +1 (647)265-8500 x106| Email: graydon@precisioncontent.com | www.precisioncontent.com

I agree that we have to allow processors to do whatever they want with index processing, but I think it's also reasonable to specify an expected set of behaviors for "normal" index processing.

I would have those include:

- Index terms are compared by normalizing white space and preserving case.
- The use of <sort-as> does not affect the merging of index entries that have the same sort-as value, because the sort-as value should be *combined* with the base index term text to construct the complete sort key.
- Two index terms with the same base text and different sort-as values must be treated as an error, and the processor can recover as it chooses. It cannot be sensible to present the same base term in two different places in the index, so it must be author error: either they used the wrong sort key or they meant to use a different base term with an index-see.
- The merging of index entries where one entry is only a primary term and others are the primary term with secondary terms is processor dependent, and processors should be encouraged to provide options for how to handle this case: separate entries, always merge, or report a warning. Which behavior you want is an editorial choice.
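
To make the first three rules concrete, here is a minimal Python sketch. Everything in it is hypothetical and for illustration only: the names normalize, merge_entries, and IndexConflict, and the (term text, sort-as, target) tuple layout, do not come from any DITA processor. It also assumes that an entry with no sort-as is compatible with an entry that has one (the rules above do not say), and it uses plain string ordering where a real processor would use locale-aware collation.

```python
import re


class IndexConflict(ValueError):
    """The same base term text was given two different sort-as values."""


def normalize(text):
    # Collapse runs of white space and trim both ends; case is preserved.
    return re.sub(r"\s+", " ", text).strip()


def merge_entries(entries):
    """Merge (term_text, sort_as_or_None, target) tuples.

    Returns a list of (sort_key, term, targets) ordered by sort_key.
    """
    merged = {}  # normalized term -> [sort_as or None, [targets]]
    for text, sort_as, target in entries:
        term = normalize(text)
        if sort_as is not None:
            sort_as = normalize(sort_as)
        slot = merged.setdefault(term, [sort_as, []])
        if sort_as is not None:
            if slot[0] is None:
                # Assumption: an earlier entry without sort-as adopts this one.
                slot[0] = sort_as
            elif slot[0] != sort_as:
                # Same base text, different sort-as: treated as author error.
                raise IndexConflict(f"{term!r}: {slot[0]!r} vs {sort_as!r}")
        slot[1].append(target)
    # The sort key combines the sort-as value (if any) with the base term text.
    rows = [((s or "") + t, t, targets) for t, (s, targets) in merged.items()]
    return sorted(rows, key=lambda row: row[0])
```

With this sketch, "oops" and "  oops " merge into one entry, "Oops" stays separate, and "data" with sort-as values "data" and "datum" raises IndexConflict; a processor could just as well log a warning and pick one value.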

In addition, there is the question of whether primary entries that also have secondary entries should be given page numbers. This is again an editorial choice: it can be controlled by authoring practice (never have primary-only entries for a term that also has secondary entries) or enforced by the processor, with exceptions reported as warnings. As an example, the index in Mike Kay's XSLT book has page numbers for primary entries that also have secondary entries, but the index in the SGML Handbook does not.
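
As a rough illustration of the "enforced by the processor with warnings" option, here is a small Python sketch. The function name check_primary_refs, the allow_primary_refs switch, and the (primary, secondary, target) tuple layout are all invented for this example and do not reflect any existing processor.

```python
def check_primary_refs(entries, allow_primary_refs=True):
    """Group (primary, secondary_or_None, target) tuples into an index.

    Returns (index, warnings). index maps each primary term to
    {"targets": [...], "secondary": {secondary: [targets]}}.
    """
    index = {}
    for primary, secondary, target in entries:
        node = index.setdefault(primary, {"targets": [], "secondary": {}})
        if secondary is None:
            # Primary-only entry: the reference hangs on the primary term.
            node["targets"].append(target)
        else:
            node["secondary"].setdefault(secondary, []).append(target)
    warnings = []
    if not allow_primary_refs:
        # Editorial policy: a primary term that has secondary entries
        # should not carry page references of its own.
        for primary, node in index.items():
            if node["targets"] and node["secondary"]:
                warnings.append(
                    f"{primary!r} has page references and secondary entries"
                )
    return index, warnings
```

With allow_primary_refs=True the sketch merges silently, which matches the XSLT-book style; with False it still merges but reports each primary term that mixes the two, leaving the fix to the author.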

Cheers,

E.

--
Eliot Kimber
http://contrext.com

On 8/12/19, 2:08 PM, "Robert D Anderson" <dita@lists.oasis-open.org on behalf of robander@us.ibm.com> wrote:

    Eliot raised a point that I think needs wider TC input during his review of the DITA 2.0 indexing content.

    Our examples of index entries show how one primary term with two secondary entries is considered equivalent to the same primary term defined twice (once with each secondary term). See figure 2 for the same example in our DITA 1.3 spec:
    http://docs.oasis-open.org/dita/dita/v1.3/errata02/os/complete/part1-base/langRef/base/indexterm.html#indexterm

    Eliot pointed out that this reflects some assumptions about how processors must merge index entries, but those rules are never stated. So the question: how precise should the spec be about merging terms?

    I ask because in the end, this is really all about rendering -- and processors are free to render an index in all sorts of ways. For example, if I have <indexterm>oops</indexterm> in fifteen topics, I would expect most processors to render that as one index term with 15 links. That said, it would technically be valid for a processor to have fifteen entries for "oops". I don't think we can or should forbid that.

    With that in mind - how precise should the specification be when it comes to merging index terms?

    For example - how many of these should be rules in the spec? How many should be addressed but explicitly left up to implementations? How many should not be addressed at all?

    * Are "oops" and "Oops" equivalent? I would think not, so we can probably say that case sensitivity is important.
    * What if one has a leading or trailing space, and the other does not - is that significant?
    * What if the text content is the same, but one has non-indexterm sub-elements? For example:
    <indexterm>This is odd</indexterm>
    and
    <indexterm>This is <em>odd</em></indexterm>
    * What if one has a secondary term in the middle, and another has it at the end? For example, should we explicitly state that these primary terms are equivalent?
    <indexterm>This is <indexterm>secondary</indexterm> interesting</indexterm>
    and
    <indexterm>This is interesting<indexterm>secondary</indexterm></indexterm>
    * What if two terms have the same sort key? For example, would these all match?
    <indexterm>data</indexterm>
    <indexterm>data<sort-as>data</sort-as></indexterm>
    <indexterm>Data<sort-as>data</sort-as></indexterm>


    I'm sure there are a lot more edge cases, so that list above is really just to give a taste of the different things we might have to get into if we are exhaustive about "matching".
    Robert D. Anderson
    DITA-OT <https://dita-ot.org/> lead and Co-editor DITA 1.3 specification
    Marketing Services Center
    E-mail: robander@us.ibm.com

    11501 BURNET RD,, TX, 78758-3400, AUSTIN, USA




