OASIS Mailing List Archives

 



dita message



Subject: Re: [dita] Question about how to define equivalent index entries


- Two index terms with the same base text and different sort-as values must be an error and the processor can recover as it chooses. That is, it cannot be sensible to have the same base term presented in two different places in the index, so it must be author error. Either they used the incorrect sort key or they mean to use a different base term with an index-see

I'm not sure this is obviously the case.

Consider "scope" as an index-term base text, where (depending on context) it can be a reference to "oscilloscope", "endoscope", "microscope", or "project scope".  It might be better to always write the full term out in the base text in that case, but I don't think we can hope to require that.  A sufficiently large content delivery with a unified index might wish to be able to split out the various specific meanings of a short base text using the sort-as value.

Also consider translation, where the sort-as value retains the authoring language term but the translation won't necessarily keep two authoring language terms distinct.

A translated-to-Simplified-Chinese base text of 羊 could retain a sort-as value of "sheep" or "goat", depending; a text authored in Swedish could retain sort-as values of "faster" and "moster" while the in-English-translation index-term base text says "aunt" in both cases.

These are artificial examples, but this is the sort of issue which does come up in localization; what are two distinct words in the authoring language may be a single word in the target language, and the indexing has to cope somehow.
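
To make the alternative concrete, here is a minimal Python sketch of an index builder that treats the (sort-as, base text) pair as the identity of an entry, so the same base text can legitimately appear in more than one place. The function name `build_index`, the tuple layout, and the topic identifiers are all hypothetical; nothing here is taken from the DITA specification or from any existing processor.

```python
def build_index(entries):
    """Group index entries by (sort value, base text).

    entries: iterable of (base_text, sort_as_or_None, location).
    An entry without a sort-as value sorts under its own base text
    (an assumption made for this sketch).
    """
    groups = {}
    for base, sort_as, location in entries:
        key = (sort_as or base, base)
        groups.setdefault(key, []).append(location)
    # Order by the sort value first, then by the displayed base text.
    return [(base, sort_value, locations)
            for (sort_value, base), locations in sorted(groups.items())]


# Hypothetical English translation of a Swedish source: both terms display
# as "aunt" but keep their authoring-language sort-as values.
translated = [
    ("aunt", "moster", "topic-3"),
    ("aunt", "faster", "topic-1"),
    ("aunt", "faster", "topic-2"),
    ("uncle", None, "topic-4"),
]
for base, sort_value, locations in build_index(translated):
    print(base, sort_value, locations)
```

Under this policy "aunt" appears twice, once per sort-as value, rather than being reported as an error. Whether that is the desired rendering is the editorial question under discussion.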

Graydon Saunders | Publishing Solutions Developer | Precision Content
Direct: +1 (647) 265-8500 x106 | Email: graydon@precisioncontent.com | www.precisioncontent.com

 


 



From: dita@lists.oasis-open.org <dita@lists.oasis-open.org> on behalf of Eliot Kimber <ekimber@contrext.com>
Sent: 13 August 2019 09:52
To: Robert D Anderson <robander@us.ibm.com>; dita@lists.oasis-open.org <dita@lists.oasis-open.org>
Subject: Re: [dita] Question about how to define equivalent index entries
 
I agree that we have to allow processors to do whatever they want with index processing, but I think it's also reasonable to specify an expected set of behaviors for "normal" index processing.

I would have those include:

- Index terms are compared by normalizing white space and preserving case
- The use of <sort-as> does not affect the merging of index entries that have the same sort-as value. This is because the sort-as value should be *combined* with the base index term text to construct the complete sort key.
- Two index terms with the same base text and different sort-as values must be an error and the processor can recover as it chooses. That is, it cannot be sensible to have the same base term presented in two different places in the index, so it must be author error. Either they used the incorrect sort key or they mean to use a different base term with an index-see
- The merging of index entries where one entry is only a primary term and others are the primary term with secondary terms is processor dependent and processors should be encouraged to provide options for how to handle this case: separate entries, always merge, report a warning. Which behavior you want is an editorial choice.
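
As a rough illustration of the first three behaviors, the following Python sketch normalizes white space, preserves case, builds a sort key from the sort-as value plus the base text, and reports base terms that occur with more than one sort-as value. All names (`normalize`, `sort_key`, `merge_entries`) are hypothetical, and representing the combined key as a (sort value, base text) tuple is an assumption of this sketch, not something the spec prescribes.

```python
from collections import defaultdict


def normalize(text):
    """Collapse runs of white space and trim the ends; case is preserved."""
    return " ".join(text.split())


def sort_key(base, sort_as=None):
    """Combine the sort-as value with the base text into one sort key.

    Assumption: an entry without sort-as sorts under its own base text.
    """
    base = normalize(base)
    return (normalize(sort_as) if sort_as else base, base)


def merge_entries(entries):
    """Merge entries that share a sort key.

    entries: iterable of (base_text, sort_as_or_None, location).
    Returns (merged, conflicts): merged maps each sort key to its locations;
    conflicts lists base terms seen with more than one sort value, which the
    processor would report and then recover from as it chooses.
    """
    merged = defaultdict(list)
    sort_values = defaultdict(set)
    for base, sort_as, location in entries:
        key = sort_key(base, sort_as)
        merged[key].append(location)
        sort_values[key[1]].add(key[0])
    conflicts = sorted(b for b, values in sort_values.items() if len(values) > 1)
    return dict(merged), conflicts
```

With this sketch, `data` and `data` with a sort-as of `data` merge into one entry, `Data` with a sort-as of `data` stays separate because case is preserved, and a base term such as "scope" carrying two different sort-as values ends up in the conflict list.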

In addition, there is the question of whether or not primary entries with secondary entries should be given page numbers. This is again an editorial choice that can be controlled either by authoring practice (never have primary-only entries for a term that also has secondary entries) or can be enforced by the processor with exceptions reported as warnings. As an example, Mike Kay's XSLT book's index has page numbers for primary entries that also have secondary entries but the SGML Handbook does not.

Cheers,

E.

--
Eliot Kimber
http://contrext.com
 

On 8/12/19, 2:08 PM, "Robert D Anderson" <dita@lists.oasis-open.org on behalf of robander@us.ibm.com> wrote:

    Eliot raised a point that I think needs wider TC input during his review of the DITA 2.0 indexing content.
   
    Our examples of index entries show how one primary term with two secondary entries is considered equivalent to the same primary term defined twice (once with each secondary term). See figure 2 for the same example in our DITA 1.3 spec:
    http://docs.oasis-open.org/dita/dita/v1.3/errata02/os/complete/part1-base/langRef/base/indexterm.html#indexterm
   
    Eliot pointed out that this reflects some assumptions about how processors must merge index entries, but those rules are never stated. So the question: how precise should the spec be about merging terms?
   
    I ask because in the end, this is really all about rendering -- and processors are free to render an index in all sorts of ways. For example, if I have <indexterm>oops</indexterm> in fifteen topics, I would expect most processors to render that as one index term with 15 links. That said, it would technically be valid for a processor to have fifteen entries for "oops". I don't think we can or should forbid that.
   
    With that in mind - how precise should the specification be when it comes to merging index terms?
   
    For example - how many of these should be rules in the spec? How many should be addressed but explicitly left up to implementations? How many should not be addressed at all?
   
    * Are "oops" and "Oops" equivalent? I would think not, so we can probably say that case sensitivity is important.
    * What if one has a leading or trailing space, and the other does not - is that significant?
    * What if the text content is the same, but one has non-indexterm sub-elements? For example:
    <indexterm>This is odd</indexterm>
    and
    <indexterm>This is <em>odd</em></indexterm>
    * What if one has a secondary term in the middle, and another has it at the end? For example, should we explicitly state that these primary terms are equivalent?
    <indexterm>This is <indexterm>secondary</indexterm> interesting</indexterm>
    and
    <indexterm>This is interesting<indexterm>secondary</indexterm></indexterm>
    * What if two terms have the same sort key? For example, would these all match?
    <indexterm>data</indexterm>
    <indexterm>data<sort-as>data</sort-as></indexterm>
    <indexterm>Data<sort-as>data</sort-as></indexterm>
   
   
    I'm sure there are a lot more edge cases, so that list above is really just to give a taste of the different things we might have to get into if we are exhaustive about "matching".
    Robert D. Anderson
    DITA-OT <https://dita-ot.org/> lead and Co-editor DITA 1.3 specification
    Marketing Services Center
    E-mail: robander@us.ibm.com
   
    11501 BURNET RD,, TX, 78758-3400, AUSTIN, USA
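
Several of the edge cases in Robert's list (inline sub-elements, a secondary term in the middle versus at the end, a trailing <sort-as>) come down to how a processor extracts the primary term's text before comparing. The Python sketch below shows one possible policy, using the standard library's `xml.etree.ElementTree`: keep the text of inline elements such as <em>, skip nested <indexterm> and <sort-as>, and normalize white space. The function name `primary_text` is hypothetical, and treating these terms as equivalent is an assumption of this sketch, not a rule stated in the spec.

```python
import xml.etree.ElementTree as ET


def primary_text(indexterm_xml):
    """Return the primary term text of one <indexterm> element.

    Nested <indexterm> and <sort-as> content is skipped; text inside other
    inline elements (for example <em>) is kept. White space is normalized
    and case is preserved.
    """
    root = ET.fromstring(indexterm_xml)
    parts = []

    def walk(element, is_root=False):
        # Skip secondary terms and sort keys, but not the outer indexterm.
        if not is_root and element.tag in ("indexterm", "sort-as"):
            return
        parts.append(element.text or "")
        for child in element:
            walk(child)
            # Text following a child belongs to the parent term.
            parts.append(child.tail or "")

    walk(root, is_root=True)
    return " ".join("".join(parts).split())


middle = "<indexterm>This is <indexterm>secondary</indexterm> interesting</indexterm>"
end = "<indexterm>This is interesting<indexterm>secondary</indexterm></indexterm>"
print(primary_text(middle) == primary_text(end))
```

A processor that wanted <em> to be significant, or leading and trailing space to matter, would change only this extraction step; the merging logic downstream could stay the same.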
   
   
   
   


