FW: FW: FW: [dita] indexing question

Everyone,

Please review the discussion that has occurred around indexing. We need to make this the major topic of next week’s discussion. Everyone seems to be confused about how indexing affects translation.

JoAnn

JoAnn T. Hackos, PhD
President
Comtech Services, Inc.
710 Kipling Street, Suite 400
Denver, CO 80215
303-232-7586
joann.hackos@comtech-serv.com
joannhackos Skype

www.comtech-serv.com

From: Erik Hennum [mailto:ehennum@us.ibm.com]
Sent: Tuesday, July 18, 2006 9:07 AM
To: JoAnn Hackos
Cc: Chris Wong; dita@lists.oasis-open.org; Grosso, Paul
Subject: Re: FW: FW: [dita] indexing question

Hi, JoAnn and Bruce:

Regardless of whether an index entry is a point (as Chris suggests) or a span (an alternative view), an index entry clearly should never have an impact on flow. An index entry is an annotation on the content much like a metadata property.

So, I would disagree with the recommendation to treat the index entry as an inline if there is any implication of affecting the layout or the parsing of text. An index entry could appear in the middle of a word -- it shouldn't make any difference in the processing.

Regarding interpretation of an index entry based on its container, that applies to an index entry in the prolog. It should be interpreted based on the topic (which is the effective container of everything in the prolog). In particular, for an index entry in the prolog, the index is either a point attached to the start of the topic or a span covering the entire topic.

Even if the processing of index entries _were_ different in different contexts, I don't see that this would necessarily require a different element name so long as the processing conforms to expectations.

On the range questions, we should keep in mind concerns about topic reuse. If we embed index start and end entries in different topics, we run the risk of breaking the range when the start and end topics are reused independently. That's part of the rationale for putting ranges in the map.

Thanks,

Erik Hennum
ehennum@us.ibm.com

"JoAnn Hackos" <joann.hackos@comtech-serv.com>

Sent: 07/18/2006 07:24 AM
To: <dita@lists.oasis-open.org>
Cc: "Grosso, Paul" <pgrosso@ptc.com>, "Chris Wong" <cwong@idiominc.com>
Subject: FW: FW: [dita] indexing question

Here is the suggestion I received from Bruce in response to Rodolfo’s clarification. I’m not certain that everyone has seen it.

JoAnn T. Hackos, PhD

From: Esrig, Bruce (Bruce) [mailto:esrig@lucent.com]
Sent: Monday, July 17, 2006 1:23 AM
To: JoAnn Hackos
Subject: RE: FW: [dita] indexing question

Hi JoAnn,

Every two weeks I have a meeting at 10 Eastern that runs for an hour or two. This is one of those weeks, and it could be a long meeting. So I suspect I won't be able to attend the translation SC meeting this week.

1. Thanks to Rodolfo for explaining "breaking" and the related terms. Is he willing to have his message posted in part or whole to the main DITA list? May I distribute it within Lucent?

2. I agree that treating <indexterm> as a subflow in the prolog and as an inline elsewhere is best among the alternatives presented. Would the translation SC be opposed to specifying that <indexterm> is filtered on the way to/from TM in order to distinguish an <indexterm> that is to be treated as a subflow from an <indexterm> that is to be treated as an inline? This could be done by creating up to two artificial elements, <indextermsubflow> and <indexterminline>, that are used only in the TM processing.

If it were possible to distinguish between subflow and inline uses of indexterm, then DITA could also offer the following enhancement: add an attribute to suppress printing in inline contexts, such as <indexterm print="no">. This takes advantage of the ability to distinguish between a subflow and an inline. If print="no" is specified, then in an inline context, the <indexterm> would be treated as a subflow.

3. The translation SC might wish (especially if the filtering proposal is not feasible) to recommend a special element <indextermprolog>. The default treatment of <indexterm> would be as an inline, but <indextermprolog> would be treated as a subflow. Since DITA 1.1 is expected to be backward compatible, <indextermprolog> could be an optional alternative to <indexterm> in prolog contexts in DITA 1.1. Subsequently, <indextermprolog> could become the standard element for use in prolog contexts. This approach would still leave room for the print="no" enhancement.

4. I'm delighted that Rodolfo separates out the issue of multiple-topic ranges. If needed, the translation SC could still discuss for approval or disapproval the point of view that ... those groups that want to support index ranges that span multiple processes will have to take responsibility for ensuring that their translation processes support it. For example, such groups could extract their index range data in advance, translate it in advance, and submit the translated data with the main body of material to be translated.
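
The filtering idea in item 2 could be sketched roughly as follows. This is only an illustration under stated assumptions: the function names to_tm and from_tm are hypothetical, the sample topic is the simplified one used later in this thread (with <indexterm> as a direct child of <prolog>), and a real filter would have to follow the actual DITA prolog structure.

```python
import xml.etree.ElementTree as ET

# Artificial element names taken from the proposal; used only around TM processing.
SUBFLOW_TAG = "indextermsubflow"
INLINE_TAG = "indexterminline"

def to_tm(root):
    """Rename each <indexterm> according to its context before TM processing."""
    for parent in root.iter():
        for child in parent:
            if child.tag == "indexterm":
                # Assumption: prolog entries (and entries nested in them) are subflows.
                in_prolog = parent.tag in ("prolog", SUBFLOW_TAG)
                child.tag = SUBFLOW_TAG if in_prolog else INLINE_TAG

def from_tm(root):
    """Restore the original element name after TM processing."""
    for elem in root.iter():
        if elem.tag in (SUBFLOW_TAG, INLINE_TAG):
            elem.tag = "indexterm"

TOPIC = """<topic>
<prolog>
<indexterm>term one</indexterm>
<indexterm>term two</indexterm>
</prolog>
<body>
Paragraph that contains <indexterm>term one</indexterm>
and <indexterm>term two</indexterm> inside.
</body>
</topic>"""

root = ET.fromstring(TOPIC)
to_tm(root)
print([elem.tag for elem in root.iter() if elem.tag.startswith("indexterm")])
from_tm(root)
```

Because the renaming is reversed on the way back, the artificial names never need to appear in the DITA vocabulary itself.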

Best wishes,

Bruce Esrig

-----Original Message-----
From: JoAnn Hackos [mailto:joann.hackos@comtech-serv.com]
Sent: Sunday, July 16, 2006 10:49 PM
To: esrig@lucent.com
Subject: FW: FW: [dita] indexing question
Bruce,
I think you'll find Rodolfo's email clarifies the action of the translation tools. We're meeting on this tomorrow. Please send me comments if you cannot attend.
JoAnn


From: Rodolfo M. Raya [mailto:rodolfo@heartsome.net]
Sent: Sunday, July 16, 2006 7:41 AM
To: JoAnn Hackos
Cc: Andrzej Zydron
Subject: Re: FW: [dita] indexing question
On Sat, 2006-07-15 at 15:33 -0600, JoAnn Hackos wrote:

I thought I would forward one of the recent threads regarding indexing in the TC. Do you see any potential problems with regard to translation in these proposals? Please let me know more about the issue with index terms in the prolog pointing to the topic and index terms in block elements. I believe you referred to the first instance as a breaking element and the second as a subflow. Is it possible to get a definition and example of a breaking element and a subflow?

Hi Joann,

Let me start with an explanation of element types (classified from the point of view of translation tools). Consider this XML fragment:

<table>
<row>
<col>
<p>Segment one.</p>
</col>
<col>
<p>Segment two. Second sentence.</p>
<p>Segment with <b>bold</b> text.</p>
<p>Segment with <footnote>some comment</footnote> footnote.</p>
</col>
<col>
</col>
</row>
</table>

Six segments can be extracted for translation from the example:

1. Segment one.
2. Segment two.
3. Second sentence.
4. Segment with «1»bold«2» text.
5. Segment with «1» footnote.
6. some comment

We can classify the elements present in the fragment as:

Breaking: Elements that contain text fragments that should be analysed as a unit. A new segment should be created whenever this kind of element is found at text extraction time. In CAT tool makers' jargon, it "breaks" the segment being processed and starts a new one. Example: <p>
Inline: Elements that delimit text fragments that should be analysed as part of the text from the parent element. These elements usually delimit changes in style. Example: <b>
Subflow: Elements that contain text fragments that should be analysed separately. Processing of the enclosing segment does not end. The element is replaced by a marker in the text and its processing is delayed. Example: <footnote>
Ignorable: Elements that are not supposed to contain translatable text and can be discarded, except when they appear as children of breaking elements, in which case they should be regarded as "inline". Examples: <table>, <row>, <col>

In the example given above, the element <p> is considered a "breaking" element because it encloses text that should be extracted as a unit. Notice that a <p> element may contain several sentences and require additional processing based on grammar rules that are independent of XML markup (see items 2 and 3 in the list of segments).

The element <b> is considered "inline" because it does not contain text that needs to be translated on its own. The text from this element is supposed to belong to a larger fragment. The XML markup of inline elements is irrelevant to translators and is replaced by "tags" in the extracted text («1» and «2» in item 4 of the list of segments).

The element <footnote> contains text that can be considered a translation unit on its own. Its content is related to the enclosing text, but it isn't part of the enclosing text. At extraction time the content of a "subflow" element is placed in its own segment and a "tag" is added in the segment that contains the original context to mark the location of the material that has been separated (see items 5 and 6 from the list).
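
The classification above can be made concrete with a small extraction sketch. It is not how any particular CAT tool works; it assumes the paragraphs in the fragment are wrapped in <p> and the bold word in <b> (as the examples in the classification suggest), uses hypothetical helper names (render, emit, extract), and substitutes a naive punctuation-based split for real grammar-based sentence segmentation.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical classification mirroring the element types described above.
BREAKING = {"p"}
SUBFLOW = {"footnote"}
# Any other child of a breaking element (for example <b>) is treated as inline.

def render(elem, counter, deferred):
    """Flatten elem's content: inline children get paired markers,
    subflow children get a single marker and are queued for later."""
    out = elem.text or ""
    for child in elem:
        counter[0] += 1
        out += f"«{counter[0]}»"
        if child.tag in SUBFLOW:
            deferred.append(child)
        else:
            out += render(child, counter, deferred)
            counter[0] += 1
            out += f"«{counter[0]}»"
        out += child.tail or ""
    return out

def emit(elem, segments):
    """Add the segments of one breaking (or deferred subflow) element."""
    deferred = []
    text = " ".join(render(elem, [0], deferred).split())
    # Naive sentence split standing in for real grammar-based segmentation.
    segments.extend(s for s in re.split(r"(?<=[.!?])\s+", text) if s)
    for sub in deferred:
        emit(sub, segments)

def extract(root):
    """Walk the tree; ignorable containers are only descended into."""
    segments = []
    def walk(elem):
        if elem.tag in BREAKING:
            emit(elem, segments)
        else:
            for child in elem:
                walk(child)
    walk(root)
    return segments

SAMPLE = """<table><row>
<col><p>Segment one.</p></col>
<col>
<p>Segment two. Second sentence.</p>
<p>Segment with <b>bold</b> text.</p>
<p>Segment with <footnote>some comment</footnote> footnote.</p>
</col>
<col></col>
</row></table>"""

for number, segment in enumerate(extract(ET.fromstring(SAMPLE)), start=1):
    print(f"{number}. {segment}")
```

Run on the sample, the sketch yields the same six segments listed earlier, with the footnote text deferred until after the segment that carries its marker.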

When we were discussing last Monday, I initially believed that <indexterm> was considered an "inline" element. Near the end of the talk someone clarified that <indexterm> is a "subflow" element.

Let me try to explain with examples what would go wrong if <indexterm> were always treated as an "inline" element or always as a "subflow" element.

<topic>
<prolog>
<indexterm>term one</indexterm>
<indexterm>term two</indexterm>
</prolog>
<body>
Paragraph that contains <indexterm>term one</indexterm>
and <indexterm>term two</indexterm> inside.
</body>
</topic>

A) If <indexterm> is treated as "inline", we get these segments after text extraction:

1. «1»term one«2»«3»term two«4»
2. Paragraph that contains «1»term one«2» and «3»term two«4» inside.

In this case, the translation of segment 1 cannot be reused for translating segment 2.

B) If <indexterm> is considered a "subflow" element, we will get these strings:

1. term one
2. term two
3. Paragraph that contains «1» and «2» inside.
4. term one
5. term two

In this case, translation of segment 3 becomes complicated because the sentence lacks relevant portions.

C) If we treat <indexterm> as "subflow" or "breaking" when it is a child of <prolog> and as "inline" anywhere else, we get these strings:

1. term one
2. term two
3. Paragraph that contains «1»term one«2» and «3»term two«4» inside.

In this case, translations of segments 1 and 2 can be reused as terminology entries when translating segment 3.

In my opinion, case C) is the best one. If stating that an element should be classified differently according to context is difficult (and I guess it is), then case A) should be considered the more reasonable alternative.
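
A minimal sketch of the context-dependent rule in case C) follows. It is an illustration only: the function names are hypothetical, <body> is treated as a breaking element because the example places the paragraph text directly inside it, and <indexterm> is assumed to be a direct child of <prolog> as in the example.

```python
import xml.etree.ElementTree as ET

def is_breaking(elem, parent):
    """Context-dependent rule from case C (a sketch, not a DITA requirement)."""
    if elem.tag == "indexterm":
        return parent is not None and parent.tag == "prolog"
    # Assumption: <body> holds the paragraph text directly, as in the example.
    return elem.tag == "body"

def render(elem, counter):
    """Flatten elem's content, wrapping every child in paired inline markers."""
    out = elem.text or ""
    for child in elem:
        counter[0] += 1
        out += f"«{counter[0]}»" + render(child, counter)
        counter[0] += 1
        out += f"«{counter[0]}»" + (child.tail or "")
    return out

def extract(elem, parent=None, segments=None):
    """Collect one segment per breaking element, descending through the rest."""
    segments = [] if segments is None else segments
    if is_breaking(elem, parent):
        text = " ".join(render(elem, [0]).split())
        if text:
            segments.append(text)
    else:
        for child in elem:
            extract(child, elem, segments)
    return segments

TOPIC = """<topic>
<prolog>
<indexterm>term one</indexterm>
<indexterm>term two</indexterm>
</prolog>
<body>
Paragraph that contains <indexterm>term one</indexterm>
and <indexterm>term two</indexterm> inside.
</body>
</topic>"""

segments = extract(ET.fromstring(TOPIC))
# Prolog terms that reappear in the paragraph can be reused as terminology.
reusable = [term for term in segments[:2] if term in segments[2]]
for number, segment in enumerate(segments, start=1):
    print(f"{number}. {segment}")
```

The last step illustrates the reuse argument: both prolog terms occur verbatim inside the paragraph segment, so their translations are available when that segment is translated.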

Finally, the discussion of indexes and page ranges that happened in the main DITA list is irrelevant from a translation point of view. It doesn't matter if an index covers one or more topics/pages.

Best regards,
Rodolfo
