OASIS Mailing List Archives

 



dita message



Subject: FW: FW: [dita] indexing question


JoAnn was able to get a very helpful clarification from Rodolfo Raya.
 
Rodolfo concentrated on clarifying the impact on translation memory of allowing index terms in both the prolog and the content.
 
The index terms in the content are best modeled as inline information.
The index terms in the prolog are best modeled as a subflow.
 
Rodolfo takes the point of view that translation would not raise any issues with ranges across topics.
 
Best wishes,
 
Bruce Esrig
 


From: Rodolfo M. Raya [mailto:rodolfo@heartsome.net]
Sent: Sunday, July 16, 2006 7:41 AM
To: JoAnn Hackos
Cc: Andrzej Zydron
Subject: Re: FW: [dita] indexing question

On Sat, 2006-07-15 at 15:33 -0600, JoAnn Hackos wrote:
I thought I would forward one of the recent threads regarding indexing in the TC. Do you see any potential problems with regard to translation in these proposals? Please let me know more about the issue with index terms in the prolog pointing to the topic and index terms in block elements. [Esrig, Bruce (Bruce)] ... Is it possible to get a definition and example of a breaking element and a subflow?

Hi JoAnn,

Let me start with an explanation of element types, classified from a translation tool's point of view. Consider this XML fragment:

<table>
  <row>
    <col>
      <p>Segment one.</p>     
    </col>
    <col>
      <p>Segment two. Second sentence.</p>
      <p>Segment with <b font="Times">bold</b> text.</p>
      <p>Segment with <footnote>some comment</footnote> footnote.</p>
    </col>
    <col>
    </col>
  </row>
</table>


Six segments can be extracted for translation from the example:

  1. Segment one.
  2. Segment two.
  3. Second sentence.
  4. Segment with «1»bold«2» text.
  5. Segment with «1» footnote.
  6. some comment

We can classify the elements present in the fragment as:

Breaking: Elements that contain text fragments that should be analysed as a unit. A new segment should be created whenever this kind of element is found at text extraction time. In CAT tool makers' jargon, it "breaks" the segment being processed and starts a new one. Example: <p>

Inline: Elements that delimit text fragments that should be analysed as part of the text from the parent element. These elements usually delimit changes in style. Example: <b>

Subflow: Elements that contain text fragments that should be analysed separately. Processing of the enclosing segment does not end. The element is replaced by a marker in the text and its processing is delayed. Example: <footnote>

Ignorable: Elements that are not supposed to contain translatable text and can be discarded, except when they appear as children of breaking elements, in which case they should be regarded as "inline". Examples: <table>, <row>, <col>


In the example given above, the element <p> is considered a "breaking" element because it encloses text that should be extracted as a unit. Notice that a <p> element may contain several sentences and require additional processing based on grammar rules that are independent of the XML markup (see items 2 and 3 in the list of segments).

The element <b> is considered "inline" because it does not contain text that needs to be translated on its own. The text from this element is supposed to belong to a bigger fragment. The XML markup of inline elements is irrelevant to translators and it is replaced by "tags" in the extracted text («1» and «2» in item 4 of the list of segments).

The element <footnote> contains text that can be considered a translation unit on its own. Its content is related to the enclosing text, but it isn't part of the enclosing text. At extraction time the content of a "subflow" element is placed in its own segment and a "tag" is added in the segment that contains the original context to mark the location of the material that has been separated (see items 5 and 6 from the list).
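
The classification above can be sketched in a few lines of Python. This is only an illustration, not how any particular CAT tool works: the tag sets, the function names (extract_segments, split_sentences) and the naive punctuation-based sentence splitter are all assumptions made for this example. Real tools use proper segmentation rules.

```python
import re
import xml.etree.ElementTree as ET

# Assumed classification for the sample fragment only.
BREAKING = {"p"}
SUBFLOW = {"footnote"}
# <b> is inline; <table>, <row>, <col> are ignorable.

SAMPLE = """<table>
  <row>
    <col>
      <p>Segment one.</p>
    </col>
    <col>
      <p>Segment two. Second sentence.</p>
      <p>Segment with <b font="Times">bold</b> text.</p>
      <p>Segment with <footnote>some comment</footnote> footnote.</p>
    </col>
    <col>
    </col>
  </row>
</table>"""


def split_sentences(text):
    """Naive stand-in for grammar-based segmentation: split after . ! ?"""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def extract_segments(xml_text):
    segments = []

    def render(elem, counter, pending):
        # Build the text of one unit; children become numbered markers.
        parts = [elem.text or ""]
        for child in elem:
            if child.tag in SUBFLOW:
                # One marker replaces the subflow; its text is deferred.
                counter[0] += 1
                parts.append(f"«{counter[0]}»")
                pending.append(child)
            else:
                # Inline (or ignorable inside a breaking element):
                # keep the text, wrap it in an opening and closing marker.
                counter[0] += 1
                opening = f"«{counter[0]}»"
                inner = render(child, counter, pending)
                counter[0] += 1
                parts.append(f"{opening}{inner}«{counter[0]}»")
            parts.append(child.tail or "")
        return "".join(parts)

    def emit(elem):
        pending = []
        text = " ".join(render(elem, [0], pending).split())
        segments.extend(split_sentences(text))
        for sub in pending:  # deferred subflows follow their host segment
            emit(sub)

    def walk(elem):
        if elem.tag in BREAKING:
            emit(elem)
        else:  # ignorable container: descend, discard its own text
            for child in elem:
                walk(child)

    walk(ET.fromstring(xml_text))
    return segments


for number, segment in enumerate(extract_segments(SAMPLE), start=1):
    print(f"{number}. {segment}")
```

Run on the sample fragment, the sketch yields the same six segments listed earlier, in the same order.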

When we were discussing this last Monday, I initially believed that <indexterm> was considered an "inline" element. Near the end of the talk someone clarified that <indexterm> is a "subflow" element.

Let me try to explain with examples what would go wrong if <indexterm> were always treated as an "inline" element or always as a "subflow" element.

<topic>
  <prolog>
    <indexterm>term one</indexterm>
    <indexterm>term two</indexterm>
  </prolog>
  <body>
    <p>Paragraph that contains <indexterm>term one</indexterm> 
       and <indexterm>term two</indexterm> inside.</p>
  </body>
</topic>


A) If <indexterm> is treated as "inline", we get these segments after text extraction:

  1. «1»term one«2»«3»term two«4»
  2. Paragraph that contains «1»term one«2» and «3»term two«4» inside.

In this case, the translation of segment 1 cannot be reused for translating segment 2.

B) If <indexterm> is considered a "subflow" element, we will get these strings:

  1. term one
  2. term two
  3. Paragraph that contains «1» and «2» inside.
  4. term one
  5. term two

In this case, translation of segment 3 becomes complicated because the sentence lacks relevant portions.

C) If we treat <indexterm> as "subflow" or "breaking" when it is a child of <prolog> and as "inline" anywhere else, we get these strings:

  1. term one
  2. term two
  3. Paragraph that contains «1»term one«2» and «3»term two«4» inside.

In this case, translations of segments 1 and 2 can be reused as terminology entries when translating segment 3.
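
The difference between cases B) and C) comes down to whether the classifier may look at an element's parent. A minimal Python sketch of that idea follows. The function names (classify_subflow, classify_by_context, extract) are hypothetical and chosen for this example, and the code does not describe any real tool. Case A) is left out to keep the sketch short, and sentence splitting is omitted because each unit here holds a single sentence.

```python
import xml.etree.ElementTree as ET

TOPIC = """<topic>
  <prolog>
    <indexterm>term one</indexterm>
    <indexterm>term two</indexterm>
  </prolog>
  <body>
    <p>Paragraph that contains <indexterm>term one</indexterm>
       and <indexterm>term two</indexterm> inside.</p>
  </body>
</topic>"""


def classify_subflow(tag, parent_tag):
    """Case B: <indexterm> is a subflow everywhere."""
    if tag == "p":
        return "breaking"
    if tag == "indexterm":
        return "subflow"
    return "ignorable"


def classify_by_context(tag, parent_tag):
    """Case C: <indexterm> is a subflow under <prolog>, inline elsewhere."""
    if tag == "p":
        return "breaking"
    if tag == "indexterm":
        return "subflow" if parent_tag == "prolog" else "inline"
    return "ignorable"


def extract(xml_text, classify):
    segments = []

    def render(elem, counter, pending):
        parts = [elem.text or ""]
        for child in elem:
            if classify(child.tag, elem.tag) == "subflow":
                # Replace the subflow with one marker and defer its text.
                counter[0] += 1
                parts.append(f"«{counter[0]}»")
                pending.append(child)
            else:
                # Inline: keep the text between opening and closing markers.
                counter[0] += 1
                opening = f"«{counter[0]}»"
                inner = render(child, counter, pending)
                counter[0] += 1
                parts.append(f"{opening}{inner}«{counter[0]}»")
            parts.append(child.tail or "")
        return "".join(parts)

    def emit(elem):
        pending = []
        segments.append(" ".join(render(elem, [0], pending).split()))
        for sub in pending:
            emit(sub)

    def walk(elem, parent_tag):
        # Breaking elements, and subflows met outside any segment,
        # each start a segment of their own.
        if classify(elem.tag, parent_tag) in ("breaking", "subflow"):
            emit(elem)
        else:
            for child in elem:
                walk(child, elem.tag)

    walk(ET.fromstring(xml_text), None)
    return segments


for label, rule in (("B", classify_subflow), ("C", classify_by_context)):
    print(f"Case {label}:")
    for number, segment in enumerate(extract(TOPIC, rule), start=1):
        print(f"  {number}. {segment}")
```

With classify_by_context the paragraph keeps its index terms in place, so the two prolog segments can serve as terminology for it. With classify_subflow the paragraph segment is left with bare markers.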

In my opinion, case C) is the best one. If stating that an element should be classified differently according to context is difficult (and I guess it is), then case A) should be considered the more reasonable alternative.

Finally, the discussion of indexes and page ranges that took place on the main DITA list is irrelevant from a translation point of view. It doesn't matter whether an index entry covers one or more topics or pages.

Best regards,
Rodolfo

