Everyone,
Please review the discussion that has occurred around indexing. We need to
make this the major topic of next week’s discussion. Everyone seems to be
confused about the concerns of indexing on translation.
JoAnn
JoAnn T. Hackos, PhD
President
Comtech Services, Inc.
710 Kipling Street, Suite 400
Denver, CO 80215
303-232-7586
joann.hackos@comtech-serv.com
joannhackos Skype
www.comtech-serv.com
From: Erik Hennum [mailto:ehennum@us.ibm.com]
Sent: Tuesday, July 18, 2006 9:07 AM
To: JoAnn Hackos
Cc: Chris Wong; dita@lists.oasis-open.org; Grosso, Paul
Subject: Re: FW: FW: [dita] indexing question
Hi, JoAnn and Bruce:
Regardless of whether an index entry is a point (as Chris suggests) or a span
(an alternative view), an index entry clearly should never have an impact on
flow. An index entry is an annotation on the content much like a metadata property.
So, I would disagree with the recommendation to treat the index entry as an
inline if there is any implication of affecting the layout or the parsing of
text. An index entry could appear in the middle of a word -- it shouldn't make
any difference in the processing.
Regarding interpretation of an index entry based on its container, that applies
to an index entry in the prolog. It should be interpreted based on the topic
(which is the effective container of everything in the prolog). In particular,
for an index entry in the prolog, the index is either a point attached to the
start of the topic or a span covering the entire topic.
Even if the processing of index entries _were_ different in different contexts,
I don't see that this would necessarily require a different element name so
long as the processing conforms to expectations.
On the range questions, we should keep in mind concerns about topic reuse. If
we embed index start and end entries in different topics, we run the risk of
breaking the range when the start and end topics are reused independently.
That's part of the rationale for putting ranges in the map.
Thanks,
Erik Hennum
ehennum@us.ibm.com
From: "JoAnn Hackos" <joann.hackos@comtech-serv.com>
Sent: 07/18/2006 07:24 AM
To: <dita@lists.oasis-open.org>
Cc: "Grosso, Paul" <pgrosso@ptc.com>, "Chris Wong" <cwong@idiominc.com>
Subject: FW: FW: [dita] indexing question
Here is the suggestion I received from Bruce in response to
Rodolfo’s clarification. I’m not certain that everyone has seen it.
JoAnn T. Hackos, PhD
President
Comtech Services, Inc.
710 Kipling Street, Suite 400
Denver, CO 80215
303-232-7586
joann.hackos@comtech-serv.com
joannhackos Skype
www.comtech-serv.com
From: Esrig, Bruce (Bruce) [mailto:esrig@lucent.com]
Sent: Monday, July 17, 2006 1:23 AM
To: JoAnn Hackos
Subject: RE: FW: [dita] indexing question
Hi JoAnn,
Every two weeks I have a meeting at 10 Eastern that runs for an hour or two. This is
one of those weeks, and it could be a long meeting. So I suspect I won't be
able to attend the translation SC meeting this week.
1. Thanks to Rodolfo for explaining "breaking" and the related terms. Is
he willing to have his message posted in part or whole to the main DITA list?
May I distribute it within Lucent?
2. I agree that treating <indexterm> as a subflow in the prolog and as an
inline elsewhere is best among the alternatives presented. Would the
translation SC be opposed to specifying that <indexterm> is filtered on the
way to/from TM in order to distinguish an <indexterm> that is to be
treated as a subflow from an <indexterm> that is to be treated as an
inline? This could be done by creating up to two artificial elements
<indextermsubflow> and <indexterminline> that are used only in the
TM processing.
If it were possible to distinguish between subflow and inline uses of indexterm, then
DITA could also offer the following enhancement: add an attribute to suppress
printing in inline contexts, such as <indexterm print="no">.
This takes advantage of the ability to distinguish between a subflow and an
inline. If print="no" is specified, then in an inline context, the
<indexterm> would be treated as a subflow.
3. The translation SC might wish (especially if the filtering proposal is not
feasible) to recommend a special element <indextermprolog>. The default
treatment of <indexterm> would be as an inline, but
<indextermprolog> would be treated as a subflow. Since DITA 1.1 is
expected to be backward compatible, <indextermprolog> could be an
optional alternative to <indexterm> in prolog contexts in DITA 1.1.
Subsequently, <indextermprolog> could become the standard element for use
in prolog contexts. This approach would still leave room for the print="no"
enhancement.
4. I'm delighted that Rodolfo separates out the issue of multiple-topic ranges. If
needed, the translation SC could still discuss for approval or disapproval the
point of view that ... those groups that want to support index ranges that span
multiple processes will have to take responsibility for ensuring that their
translation processes support it. For example, such groups could extract their
index range data in advance, translate it in advance, and submit the translated
data with the main body of material to be translated.
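The filtering idea in item 2 could be sketched roughly as follows. This is only an illustration, not part of any DITA tool: the function names to_tm and from_tm are hypothetical, the element names <indextermsubflow> and <indexterminline> are the artificial ones proposed above, and the sample topic places <indexterm> directly under <prolog> for brevity.

```python
import xml.etree.ElementTree as ET

# Sample topic, simplified: <indexterm> sits directly under <prolog>.
TOPIC = (
    "<topic><prolog><indexterm>term one</indexterm></prolog>"
    "<body><p>Text with <indexterm>term two</indexterm> inside.</p></body></topic>"
)


def to_tm(root):
    """Rename <indexterm> on the way to TM, depending on its context."""
    def walk(elem, in_prolog):
        for child in elem:
            if child.tag == "indexterm":
                # Artificial names, used only during TM processing.
                child.tag = "indextermsubflow" if in_prolog else "indexterminline"
            walk(child, in_prolog or child.tag == "prolog")

    walk(root, root.tag == "prolog")
    return root


def from_tm(root):
    """Restore the original element name on the way back from TM."""
    for elem in root.iter():
        if elem.tag in ("indextermsubflow", "indexterminline"):
            elem.tag = "indexterm"
    return root


original = ET.tostring(ET.fromstring(TOPIC), encoding="unicode")
filtered = to_tm(ET.fromstring(TOPIC))
filtered_tags = [elem.tag for elem in filtered.iter()]
restored = ET.tostring(from_tm(filtered), encoding="unicode")
```

Because the renaming is reversed after translation, the artificial elements would never need to appear in the DITA DTDs themselves.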
Best wishes,
Bruce Esrig
-----Original Message-----
From: JoAnn Hackos [mailto:joann.hackos@comtech-serv.com]
Sent: Sunday, July 16, 2006 10:49 PM
To: esrig@lucent.com
Subject: FW: FW: [dita] indexing question
Bruce,
I think you'll find Rodolfo's email clarifies the action of the translation tools. We're
meeting on this tomorrow. Please send me comments if you cannot attend.
JoAnn
JoAnn T. Hackos, PhD
President
Comtech Services, Inc.
710 Kipling Street, Suite 400
Denver, CO 80215
303-232-7586
joann.hackos@comtech-serv.com
From: Rodolfo M. Raya [mailto:rodolfo@heartsome.net]
Sent: Sunday, July 16, 2006 7:41 AM
To: JoAnn Hackos
Cc: Andrzej Zydron
Subject: Re: FW: [dita] indexing question
On Sat, 2006-07-15 at 15:33 -0600,
JoAnn Hackos wrote:
I thought I would forward one of the
recent threads regarding indexing in the TC. Do you see any potential problems
with regard to translation in these proposals? Please let me know more about
the issue with index terms in the prolog pointing to the topic and index terms
in block elements. I believe you referred to the first instance as a breaking
element and the second as a subflow. Is it possible to get a definition and
example of a breaking element and a subflow?
Hi Joann,
Let me start with an explanation of element types (classified from a translation
tool's point of view). Consider this XML fragment:
<table>
<row>
<col>
<p>Segment one.</p>
</col>
<col>
<p>Segment two. Second sentence.</p>
<p>Segment with <b font="Times">bold</b> text.</p>
<p>Segment with <footnote>some comment</footnote> footnote.</p>
</col>
<col>
</col>
</row>
</table>
Six segments can be extracted for translation from the example:
1. Segment one.
2. Segment two.
3. Second sentence.
4. Segment with «1»bold«2» text.
5. Segment with «1» footnote.
6. some comment
We can classify the elements present in the fragment as:

Breaking (example: <p>)
    Elements that contain text fragments that should be analysed as a unit.
    A new segment should be created whenever this kind of element is found at
    text extraction time. In CAT tools maker jargon, it "breaks" the segment
    being processed and starts a new one.

Inline (example: <b>)
    Elements that delimit text fragments that should be analysed as part of
    the text from the parent element. These elements usually delimit changes
    in style.

Subflow (example: <footnote>)
    Elements that contain text fragments that should be analysed separately.
    Processing of the enclosing segment does not end. The element is replaced
    by a marker in the text and its processing is delayed.

Ignorable (examples: <table>, <row>, <col>)
    Elements that are not supposed to contain translatable text and can be
    discarded, except when they appear as children of breaking elements, in
    which case they should be regarded as "inline".

In the example given above, the element <p> is considered a
"breaking" element because it encloses text that should be extracted
as a unit. Notice that a <p> element may contain several sentences and
require additional processing based on grammar rules that are independent from
XML markup (see items 2. and 3. in the list of segments).
The element <b> is considered "inline" because it does not
contain text that needs to be translated on its own. The text from this element
is supposed to belong to a bigger fragment. The XML markup of inline elements
is irrelevant to translators and it is replaced by "tags" in the
extracted text («1» and «2» in item 4. of list
of segments).
The element <footnote> contains text that can be considered a translation
unit on its own. Its content is related to the enclosing text, but it isn't
part of the enclosing text. At extraction time the content of a
"subflow" element is placed in its own segment and a "tag"
is added in the segment that contains the original context to mark the location
of the material that has been separated (see items 5 and 6 from the list).
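The extraction described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not how any particular CAT tool works: the function names are hypothetical, the classification sets cover only the sample fragment, and the grammar-based sentence splitting is left out, so items 2 and 3 of the list stay together as one segment.

```python
import xml.etree.ElementTree as ET

# Hypothetical classification, valid for the sample fragment only.
BREAKING = {"p"}
SUBFLOW = {"footnote"}
# Everything else is ignorable, or inline when found inside a breaking element.

SAMPLE = """<table><row>
<col><p>Segment one.</p></col>
<col>
<p>Segment two. Second sentence.</p>
<p>Segment with <b font="Times">bold</b> text.</p>
<p>Segment with <footnote>some comment</footnote> footnote.</p>
</col>
<col>
</col>
</row></table>"""


def extract_unit(elem, segments):
    """Extract one segment from elem, then any subflows found inside it."""
    counter = 0
    deferred = []  # subflow elements, processed after the enclosing segment

    def render(node):
        nonlocal counter
        parts = [node.text or ""]
        for child in node:
            if child.tag in SUBFLOW:
                # Replace the subflow with a single marker and delay it.
                counter += 1
                parts.append(f"«{counter}»")
                deferred.append(child)
            else:
                # Inline: keep the text, wrapped in a pair of tag markers.
                counter += 1
                opening = f"«{counter}»"
                inner = render(child)
                counter += 1
                parts.append(f"{opening}{inner}«{counter}»")
            parts.append(child.tail or "")
        return "".join(parts)

    text = " ".join(render(elem).split())  # normalise whitespace
    if text:
        segments.append(text)
    for sub in deferred:
        extract_unit(sub, segments)


def extract_segments(root):
    """Walk the tree; a new segment starts at each breaking element."""
    segments = []

    def walk(elem):
        if elem.tag in BREAKING:
            extract_unit(elem, segments)
        else:
            for child in elem:
                walk(child)

    walk(root)
    return segments


segments = extract_segments(ET.fromstring(SAMPLE))
for number, segment in enumerate(segments, start=1):
    print(f"{number}. {segment}")
```

Markers are numbered per segment, which is why the footnote marker in the last paragraph is «1» again.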
When we were discussing last Monday, I initially believed that
<indexterm> was considered an "inline" element. Near the end of
the talk someone clarified that <indexterm> is a "subflow"
element.
Let me try to explain with examples what would be wrong if <indexterm> is
always treated as an "inline" element or always as a "subflow" element.
<topic>
<prolog>
<indexterm>term one</indexterm>
<indexterm>term two</indexterm>
</prolog>
<body>
<p>Paragraph that contains <indexterm>term one</indexterm>
and <indexterm>term two</indexterm> inside.</p>
</body>
</topic>
A) If <indexterm> is treated as "inline", we get
these segments after text extraction:
1. «1»term one«2»«3»term two«4»
2. Paragraph that contains «1»term one«2» and «3»term two«4» inside.
In this case, the translation of segment 1 cannot be reused for translating
segment 2.
B) If <indexterm> is considered a "subflow" element,
we will get these strings:
1. term one
2. term two
3. Paragraph that contains «1» and «2» inside.
4. term one
5. term two
In this case, translation of segment 3 becomes complicated because the sentence
lacks relevant portions.
C) If we treat <indexterm> as "subflow" or
"breaking" when it is a child of <prolog> and as
"inline" anywhere else, we get these strings:
1. term one
2. term two
3. Paragraph that contains «1»term one«2» and «3»term two«4» inside.
In this case, translations of segments 1 and 2 can be reused as terminology
entries when translating segment 3.
In my opinion, case C) is the best one. If stating that an element should be
classified differently according to context is difficult (and I guess it is),
then case A) should be considered the more reasonable alternative.
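To make case C) concrete, here is a minimal Python sketch of a context-dependent classification. It is an assumption-laden illustration only: the helper names classify, render and extract are hypothetical, and the rule simply checks whether the parent of <indexterm> is <prolog>, as in the sample topic above.

```python
import xml.etree.ElementTree as ET

TOPIC = """<topic>
<prolog>
<indexterm>term one</indexterm>
<indexterm>term two</indexterm>
</prolog>
<body>
<p>Paragraph that contains <indexterm>term one</indexterm>
and <indexterm>term two</indexterm> inside.</p>
</body>
</topic>"""


def classify(tag, parent_tag):
    """Classify an element; <indexterm> depends on its parent (case C)."""
    if tag == "indexterm":
        return "subflow" if parent_tag == "prolog" else "inline"
    if tag == "p":
        return "breaking"
    return "ignorable"


def render(elem):
    """Build one segment; every child is treated as inline here."""
    counter = 0
    parts = [elem.text or ""]
    for child in elem:
        inner = "".join(child.itertext())
        parts.append(f"«{counter + 1}»{inner}«{counter + 2}»")
        counter += 2
        parts.append(child.tail or "")
    return " ".join("".join(parts).split())  # normalise whitespace


def extract(root):
    """Collect segments, starting one at each breaking or subflow element."""
    segments = []

    def walk(elem, parent_tag):
        if classify(elem.tag, parent_tag) in ("breaking", "subflow"):
            segments.append(render(elem))
        else:
            for child in elem:
                walk(child, elem.tag)

    walk(root, None)
    return segments


case_c = extract(ET.fromstring(TOPIC))
for number, segment in enumerate(case_c, start=1):
    print(f"{number}. {segment}")
```

The only context the rule needs is the parent element name, which suggests the difficulty lies more in how a tool's filter configuration expresses context than in the logic itself.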
Finally, the discussion of indexes and page ranges that happened on the main
DITA list is irrelevant from a translation point of view. It doesn't matter if an
index covers one or more topics/pages.
Best regards,
Rodolfo
--
The information in this e-mail is intended strictly for the addressee,
without prejudices, as a confidential document. Should it reach you, not
being the addressee, it is not to be made accessible to any other
unauthorised person or copied, distributed or disclosed to any other
third party as this would constitute an unlawful act under certain
circumstances, unless prior approval is given for its transmission. The
content of this e-mail is solely that of the sender and not necessarily
that of Heartsome.