Everyone,
Please review the discussion that has occurred around indexing. We need to make
this the major topic of next week’s discussion. Everyone seems to be confused
about the concerns that indexing raises for translation.
JoAnn
JoAnn T. Hackos, PhD
President
Comtech Services, Inc.
710 Kipling Street, Suite 400
Denver, CO 80215
303-232-7586
joann.hackos@comtech-serv.com
Skype: joannhackos
www.comtech-serv.com
From: Erik Hennum [mailto:ehennum@us.ibm.com]
Sent: Tuesday, July 18, 2006 9:07 AM
To: JoAnn Hackos
Cc: Chris Wong; dita@lists.oasis-open.org; Grosso, Paul
Subject: Re: FW: FW: [dita] indexing question
Hi, JoAnn and Bruce:
Regardless of whether an index entry is a point (as Chris suggests) or a span
(an alternative view), an index entry clearly should never have an impact on
flow. An index entry is an annotation on the content, much like a metadata
property.

So, I would disagree with the recommendation to treat the index entry as an
inline if there is any implication of affecting the layout or the parsing of
text. An index entry could appear in the middle of a word -- it shouldn't make
any difference in the processing.

Regarding interpretation of an index entry based on its container, that
applies to an index entry in the prolog. It should be interpreted based on the
topic (which is the effective container of everything in the prolog). In
particular, for an index entry in the prolog, the index entry is either a point
attached to the start of the topic or a span covering the entire topic.

Even if the processing of index entries _were_ different in different contexts,
I don't see that this would necessarily require a different element name so
long as the processing conforms to expectations.

On the range questions, we should keep in mind concerns about topic reuse. If
we embed index start and end entries in different topics, we run the risk of
breaking the range when the start and end topics are reused independently.
That's part of the rationale for putting ranges in the map.

Thanks,
Erik Hennum
ehennum@us.ibm.com

From: "JoAnn Hackos" <joann.hackos@comtech-serv.com>
Sent: 07/18/2006 07:24 AM
To: <dita@lists.oasis-open.org>
Cc: "Grosso, Paul" <pgrosso@ptc.com>, "Chris Wong" <cwong@idiominc.com>
Subject: FW: FW: [dita] indexing question
Here is the suggestion I received from
Bruce in response to Rodolfo’s clarification. I’m not certain that everyone has
seen it.
JoAnn T. Hackos,
PhD
From: Esrig, Bruce (Bruce) [mailto:esrig@lucent.com]
Sent: Monday, July 17, 2006 1:23 AM
To: JoAnn Hackos
Subject: RE: FW: [dita] indexing question

Hi JoAnn,
Every two weeks I have a meeting at 10
Eastern that runs for an hour or two. This is one of those weeks, and it could
be a long meeting. So I suspect I won't be able to attend the translation SC
meeting this week.
1. Thanks to Rodolfo for explaining "breaking" and the related terms. Is he
willing to have his message posted in part or whole to the main DITA list? May
I distribute it within Lucent?

2. I agree that treating <indexterm> as a subflow in the prolog and as an
inline elsewhere is best among the alternatives presented. Would the
translation SC be opposed to specifying that <indexterm> is filtered on the way
to/from TM in order to distinguish an <indexterm> that is to be treated as a
subflow from an <indexterm> that is to be treated as an inline? This could be
done by creating up to two artificial elements, <indextermsubflow> and
<indexterminline>, that are used only in the TM processing.

If it were possible to distinguish between subflow and inline uses of
indexterm, then DITA could also offer the following enhancement: add an
attribute to suppress printing in inline contexts, such as
<indexterm print="no">. This takes advantage of the ability to distinguish
between a subflow and an inline. If print="no" is specified, then in an inline
context, the <indexterm> would be treated as a subflow.

3. The translation SC might wish (especially if the filtering proposal is not
feasible) to recommend a special element <indextermprolog>. The default
treatment of <indexterm> would be as an inline, but <indextermprolog> would be
treated as a subflow. Since DITA 1.1 is expected to be backward compatible,
<indextermprolog> could be an optional alternative to <indexterm> in prolog
contexts in DITA 1.1. Subsequently, <indextermprolog> could become the standard
element for use in prolog contexts. This approach would still leave room for
the print="no" enhancement.

4. I'm delighted that Rodolfo separates out the issue of multiple-topic ranges.
If needed, the translation SC could still discuss for approval or disapproval
the point of view that ... those groups that want to support index ranges that
span multiple processes will have to take responsibility for ensuring that
their translation processes support it. For example, such groups could extract
their index range data in advance, translate it in advance, and submit the
translated data with the main body of material to be translated.
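
The filtering idea in point 2 could look roughly like the Python sketch below. It is only an illustration: the two element names come from the proposal above, while the function names `to_tm` and `from_tm`, the sample topic, and the rule "child of <prolog> means subflow" are assumptions made for this example, not part of DITA or of any TM tool.

```python
import xml.etree.ElementTree as ET

# Element names taken from the proposal; they would exist only during TM processing.
SUBFLOW_TAG = "indextermsubflow"
INLINE_TAG = "indexterminline"

# Hypothetical sample topic used only for this sketch.
SAMPLE = (
    "<topic><prolog><indexterm>term one</indexterm></prolog>"
    "<body><p>See <indexterm>term one</indexterm> here.</p></body></topic>"
)

def to_tm(root):
    """Rename each <indexterm> according to its parent before sending it to TM."""
    for parent in root.iter():
        for child in parent:
            if child.tag == "indexterm":
                # Assumption: only direct children of <prolog> count as subflow.
                child.tag = SUBFLOW_TAG if parent.tag == "prolog" else INLINE_TAG
    return root

def from_tm(root):
    """Restore the original element name after translation."""
    for elem in root.iter():
        if elem.tag in (SUBFLOW_TAG, INLINE_TAG):
            elem.tag = "indexterm"
    return root

root = to_tm(ET.fromstring(SAMPLE))
print([e.tag for e in root.iter()])
print([e.tag for e in from_tm(root).iter()])
```

Because the renaming is reversible, the source and translated topics would still contain only <indexterm>; the artificial names would be visible to the translation tool alone.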
Best
wishes,
Bruce Esrig
-----Original Message-----
From: JoAnn Hackos [mailto:joann.hackos@comtech-serv.com]
Sent: Sunday, July 16, 2006 10:49 PM
To: esrig@lucent.com
Subject: FW: FW: [dita] indexing question
Bruce,
I think you'll find Rodolfo's email clarifies the action of the translation
tools. We're meeting on this tomorrow. Please send me comments if you cannot
attend.
JoAnn
JoAnn T.
Hackos, PhD
From: Rodolfo M. Raya [mailto:rodolfo@heartsome.net]
Sent: Sunday, July 16, 2006 7:41 AM
To: JoAnn Hackos
Cc: Andrzej Zydron
Subject: Re: FW: [dita] indexing question

On Sat, 2006-07-15 at 15:33 -0600, JoAnn Hackos wrote:
I thought I would forward one of the recent
threads regarding indexing in the TC. Do you see any potential problems with
regard to translation in these proposals? Please let me know more about the
issue with index terms in the prolog pointing to the topic and index terms in
block elements. I believe you referred to the first instance as a breaking
element and the second as a subflow. Is it possible to get a definition and
example of a breaking element and a subflow?
Hi Joann,
Let me start with an explanation of element types (classified from a
translation tool's point of view). Consider this XML fragment:

<table>
  <row>
    <col>
      <p>Segment one.</p>
    </col>
    <col>
      <p>Segment two. Second sentence.</p>
      <p>Segment with <b font="Times">bold</b> text.</p>
      <p>Segment with <footnote>some comment</footnote> footnote.</p>
    </col>
    <col>
    </col>
  </row>
</table>
Six segments can be extracted for translation from the example:

1. Segment one.
2. Segment two.
3. Second sentence.
4. Segment with «1»bold«2» text.
5. Segment with «1» footnote.
6. some comment
We can classify the elements present in the fragment as:

Breaking (example: <p>)
    Elements that contain text fragments that should be analysed as a unit. A
    new segment should be created whenever this kind of element is found at
    text extraction time. In CAT tools maker jargon, it "breaks" the segment
    being processed and starts a new one.

Inline (example: <b>)
    Elements that delimit text fragments that should be analysed as part of the
    text from the parent element. These elements usually delimit changes in
    style.

Subflow (example: <footnote>)
    Elements that contain text fragments that should be analysed separately.
    Processing of the enclosing segment does not end. The element is replaced
    by a marker in the text and its processing is delayed.

Ignorable (examples: <table>, <row>, <col>)
    Elements that are not supposed to contain translatable text and can be
    discarded, except when they appear as children of breaking elements, in
    which case they should be regarded as "inline".
In the example given above, the element <p> is considered a "breaking" element
because it encloses text that should be extracted as a unit. Notice that a <p>
element may contain several sentences and require additional processing based
on grammar rules that are independent of XML markup (see items 2 and 3 in the
list of segments).

The element <b> is considered "inline" because it does not contain text that
needs to be translated on its own. The text from this element is supposed to
belong to a bigger fragment. The XML markup of inline elements is irrelevant to
translators and is replaced by "tags" in the extracted text («1» and «2» in
item 4 of the list of segments).

The element <footnote> contains text that can be considered a translation unit
on its own. Its content is related to the enclosing text, but it isn't part of
the enclosing text. At extraction time the content of a "subflow" element is
placed in its own segment and a "tag" is added in the segment that contains the
original context to mark the location of the material that has been separated
(see items 5 and 6 in the list).
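
As a rough illustration of the four element types, the following Python sketch extracts the six segments from the sample fragment. The tag sets, the function names, and the naive sentence splitter are assumptions made for this example; they are not taken from any particular CAT tool.

```python
import re
import xml.etree.ElementTree as ET

BREAKING = {"p"}         # start a new segment
INLINE = {"b"}           # listed for completeness; stays in the segment
SUBFLOW = {"footnote"}   # replaced by one marker, extracted separately
# Every other element is treated as ignorable in this sketch.

SAMPLE = (
    "<table><row><col><p>Segment one.</p></col>"
    "<col><p>Segment two. Second sentence.</p>"
    '<p>Segment with <b font="Times">bold</b> text.</p>'
    "<p>Segment with <footnote>some comment</footnote> footnote.</p></col>"
    "<col></col></row></table>"
)

def flatten(elem, counter, deferred):
    """Return the text of elem with child markup replaced by numbered markers."""
    parts = [elem.text or ""]
    for child in elem:
        counter[0] += 1
        parts.append(f"«{counter[0]}»")
        if child.tag in SUBFLOW:
            deferred.append(child)  # content is processed later, on its own
        else:
            # Inline, or ignorable inside a breaking element (treated as inline).
            parts.append(flatten(child, counter, deferred))
            counter[0] += 1
            parts.append(f"«{counter[0]}»")
        parts.append(child.tail or "")
    return "".join(parts)

def extract_unit(elem):
    """Extract the segments of one breaking or subflow element."""
    counter, deferred = [0], []
    text = " ".join(flatten(elem, counter, deferred).split())
    # Naive sentence split; real tools apply language-specific rules.
    segments = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]
    for sub in deferred:
        segments.extend(extract_unit(sub))
    return segments

def extract_segments(root):
    """Walk the tree, skipping ignorable elements until a breaking one is found."""
    segments = []
    def walk(elem):
        if elem.tag in BREAKING:
            segments.extend(extract_unit(elem))
        else:
            for child in elem:
                walk(child)
    walk(root)
    return segments

for number, segment in enumerate(extract_segments(ET.fromstring(SAMPLE)), 1):
    print(f"{number}. {segment}")
```

Run against the sample fragment, this prints the same six segments listed earlier, with the footnote content following the segment that holds its marker.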
When we were discussing this last Monday, I initially believed that <indexterm>
was considered an "inline" element. Near the end of the talk someone clarified
that <indexterm> is a "subflow" element.

Let me try to explain with examples what would go wrong if <indexterm> were
always treated as an "inline" or always as a "subflow" element.

<topic>
  <prolog>
    <indexterm>term one</indexterm>
    <indexterm>term two</indexterm>
  </prolog>
  <body>
    <p>Paragraph that contains <indexterm>term one</indexterm>
    and <indexterm>term two</indexterm> inside.</p>
  </body>
</topic>
A) If <indexterm> is treated as "inline", we get these segments after text
extraction:

1. «1»term one«2»«3»term two«4»
2. Paragraph that contains «1»term one«2» and «3»term two«4» inside.

In this case, the translation of segment 1 cannot be reused for translating
segment 2.

B) If <indexterm> is considered a "subflow" element, we will get these strings:

1. term one
2. term two
3. Paragraph that contains «1» and «2» inside.
4. term one
5. term two

In this case, translation of segment 3 becomes complicated because the sentence
lacks relevant portions.

C) If we treat <indexterm> as "subflow" or "breaking" when it is a child of
<prolog> and as "inline" anywhere else, we get these strings:

1. term one
2. term two
3. Paragraph that contains «1»term one«2» and «3»term two«4» inside.

In this case, translations of segments 1 and 2 can be reused as terminology
entries when translating segment 3.
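
The three cases can be compared with a small Python sketch. It rests on several assumptions that are not in the discussion itself: the policy names "inline", "subflow" and "context", the function names, treating <prolog> and <p> as the units of extraction, and dropping segments that contain nothing but markers.

```python
import re
import xml.etree.ElementTree as ET

TOPIC = (
    "<topic><prolog><indexterm>term one</indexterm>"
    "<indexterm>term two</indexterm></prolog>"
    "<body><p>Paragraph that contains <indexterm>term one</indexterm> "
    "and <indexterm>term two</indexterm> inside.</p></body></topic>"
)

UNITS = {"prolog", "p"}  # assumption: each of these is extracted as one unit

def indexterm_role(policy, parent_tag):
    """Case C ("context") depends on the parent; cases A and B do not."""
    if policy == "context":
        return "subflow" if parent_tag == "prolog" else "inline"
    return policy  # "inline" (case A) or "subflow" (case B)

def flatten(elem, policy, counter, deferred):
    """Return elem's text with child markup replaced by numbered markers."""
    parts = [elem.text or ""]
    for child in elem:
        role = "inline"
        if child.tag == "indexterm":
            role = indexterm_role(policy, elem.tag)
        counter[0] += 1
        parts.append(f"«{counter[0]}»")
        if role == "subflow":
            deferred.append(child)  # extracted later as its own segment
        else:
            parts.append(flatten(child, policy, counter, deferred))
            counter[0] += 1
            parts.append(f"«{counter[0]}»")
        parts.append(child.tail or "")
    return "".join(parts)

def extract(xml_text, policy):
    """Return the list of segments produced under the given policy."""
    segments = []
    def unit(elem):
        counter, deferred = [0], []
        text = " ".join(flatten(elem, policy, counter, deferred).split())
        if re.sub(r"«\d+»", "", text).strip():  # skip marker-only segments
            segments.append(text)
        for sub in deferred:
            unit(sub)
    def walk(elem):
        if elem.tag in UNITS:
            unit(elem)
        else:
            for child in elem:
                walk(child)
    walk(ET.fromstring(xml_text))
    return segments

for policy in ("inline", "subflow", "context"):
    print(policy, extract(TOPIC, policy))
```

Under these assumptions the "inline", "subflow" and "context" policies reproduce the segment lists of cases A, B and C respectively.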
In my opinion, case C) is the best one. If stating that an element should be
classified differently according to context is difficult (and I guess it is),
then case A) should be considered the more reasonable alternative.

Finally, the discussion of indexes and page ranges that took place on the main
DITA list is irrelevant from a translation point of view. It doesn't matter
whether an index covers one or more topics/pages.
Best
regards,
Rodolfo