dita-translation message
Subject: indexing question
- From: "JoAnn Hackos" <joann.hackos@comtech-serv.com>
- To: <dita-translation@lists.oasis-open.org>,<cwong@idiominc.com>,<mambrose@sdl.com>,<bhertz@sdl.com>,"Bryan Schnabel" <bryan.s.schnabel@tek.com>,<charles_pau@us.ibm.com>,<christian.lieske@sap.com>,<dpooley@sdl.com>,<dschell@us.ibm.com>,<esrig@lucent.com>,<fsasaki@w3.org>,<rfletcher@sdl.com>,"Howard.Schwartz" <Howard.Schwartz@trados.com>,"Jennifer Linton" <jennifer.linton@comtech-serv.com>,<ishida@w3.org>,<tony.jewtushenko@productinnovator.com>,<KARA@CA.IBM.COM>,<ysavourel@translate.com>
- Date: Mon, 17 Jul 2006 06:44:44 -0600
Hello SC members,
This is Bruce Esrig's response to Rodolfo's email.
Rodolfo's original email is below.
Bruce is one of the architects of the 1.1 indexing proposal.
JoAnn
Hi JoAnn,
Every two weeks I have a meeting at 10 Eastern that runs for an hour or two.
This is one of those weeks, and it could be a long meeting. So I suspect I
won't be able to attend the translation SC meeting this week.
1. Thanks to Rodolfo for explaining "breaking" and the related terms. Is he
willing to have his message posted in part or whole to the main DITA list? May I
distribute it within Lucent?
2. I agree that treating <indexterm> as a subflow in the prolog and as an
inline elsewhere is best among the alternatives presented. Would the translation
SC be opposed to specifying that <indexterm> is filtered on the way
to/from TM in order to distinguish an <indexterm> that is to be treated as
a subflow from an <indexterm> that is to be treated as an inline?
This could be done by creating up to two artificial elements,
<indextermsubflow> and <indexterminline>, that are used only in the
TM processing.
If it were possible to distinguish between subflow and inline uses
of indexterm, then DITA could also offer the following enhancement: add an
attribute to suppress printing in inline contexts, such as <indexterm
print="no">. This takes advantage of the ability to distinguish between a
subflow and an inline. If print="no" is specified, then in an inline context,
the <indexterm> would be treated as a subflow.
3. The translation SC might wish (especially if the filtering proposal is not
feasible) to recommend a special element <indextermprolog>. The
default treatment of <indexterm> would be as an inline, but
<indextermprolog> would be treated as a subflow. Since DITA 1.1 is
expected to be backward compatible, <indextermprolog> could be an optional
alternative to <indexterm> in prolog contexts in DITA 1.1. Subsequently,
<indextermprolog> could become the standard element for use in prolog
contexts. This approach would still leave room for the print="no"
enhancement.
4. I'm delighted that Rodolfo separates out the issue of multiple-topic ranges.
If needed, the translation SC could still discuss for approval or disapproval
the point of view that ... those groups that want to support index ranges that
span multiple processes will have to take responsibility for ensuring that their
translation processes support it. For example, such groups could extract their
index range data in advance, translate it in advance, and submit the translated
data with the main body of material to be translated.
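The filtering idea in point 2 could be sketched roughly as follows. This is
only an illustration in Python: the function names to_tm and from_tm are
hypothetical, and it assumes, as in the simplified example in Rodolfo's
message, that prolog index terms are direct children of <prolog>.

```python
import xml.etree.ElementTree as ET

# Artificial element names taken from the proposal; used only during TM processing.
SUBFLOW_NAME = "indextermsubflow"
INLINE_NAME = "indexterminline"


def to_tm(root):
    """Rename each <indexterm> by context before content goes to the TM tool."""
    for parent in root.iter():
        for child in parent:
            if child.tag == "indexterm":
                # Assumption: only direct children of <prolog> count as prolog terms.
                child.tag = SUBFLOW_NAME if parent.tag == "prolog" else INLINE_NAME
    return root


def from_tm(root):
    """Restore the original element name after translation."""
    for elem in root.iter():
        if elem.tag in (SUBFLOW_NAME, INLINE_NAME):
            elem.tag = "indexterm"
    return root


SAMPLE = (
    "<topic><prolog><indexterm>term one</indexterm></prolog>"
    "<body><p>Text with <indexterm>term one</indexterm> inside.</p></body></topic>"
)

root = ET.fromstring(SAMPLE)
filtered = ET.tostring(to_tm(root), encoding="unicode")
restored = ET.tostring(from_tm(root), encoding="unicode")
print(filtered)
print(restored == SAMPLE)
```

With a filter of this kind the translation tool only needs a fixed
classification per element name, and the round trip leaves the DITA source
unchanged.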
Best wishes,
Bruce Esrig
Bruce,
I think you'll find Rodolfo's email clarifies the action
of the translation tools. We're meeting on this tomorrow. Please send me
comments if you cannot attend.
JoAnn
On Sat, 2006-07-15 at 15:33 -0600, JoAnn Hackos wrote:
I thought I would forward one of the recent threads regarding indexing in the
TC. Do you see any potential problems with regard to translation in these
proposals? Please let me know more about the issue with index terms in the
prolog pointing to the topic and index terms in block elements. I believe you
referred to the first instance as a breaking element and the second as a
subflow. Is it possible to get a definition and example of a breaking element
and a subflow?
Hi JoAnn,
Let me start with an explanation of element types (classified from the point
of view of translation tools). Consider this XML fragment:
<table>
  <row>
    <col>
      <p>Segment one.</p>
    </col>
    <col>
      <p>Segment two. Second sentence.</p>
      <p>Segment with <b font="Times">bold</b> text.</p>
      <p>Segment with <footnote>some comment</footnote> footnote.</p>
    </col>
    <col>
    </col>
  </row>
</table>
Six segments can be extracted for translation from the example:
1. Segment one.
2. Segment two.
3. Second sentence.
4. Segment with «1»bold«2» text.
5. Segment with «1» footnote.
6. some comment
We can classify the elements present in the fragment as:
Type | Description | Example
Breaking | Elements that contain text fragments that should be analysed as a unit. A new segment should be created whenever this kind of element is found at text extraction time. In CAT tools maker jargon, it "breaks" the segment being processed and starts a new one. | <p>
Inline | Elements that delimit text fragments that should be analysed as part of the text from the parent element. These elements usually delimit changes in style. | <b>
Subflow | Elements that contain text fragments that should be analysed separately. Processing of the enclosing segment does not end. The element is replaced by a marker in the text and its processing is delayed. | <footnote>
Ignorable | Elements that are not supposed to contain translatable text and can be discarded, except when they appear as children of breaking elements, in which case they should be regarded as "inline". | <table>, <row>, <col>
In the example given above, the element <p> is considered a "breaking" element
because it encloses text that should be extracted as a unit. Notice that a <p>
element may contain several sentences and require additional processing based
on grammar rules that are independent of XML markup (see items 2 and 3 in the
list of segments).
The element <b> is considered "inline" because it does not contain text that
needs to be translated on its own. The text from this element is supposed to
belong to a bigger fragment. The XML markup of inline elements is irrelevant to
translators and it is replaced by "tags" in the extracted text («1» and «2» in
item 4 of the list of segments).
The element <footnote> contains text that can be considered a translation unit
on its own. Its
content is related to the enclosing text, but it isn't part of the enclosing
text. At extraction time the content of a "subflow" element is placed in its
own segment and a "tag" is added in the segment that contains the original
context to mark the location of the material that has been separated (see
items 5 and 6 from the list).
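As a rough illustration of the four categories, the following Python sketch
extracts segments from the fragment above. The category sets, the helper names
(flatten, split_sentences, extract) and the naive sentence splitter are
assumptions made for this example; they do not describe the behaviour of any
particular CAT tool.

```python
import re
import xml.etree.ElementTree as ET

# Assumed classification for the sample vocabulary.
# Any element not listed here is treated as ignorable.
BREAKING = {"p"}
SUBFLOW = {"footnote"}

FRAGMENT = """<table><row>
<col><p>Segment one.</p></col>
<col>
<p>Segment two. Second sentence.</p>
<p>Segment with <b font="Times">bold</b> text.</p>
<p>Segment with <footnote>some comment</footnote> footnote.</p>
</col>
<col></col>
</row></table>"""


def flatten(elem, counter, deferred):
    """Return the text of elem with child markup replaced by numbered tags."""
    text = elem.text or ""
    for child in elem:
        if child.tag in SUBFLOW:
            # Leave a single marker and postpone the subflow content.
            counter[0] += 1
            text += f"«{counter[0]}»"
            deferred.append(child)
        else:
            # Inline, or ignorable inside a breaking element: keep the text
            # in place, wrapped in an opening and a closing tag.
            counter[0] += 1
            opening = f"«{counter[0]}»"
            inner = flatten(child, counter, deferred)
            counter[0] += 1
            text += f"{opening}{inner}«{counter[0]}»"
        text += child.tail or ""
    return text


def split_sentences(text):
    """Naive split after '.', '!' or '?'; real tools use per-language rules."""
    parts = re.split(r"(?<=[.!?])\s+", " ".join(text.split()))
    return [part for part in parts if part]


def extract(elem, segments):
    """Walk the tree and append extracted segments in document order."""
    if elem.tag in BREAKING or elem.tag in SUBFLOW:
        deferred = []
        segments.extend(split_sentences(flatten(elem, [0], deferred)))
        for subflow in deferred:
            extract(subflow, segments)
    else:
        # Ignorable element: discard it and look at its children.
        for child in elem:
            extract(child, segments)
    return segments


segments = extract(ET.fromstring(FRAGMENT), [])
for number, segment in enumerate(segments, start=1):
    print(f"{number}. {segment}")
```

Under these assumptions the sketch prints the same six segments as the list
above, with the footnote text emitted right after the segment that holds its
marker.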
During our discussion last Monday, I initially believed that <indexterm> was
considered an "inline" element. Near the end of the talk someone clarified that
<indexterm> is a "subflow" element.
Let me try to explain with examples what would go wrong if <indexterm> were
always treated as an "inline" element or always as a "subflow" element.
<topic>
  <prolog>
    <indexterm>term one</indexterm>
    <indexterm>term two</indexterm>
  </prolog>
  <body>
    <p>Paragraph that contains <indexterm>term one</indexterm>
    and <indexterm>term two</indexterm> inside.</p>
  </body>
</topic>
A) If <indexterm> is treated as "inline", we get these segments after text
extraction:
1. «1»term one«2»«3»term two«4»
2. Paragraph that contains «1»term one«2» and «3»term two«4» inside.
In this case, the translation of segment 1 cannot be reused for translating
segment 2.
B) If <indexterm> is considered a "subflow" element, we will get these
strings:
1. term one
2. term two
3. Paragraph that contains «1» and «2» inside.
4. term one
5. term two
In this case, translation of segment 3 becomes complicated because the sentence
lacks relevant portions.
C) If we treat <indexterm> as "subflow" or "breaking" when it is a child of
<prolog> and as "inline" anywhere else, we get these strings:
1. term one
2. term two
3. Paragraph that contains «1»term one«2» and «3»term two«4» inside.
In this case, translations of segments 1 and 2 can be reused as terminology
entries when translating segment 3.
In my opinion, case C) is the best one. If stating that an
element should be classified differently according to context is difficult
(and I guess it is), then case A) should be considered the more
reasonable alternative.
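To make case C) concrete, here is a minimal Python sketch of a classification
that depends on the parent element. The function names (classify, flatten,
extract) are invented for this illustration, and the rule only covers the
elements used in the example topic.

```python
import xml.etree.ElementTree as ET

TOPIC = """<topic>
<prolog>
<indexterm>term one</indexterm>
<indexterm>term two</indexterm>
</prolog>
<body>
<p>Paragraph that contains <indexterm>term one</indexterm>
and <indexterm>term two</indexterm> inside.</p>
</body>
</topic>"""


def classify(tag, parent_tag):
    """Case C rule: <indexterm> breaks in the prolog and is inline elsewhere."""
    if tag == "p":
        return "breaking"
    if tag == "indexterm":
        return "breaking" if parent_tag == "prolog" else "inline"
    return "ignorable"


def flatten(elem):
    """Collect the text of a breaking element, tagging its children as inline."""
    counter = 0
    text = elem.text or ""
    for child in elem:  # simplification: the sample has no nested markup
        text += f"«{counter + 1}»{child.text or ''}«{counter + 2}»"
        text += child.tail or ""
        counter += 2
    return " ".join(text.split())


def extract(elem, parent_tag=None, segments=None):
    """Return the list of segments found under elem, in document order."""
    segments = [] if segments is None else segments
    if classify(elem.tag, parent_tag) == "breaking":
        segments.append(flatten(elem))
    else:
        for child in elem:
            extract(child, elem.tag, segments)
    return segments


segments = extract(ET.fromstring(TOPIC))
for number, segment in enumerate(segments, start=1):
    print(f"{number}. {segment}")

# The prolog terms reappear verbatim inside the paragraph segment,
# so their translations can serve as terminology entries for it.
reusable = all(term in segments[2] for term in segments[:2])
print(reusable)
```

The only difference from a fixed classification is that classify receives the
parent tag, which is the kind of context a tool would need in order to support
case C).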
Finally, the discussion of indexes and page ranges that happened on the main
DITA list is irrelevant from a translation point of view. It doesn't matter if
an index covers one or more topics/pages.
Best regards,
Rodolfo