dita-translation message

Subject: Re: [dita-translation] From Bruce Esrig: Notes on the acronym proposal
From: Robert D Anderson <robander@us.ibm.com>
To: "JoAnn Hackos" <joann.hackos@comtech-serv.com>
Date: Mon, 2 Jul 2007 13:10:05 -0500

Hi Bruce, JoAnn, and others,

I am a little confused by one of the points here:
> If the short form can appear second in non-declining languages and first
> in declining languages, then it is tempting to create a
> language-specific processing behavior. If the short form and expanded
> form are kept in separate elements, then processing can present them in
> an order appropriate to the language.
>
> An objection to this is that this capability would have to be
> implemented in all conforming DITA processing systems. There are DITA
> processing systems that are modifications of the DITA Open Toolkit or
> completely independent implementations from it. If a language-specific
> processing behavior is defined, it would not be sufficient to implement
> it only in the DITA Open Toolkit.

I worry that we are trying to design the language such that no processor
has to have any knowledge of the <acronym> element. While this would make
my own life easier with regard to the toolkit, it feels like the wrong
thing to do. In general, processors expect to implement something as part
of supporting a new element. Conforming DITA processors do this all the
time. Some elements are easy, such as <b>, which only requires a bit of
highlighting. Others are more difficult, such as <properties>, which is
expected to display as a table with localized default headings. Creating a
localized rule for acronyms seems similar to supporting an element that
requires a new localized string.

Here is my understanding, with regard to an acronym's first occurrence:
* Some languages use the long form first
* Some languages use the short form first, to get around declension or
capitalization problems
* It is not the specification's role to mandate which language does what;
that is left up to renderers.

* Processors necessarily support a limited number of languages (for
generated text, display direction, etc)
* A processor should be able to discover the general acronym preference for
each language that it already supports. This is done once, before the
product release.
* A processor MAY allow users to override the setting for one or all
languages

* Difficulties only arise when a language needs a mix of short-first and
long-first. We at the Translation SC must then determine:
1. For those languages, is it still appropriate to specify a general rule -
95% do one thing, with some exceptions? We would seem to need a way to mark
those exceptions.
2. Is there ever a case where no rule can be specified - 50% do one, and
50% do another? This would be more difficult to accommodate in markup.
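
As a rough illustration of the points above, a processor's rule could be
as small as a per-language lookup with an optional user override and an
optional per-element exception. This is only a sketch: the function name,
the table, and the "long-first"/"short-first" values are hypothetical, and
the language entries are placeholders rather than researched preferences.

```python
# Placeholder defaults: the real preference for each language would have to
# be researched; these entries are assumptions for illustration only.
DEFAULT_ORDER = {"en": "long-first", "pl": "short-first"}
VALID_ORDERS = {"long-first", "short-first"}


def first_occurrence(short, expanded, lang, overrides=None, exception=None):
    """Render an acronym's first occurrence for a language.

    Precedence: per-element exception, then user override for the
    language, then the processor's built-in default for the language.
    """
    primary = lang.split("-")[0].lower()  # "pl-PL" -> "pl"
    order = (
        exception
        or (overrides or {}).get(primary)
        or DEFAULT_ORDER.get(primary, "long-first")
    )
    if order not in VALID_ORDERS:
        raise ValueError(f"unknown order: {order!r}")
    if order == "short-first":
        return f"{short} ({expanded})"
    return f"{expanded} ({short})"


print(first_occurrence("HIV", "human immunodeficiency virus", "en"))
# human immunodeficiency virus (HIV)
```

The exception argument stands in for whatever markup we might choose for
the "95% do one thing" case; the 50/50 case would not fit this model well.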

As someone who works on a DITA processing system, I find that creating a
rule for each language that I already support seems pretty
straightforward. I do not think that the toolkit is different from other
processing systems, which makes me wonder about this objection. Am I
missing something in my summary?
My understanding of acronym issues is based mostly on the discussions we've
had in this group, so is there more to this than I realize?

Thanks -

Robert D Anderson
IBM Authoring Tools Development
Chief Architect, DITA Open Toolkit
(507) 253-8787, T/L 553-8787 (Good Monday & Thursday)


From:    "JoAnn Hackos" <joann.hackos@comtech-serv.com>
Date:    06/29/2007 08:08 AM
To:      <dita-translation@lists.oasis-open.org>, <mambrose@sdl.com>,
         <bhertz@sdl.com>, "Bryan Schnabel" <bryan.s.schnabel@tek.com>,
         Charles Pau/Cambridge/IBM@Lotus, <christian.lieske@sap.com>,
         <dpooley@sdl.com>, Dave A Schell/Raleigh/IBM@IBMUS,
         <esrig-ia@esrig.com>, <fsasaki@w3.org>, <rfletcher@sdl.com>,
         "Howard.Schwartz" <Howard.Schwartz@trados.com>, <ishida@w3.org>,
         <tony.jewtushenko@productinnovator.com>, <KARA@CA.IBM.COM>,
         <ysavourel@translate.com>
cc:
Subject: [dita-translation] From Bruce Esrig: Notes on the acronym
         proposal

-----Original Message-----
From: Bruce Esrig [mailto:esrig@alumni.princeton.edu]
Sent: Friday, June 29, 2007 3:43 AM
To: JoAnn Hackos
Subject: Re: Notes on the acronym proposal

Hi JoAnn,

Could you echo this to the list?

When it's time to finalize this proposal, it might be helpful to have a
wiki page that everyone can see (for example a temporary wiki page at
dita.xml.org) that contains the current proposal, so that when a change
is agreed upon, people can see the current proposal by using refresh in
their browsers.

Bruce

========

1. //Language-by-language rules
         We have tried to avoid creating language-by-language rules
         because the processing overhead is not trivial. Kara's
         recommendation would require a language-by-language rule.//

To explain this note more fully:

Language-by-language rules arise when some languages require declension
of words in the expanded form, while others do not. In languages that do
not, it is feasible to show the expanded form first with the short form
in parentheses. Some style guides require this. In languages that do,
there is an advantage to putting the short form first, since then the
language rules would (in most languages?) permit the short form not to
be declined. This is good for translation because a single instance of
the term can be maintained in the terminology base and used in multiple
grammatical contexts.

If the short form can appear second in non-declining languages and first
in declining languages, then it is tempting to create a
language-specific processing behavior. If the short form and expanded
form are kept in separate elements, then processing can present them in
an order appropriate to the language.

An objection to this is that this capability would have to be
implemented in all conforming DITA processing systems. There are DITA
processing systems that are modifications of the DITA Open Toolkit or
completely independent implementations from it. If a language-specific
processing behavior is defined, it would not be sufficient to implement
it only in the DITA Open Toolkit.

Note, however, that the DITA Open Toolkit serves precisely this purpose
in the community: to demonstrate that the requirements in the
specification can be met, and to provide a reference implementation that
does meet those requirements. We would want to know from other vendors
how burdensome they would find a language-specific rule if it were "the
right thing to do".

As an alternative, a flag of some sort could be used to indicate whether
to present the short form or expanded form first. The suggestion on the
call, to put this flag at the element level, would be difficult to
maintain since all elements would behave the same way within a given
language. It's better to put the dependency at a global level, either in
a flag that controls the order or in a deduction that is made
automatically once the language is known. Basing the order on the
language is more reliable since otherwise the flag has to be set
correctly when the processing job is set up.

However, the flag may be required as an override to the default for a
language. The override may need to be language-specific. Some
non-declining languages may need to support two orders depending on what
the local style guide says about ordering in that language. Another
alternative that was not on the table in the most recent discussion is
to implement acronyms in only one way, with the short form first.

Regarding using a combined form and extracting the pieces from it, there
is still the requirement to know which order to present the pieces in.
This means that there is no reason to break the XML convention of
putting separate pieces of information in separate elements in the
source.

2. In case it helps to look at what we're simplifying away from ...
Another case that the current proposal does not support is versioning.
Suppose that a terminology bank has multiple historically-accurate but
time-bounded entries for a term.

An example that comes to mind is described in
http://en.wikipedia.org/wiki/Timeline_of_AIDS, namely: "1986: HIV (human
immunodeficiency virus) is adopted as name of the retrovirus that was
first proposed as the cause of AIDS by Luc Montagnier of France, who
named it LAV (lymphadenopathy associated virus) and Robert Gallo of the
United States, who named it HTLV-III (human T-lymphotropic virus type
III) ".

If we wished to record a relationship among these terms in the source in
DITA, we would need two IDs for the term: one for the surface form and
one for the meaning. The meaning is the term bank entry that recognizes
the connection among the alternate surface forms. This could be done by
treating the ID in the term as a reference to the surface form and
providing the ID for the meaning, when required, within a <data> element
nested within the outermost <acronym> element.

A passage that referred to multiple terms would do so by using each
surface form as needed, and indicating the connection among them using
the <data> element.

The DITA markup for this passage would be:

<p>1986</p>
<ul><li>
   <acronym id="hiv-current">
       <!-- id names the surface form; <data> names the shared meaning -->
       <data name="meaning" value="hiv"/>
       <short>HIV</short>
       <expanded>human immunodeficiency virus</expanded>
   </acronym>
is adopted as name of the retrovirus that was first proposed as the
cause of AIDS by Luc Montagnier of France, who named it
   <acronym id="hiv-montagnier">
       <data name="meaning" value="hiv"/>
       <short>LAV</short>
       <expanded>lymphadenopathy associated virus</expanded>
   </acronym>
and Robert Gallo of the United States, who named it
   <acronym id="hiv-gallo">
       <data name="meaning" value="hiv"/>
       <short>HTLV-III</short>
       <expanded>human T-lymphotropic virus type III</expanded>
   </acronym>.
</li></ul>

According to this markup, <data> would need to be permitted within
<keyword> since <acronym> is proposed as a specialization of <keyword>.
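
To show how a term-bank tool might use the two IDs, here is a minimal
sketch that groups surface forms by their meaning. It assumes the proposed
<acronym>, <short>, <expanded> and <data name="meaning"> markup, with
<data> written as an empty element so that the fragment is well-formed;
the function name and the sample string are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Condensed version of the passage; element names follow the proposal.
SAMPLE = """<li>
<acronym id="hiv-current"><data name="meaning" value="hiv"/>
  <short>HIV</short><expanded>human immunodeficiency virus</expanded></acronym>
<acronym id="hiv-montagnier"><data name="meaning" value="hiv"/>
  <short>LAV</short><expanded>lymphadenopathy associated virus</expanded></acronym>
<acronym id="hiv-gallo"><data name="meaning" value="hiv"/>
  <short>HTLV-III</short><expanded>human T-lymphotropic virus type III</expanded></acronym>
</li>"""


def group_by_meaning(xml_text):
    """Map each meaning ID to the (surface-form ID, short form) pairs using it."""
    root = ET.fromstring(xml_text)
    groups = defaultdict(list)
    for acronym in root.iter("acronym"):
        data = acronym.find("data[@name='meaning']")
        if data is None:
            continue  # no meaning recorded for this surface form
        groups[data.get("value")].append(
            (acronym.get("id"), acronym.findtext("short"))
        )
    return dict(groups)


print(group_by_meaning(SAMPLE))
```

All three surface forms end up under the single meaning "hiv", which is
the connection the <data> element is meant to record.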

3. We may need to do more work to unify the term-like elements in DITA.
As of February 2007, the DITA 1.1 Architecture guide treats "metadata"
separately, but seems to lack a thorough statement on terminology.
<keyword>, <indexterm>, and <term> have related behaviors, and may need
to be managed in parallel.

Applying this question to the <acronym> proposal ... As in the case of
the <keyword> element, the <data> element would probably need to be
supported within <term> and most likely <indexterm>.

At 12:54 PM 6/27/2007, you wrote:
>Hello Friends,
>I've added more notes to Gershon's meeting minutes. Andrzej and Rodolfo
>in particular, please review the notes. I've also asked Kara W to
>review the entire proposal since she had not yet read it.
>
>We have two proposals in the notes that we need to consider next week.
>Each involves adding a third element to our plan.
>
>Kara's primary concern seems to be with the post-processing for term
>extraction.
>
>I also wonder if that processing could not be revised to account for
>the acronym in the expanded form rather than adding complexity to this
>proposal.
>
>JoAnn
>
>JoAnn T. Hackos, PhD
>President
>Comtech Services, Inc.
>710 Kipling Street, Suite 400
>Denver, CO 80215
>303-232-7586
>joann.hackos@comtech-serv.com
>joannhackos Skype
>www.comtech-serv.com
>
>