Subject: Generated text in the DITA Open Toolkit
From: Robert D Anderson <robander@us.ibm.com>
To: dita@lists.oasis-open.org
Date: Mon, 9 Jan 2006 16:05:08 -0600

Hello all - I was asked to clarify how the DITA Open Toolkit works with
generated text, so here it is:

As most on this list know, generated text describes any standard text that
is not stored in the source files, but appears in the output. For example,
the text "Related information" that appears above links is generated by the
transform, as is the text "Note" that appears based on a <note> element. As
with most other XML systems, DITA encourages users to keep this common text
outside of the source files, and outside of the core transforms themselves.
The common text is retrieved when the DITA content is published, based on
the language setting in each document (or based on a default set in the
transforms).

In the toolkit, all generated text is kept in XML files outside of the
transform code. If you check in the xsl/common/ directory of the toolkit,
you will see that we have one string file for each supported locale. These
files are named strings-XX-YY.xml, where XX-YY is the locale value. We
chose to use a separate file for each locale based on the advice of many of
our translators, rather than storing every language in a single file.

There is also a file named strings.xml. This file is used to define what
languages are available, and what file should be used for each language.
For example, it indicates that a locale value of "en", "en-us", or "en-gb"
should all use the file strings-en-us.xml for lookup. If the generated text
for en-gb needs to differ at any point, a new file can be created, and the
reference in strings.xml will change.
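
As a rough illustration of the mapping that strings.xml provides, here is a
minimal Python sketch that reads a language list and builds a locale-to-file
table. The <langlist> root element and the function name load_language_map are
assumptions made for this sketch; only the <lang> entries follow the format
shown in this note.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Assumed shape of strings.xml; the <langlist> root is a guess for this sketch.
STRINGS_XML = """
<langlist>
  <lang xml:lang="en"    filename="strings-en-us.xml"/>
  <lang xml:lang="en-us" filename="strings-en-us.xml"/>
  <lang xml:lang="en-gb" filename="strings-en-us.xml"/>
</langlist>
"""


def load_language_map(xml_text):
    """Return a dict mapping each lower-cased locale to its string file."""
    root = ET.fromstring(xml_text)
    return {
        lang.get(XML_LANG).lower(): lang.get("filename")
        for lang in root.iter("lang")
    }


language_map = load_language_map(STRINGS_XML)
```

Giving en-gb its own strings would then only require pointing that one entry
at a new file; nothing else in the table changes.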

The primary reason for this redirection is to make it easier to find out
what languages are available, without trying to open files that do not
exist. For example, the toolkit does not yet have language support for the
"sa-in" locale. If that language is encountered, the XSLT should not try to
open the strings-sa-in.xml file without first checking whether that file
is available; otherwise, most parsers will generate missing file warnings.
We did not want to keep the list of supported languages in every XSLT
transform, for a couple of reasons. First, if any non-XSLT programs use the
translations, then the supported languages would have to be maintained in
multiple locations. Second, user extensions (with new translations) may not
support the same set of languages as the base toolkit.
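
The point of the redirection can be sketched in a few lines of Python: the
language list is consulted first, and no string file is opened for a locale
that is not listed. The table contents and the name resolve_strings_file are
hypothetical and stand in for whatever reads strings.xml.

```python
# Hypothetical in-memory copy of the language list; in the toolkit this
# information lives in strings.xml rather than in code.
SUPPORTED = {
    "en": "strings-en-us.xml",
    "en-us": "strings-en-us.xml",
    "sv": "strings-sv-se.xml",
    "sv-se": "strings-sv-se.xml",
}


def resolve_strings_file(locale, language_map=SUPPORTED):
    """Return the string file for a locale, or None if it is not listed.

    Returning None lets the caller skip the file open entirely, which is
    what avoids the missing-file warnings.
    """
    return language_map.get(locale.lower())


# A user extension can bring its own list with a different set of locales.
EXTENSION = {"sa-in": "music-sa-in.xml"}
```

Because the list is data rather than code, a non-XSLT program or a user
extension can use the same check with its own table.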

HOW THE LOOKUP IS PERFORMED
There is a common XSLT named template called getString, which is used to
look up each translation. This template is called with the name of the
lookup string as a parameter. For example, when generating a heading for
the next topic, the template is called as
<xsl:call-template name="getString">
  <xsl:with-param name="stringName" select="'Next topic'"/>
</xsl:call-template>

The getString template determines the currently active language. In most
cases we expect this to be at the level of the <topic> element, but it is
taken from the closest ancestor with an xml:lang attribute. Assume for this
explanation that the current topic is Swedish; so, the language is either
"sv" or "sv-se".
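
The "closest ancestor" rule can be sketched in Python as a walk up the tree.
This is only an illustration: the function name active_language is
hypothetical, and the sketch assumes that an xml:lang on the element itself
counts before any ancestor is checked.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def active_language(root, element):
    """Return xml:lang from the element or its closest ancestor, else None."""
    # ElementTree has no parent pointers, so build a child-to-parent map.
    parents = {child: parent for parent in root.iter() for child in parent}
    node = element
    while node is not None:
        lang = node.get(XML_LANG)
        if lang:
            return lang.lower()
        node = parents.get(node)
    return None


TOPIC = ET.fromstring(
    '<topic id="t1" xml:lang="sv-se"><body><note>Exempel</note></body></topic>'
)
note = TOPIC.find("body/note")
```

Here the <note> carries no xml:lang of its own, so the value is inherited
from the enclosing <topic>.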

The getString template also has a parameter that tells it where to look for
string information; by default, this is the strings.xml file. It will
search this file, and find that strings-sv-se.xml is the correct place to
find the current string:
  <lang xml:lang="sv"    filename="strings-sv-se.xml"/>
  <lang xml:lang="sv-se" filename="strings-sv-se.xml"/>

The indicated file contains the line:
  <str name="Next topic">Nästa avsnitt</str>

So, the getString template returns "Nästa avsnitt" as the translation.
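
The two-step lookup described above can be mimicked outside of XSLT. The
following Python sketch is not the toolkit's code; it keeps the two files in
memory, and the root element names (<langlist>, <strings>) and the get_string
signature are assumptions made for the illustration.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# In-memory stand-ins for the files on disk. The root element names are
# assumptions; the <lang> and <str> lines follow the examples in this note.
FILES = {
    "strings.xml": """<langlist>
  <lang xml:lang="sv"    filename="strings-sv-se.xml"/>
  <lang xml:lang="sv-se" filename="strings-sv-se.xml"/>
</langlist>""",
    "strings-sv-se.xml": """<strings xml:lang="sv-se">
  <str name="Next topic">Nästa avsnitt</str>
</strings>""",
}


def get_string(string_name, language, string_file_list="strings.xml", files=FILES):
    """Step 1: find the string file for the language. Step 2: find the string."""
    langlist = ET.fromstring(files[string_file_list])
    for lang in langlist.iter("lang"):
        if lang.get(XML_LANG, "").lower() == language.lower():
            strings = ET.fromstring(files[lang.get("filename")])
            for entry in strings.iter("str"):
                if entry.get("name") == string_name:
                    return entry.text
    # Unlisted language or unknown string: nothing is opened or returned.
    return None
```

Both "sv" and "sv-se" resolve to the same file, so either value yields the
same translation.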

ADDING NEW TRANSLATIONS
This mechanism was designed to make it possible to add new translations,
particularly in the case of specializations, without having to re-write the
lookup code. For example, assume that I have a music specialization to
describe my music collection. I have a table of bands and albums; so, I
want to generate the headers "Group" and "Albums". For my selection of
Swedish music, I've set the table to xml:lang="sv-se". So, how is this
done?

I've placed all of my XSL and string files in the toolkit directory
demo/music/xsl. When I call the getString template, I need to pass in two
parameters - the first instructs the template on where to look for
translations (relative to the getString template), and the second (as
before) is the string value. So, I pass in
<xsl:call-template name="getString">
  <xsl:with-param name="stringFileList">../../demo/music/xsl/musicstrings.xml</xsl:with-param>
  <xsl:with-param name="stringName">Group</xsl:with-param>
</xsl:call-template>

The standard getString template looks in my stringFileList instead of the
default location. That file tells me where to go for Swedish translations.
Note that my specialization can support the same languages as the toolkit,
or a subset, or a superset, depending on my needs. The file contains this:
    <lang xml:lang="sv-se" filename="music-sv-se.xml"/>

I then look for the string in the file music-sv-se.xml, and come up with
"Grupp":
  <str name="Group">Grupp</str>
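
To make the path handling concrete, here is a Python sketch that recreates the
directory layout in a scratch directory and resolves the custom list relative
to the directory that would hold getString. Several details are assumptions:
that getString lives in xsl/common, that filenames inside the list are
resolved relative to the list file, and the <langlist> and <strings> root
element names.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def get_string(string_name, language, string_file_list, base_dir):
    """Look up a string through a caller-supplied language list.

    base_dir plays the role of the directory holding the getString template.
    Assumption: filenames in the list are relative to the list file itself.
    """
    list_path = os.path.normpath(os.path.join(base_dir, string_file_list))
    for lang in ET.parse(list_path).getroot().iter("lang"):
        if lang.get(XML_LANG, "").lower() == language.lower():
            strings_path = os.path.join(os.path.dirname(list_path), lang.get("filename"))
            for entry in ET.parse(strings_path).getroot().iter("str"):
                if entry.get("name") == string_name:
                    return entry.text
    return None


def write(path, text):
    """Create parent directories as needed and write a UTF-8 text file."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)


# Recreate the layout from the example in a scratch directory.
toolkit = tempfile.mkdtemp()
common = os.path.join(toolkit, "xsl", "common")
os.makedirs(common)
music = os.path.join(toolkit, "demo", "music", "xsl")
write(os.path.join(music, "musicstrings.xml"),
      '<langlist><lang xml:lang="sv-se" filename="music-sv-se.xml"/></langlist>')
write(os.path.join(music, "music-sv-se.xml"),
      '<strings><str name="Group">Grupp</str></strings>')

result = get_string("Group", "sv-se", "../../demo/music/xsl/musicstrings.xml", common)
```

Since the music list only names sv-se, a request in any other language finds
no entry and no string file is opened.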

Of course, it would be much easier to understand all of this with an
example to look at. Erik Hennum has a working example of this as part of
his API Reference specialization, which will be available soon as a plugin
to the toolkit.

OTHER LOCALE PROBLEMS
The toolkit currently accounts for a couple of other locale issues when
generating text. The first is the need to rearrange word order. Currently,
this is only done for Hungarian captions; for example, "Table 1" in English
becomes "1 Táblázat" when translated. This is currently handled directly in
the XSLT code for tables and figures -- when the language is Hungarian, we
generate the number, followed by a space, followed by the table string;
otherwise, we use the string, followed by a space, followed by a number.
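
The word-order rule amounts to a single branch. A minimal Python sketch,
assuming a hypothetical caption helper and assuming that any locale whose
language part is "hu" counts as Hungarian:

```python
def caption(label, number, language):
    """Build a table or figure caption; Hungarian puts the number first."""
    # Assumption: "hu" and any "hu-XX" locale are both treated as Hungarian.
    if language.lower().split("-")[0] == "hu":
        return f"{number} {label}"
    return f"{label} {number}"
```

The label itself would still come from the string lookup; only the order of
the two parts is decided in code.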

The other issue is for French text, where colons in text like "Note:" must
be preceded by a space. In this case, we treat the colon as generated text,
which is retrieved by looking up the value for "ColonSymbol". For French
locales, this consists of a space followed by a colon, while for other
languages it is simply a colon.
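
Treating the colon as a looked-up string keeps the French rule out of the
code. In this Python sketch the string tables and the labelled helper are
hypothetical, and the French translation of "Note" is only illustrative; the
"ColonSymbol" name is the one described above.

```python
# Hypothetical per-locale string tables. Only the "ColonSymbol" name is taken
# from the description above; the other values are illustrative.
STRINGS = {
    "en-us": {"Note": "Note", "ColonSymbol": ":"},
    "fr-fr": {"Note": "Remarque", "ColonSymbol": " :"},
}


def labelled(name, language, strings=STRINGS):
    """Return the translated label followed by the locale's colon string."""
    table = strings[language]
    return table[name] + table["ColonSymbol"]
```

The caller never tests for French; the space before the colon arrives with
the translation.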

DEFAULT LANGUAGES
The Open Toolkit currently uses a default language of US English. This is
set using the DEFAULTLANG parameter (inside the dita-utilities.xsl file,
which also contains getString). To use a different default language, it is
only necessary to reset this value or pass a new locale value to the
transform as a parameter.
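
The fallback can be summarized in a short Python sketch. The function name
effective_language is hypothetical; DEFAULTLANG mirrors the parameter named
above, and the sketch assumes the document's own language always wins when
one is set.

```python
# Mirrors the DEFAULTLANG parameter; US English is the shipped default.
DEFAULTLANG = "en-us"


def effective_language(document_lang=None, default_lang=DEFAULTLANG):
    """Prefer the document's xml:lang; otherwise fall back to the default."""
    return (document_lang or default_lang).lower()
```

Passing a different default_lang corresponds to overriding the parameter when
the transform is invoked.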

The toolkit today supports 47 locales, representing 39 unique languages. It
appears that we ship files today that are not actually referenced - for
example, all English locales point to strings-en-us.xml in the lookup file,
but we ship string files for UK English and Canadian English. The extra
files are not used by the transform today.

New translations that are added as user extensions should be kept together
with the transform code that extends the toolkit. It is not a good idea to
place new translations in the toolkit's own string files, simply because
these files may be updated with new releases of the toolkit. As stated
above, user extensions can support as many or as few languages as needed.

I understand that this is all rather long and convoluted - so, I expect
there to be questions...

Robert D Anderson
IBM Authoring Tools Development
Chief Architect, DITA Open Toolkit