Localization Directives Requirement

Proposal, 11 June 2002

Abstract

This document is a proposal that outlines the possible requirements for a mechanism to specify localization directives in an arbitrary XML document.

1. Introduction

1.1. Purpose
1.2. Terminology
1.3. Examples

2. Implementation Constraints

3. Requirements

3.1. Identification of Localizable Parts
3.2. Identification of Non-Localizable Parts
3.3. Identification of Terms
3.4. Notes to Localizers
3.5. Allowed Characters
3.6. Unique Identifier
3.7. Indication of Container Size
3.8. External References
3.9. Segmentation
3.10. Language of the Content
3.11. White Space Handling
3.12. Identification of Changes
3.13. Identification of Datatypes

Appendices

A. References

1. Introduction

In order to localize XML data in a cost-effective and time-efficient manner, a number of conditions must exist. These conditions can be set in different ways:

By following specific guidelines and, possibly, using specific constructs when developing schemas or DTDs. These constructs are the localization properties of a document type.
By following specific guidelines and, possibly, using specific constructs when authoring XML documents. These constructs are the localization directives.

The set of requirements outlined in this document focuses on the localization directives.

1.1. Purpose

The purpose of the Localization Directives vocabulary is to provide a way to embed, within any XML document instance, information related to localization that is not specified at the document type level (i.e. in the schema or DTD), or information that should override document type level properties.

For example, a directive could be used to specify that a sequence of words is a term entry, or that the content of an element that is usually translatable is not translatable in a specific instance. The localization directives can be seen as a complement to the localization properties. In some cases, a localization directive overrides the default localization property.

This document outlines the requirements that an XML vocabulary implementing localization directives should take into account. Most of these requirements have already been outlined in the ITS Requirements document [ITS Req].

1.2. Terminology

This document uses the following terms:

Author:: The user that creates and modifies the source document.
Item:: In this context, an element or an attribute, and, by extension, its corresponding content or value.
Localization directive:: A piece of metadata providing localization-related information in a document instance.
Localization properties:: Information in a schema or DTD describing the localization constraints of elements and attributes.
Localizer:: The user responsible for the localization of the document.
Primary format:: The main document type of the source document.
Source document:: The XML document where the localization directives are inserted.

1.3. Examples

The examples provided with each requirement illustrate the context in which the requirement arises and how a possible localization directive could be implemented. They are only arbitrary examples of possible implementations (sometimes very different from each other).

For simplicity, the examples are often partial blocks of XML code without the proper namespace declarations. The imaginary localization directives markup uses the namespace prefix ld.

2. Implementation Constraints

An implementation of localization directives should address the following constraints:

It should be easy to map the directives to XLIFF constructs.
The directives should be usable in any XML document type that allows them.
The directives should take into account the needs of both documentation-type content and resource-type data.
The specification should establish a clear priority between information from the primary format and information from the localization directives. It should provide an unequivocal way to resolve conflicts between the two.
The specification should describe a clear mechanism to set the scope of a localization directive.
The specification should provide a schema for the Localization Directives vocabulary, written in such a way that it can be imported into other schemas.
The directives should be designed so that they can be re-used, after some simple adaptation, as a base for localization directives in non-XML applications (e.g. CSS style sheets, resource files, scripts, etc.).
It should be possible for tools to process the directives without knowing which authoring tool or vendor produced them.
The directives should be easy to implement for the developers of authoring tools.
The directives should be easy to implement for the developers of translation tools.

3. Requirements

This section describes the different requirements that should possibly be addressed with directives for localization.

The requirements are listed in no particular order. As stated previously, most of them have been taken directly or adapted from the ITS Requirements document [ITS Req].

3.1. Identification of Localizable Parts

It must be possible to specify to the localizer that a given item should be changed during localization. This may refer to a single character or a large chunk of data; it may refer to text data, structural items, or graphic or multimedia entities.

The method used should allow localization tools to automatically identify and isolate the specific data in question. It must be possible to apply this information to any element or range of text and to have that information depend on any other element, attribute, or combination of these.

Example of possible usage:

<data name="main_url">
 <value>http://www.xliff.org</value>
</data>
<data name="main_title" ld:localize="yes">
 <value>XLIFF Home</value>
</data>
<data name="str_dbconnect">
 <value>File Name=C:\SomePath\myDatabaseName.udl;</value>
</data>
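
The following Python sketch shows how a localization tool might isolate the items flagged in the example above. It is only an illustration under several assumptions: the namespace URI, the enclosing resources element, and the default of "not localizable" for unmarked items are all hypothetical and are not defined by this proposal.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI; the proposal does not define one.
LD_NS = "urn:example:localization-directives"

# The example above, wrapped in a hypothetical <resources> root element.
SAMPLE = f"""
<resources xmlns:ld="{LD_NS}">
 <data name="main_url">
  <value>http://www.xliff.org</value>
 </data>
 <data name="main_title" ld:localize="yes">
  <value>XLIFF Home</value>
 </data>
 <data name="str_dbconnect">
  <value>File Name=C:\\SomePath\\myDatabaseName.udl;</value>
 </data>
</resources>
"""


def localizable_items(xml_text, default="no"):
    """Return {name: text} for <data> items whose ld:localize is "yes".

    Items without the attribute fall back to `default` (an assumption).
    """
    root = ET.fromstring(xml_text)
    attr = f"{{{LD_NS}}}localize"
    result = {}
    for data in root.iter("data"):
        if data.get(attr, default) == "yes":
            result[data.get("name")] = data.findtext("value")
    return result


items = localizable_items(SAMPLE)
print(items)
```

With the assumed default, only the main_title item is extracted; the URL and the connection string are left alone.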

3.2. Identification of Non-Localizable Parts

It must be possible to specify to the localizer that a given item should not be changed during localization. This may refer to a single character or a large chunk of data; it may refer to text data, structural items, or graphic or multimedia entities.

Examples of possible usage:

<p>The following legal note should always appear at the top
   of the document:</p>
<p ld:localize="no">Copyright © 1420-2002 - Metallographic 
Printing GmbH.</p>

<p>See the <ld:span localize="no">XML Inclusions (XInclude) 
Version 1.0</ld:span> document for more information.</p>

<p>See the <ld:translate>XML Inclusions (XInclude) 
Version 1.0</ld:translate> document for more information.</p>

In some cases it may be useful to specify regular expression patterns that identify chunks of data in text that should not be localized. This applies, for example, when the data cannot be marked up or is inconvenient to mark up (server-side files, etc.).

<body>
 <ld:patterns>
  <ld:rule type="protect">\$#\w+\$</ld:rule>
 </ld:patterns>
 <p>Hello $#varName$. You are now logged into $#varHost$.</p>
...
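
As an illustration of how a tool might apply such a rule, the Python sketch below splits a text into protected and translatable chunks. The pattern used here is an assumption: it is meant to match placeholders of the form $#name$, and the function name split_protected is hypothetical.

```python
import re

# Assumed "protect" pattern: placeholders such as $#varName$.
PROTECT = re.compile(r"\$#\w+\$")


def split_protected(text, pattern=PROTECT):
    """Split text into (chunk, is_protected) pairs, in document order."""
    parts = []
    pos = 0
    for match in pattern.finditer(text):
        if match.start() > pos:
            # Translatable text before the protected chunk.
            parts.append((text[pos:match.start()], False))
        parts.append((match.group(0), True))
        pos = match.end()
    if pos < len(text):
        # Translatable text after the last protected chunk.
        parts.append((text[pos:], False))
    return parts


parts = split_protected("Hello $#varName$. You are now logged into $#varHost$.")
for chunk, protected in parts:
    print(repr(chunk), "protected" if protected else "translatable")
```

A translation tool could then lock the protected chunks, or replace them with inline codes, before presenting the text to the localizer.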

3.3. Identification of Terms

It should be possible to indicate that a given item or a span of text is a term.

Terminology and translation tools, as well as QA utilities, can make use of such markup to provide terminology lists, validate translations, and so forth.

Examples of possible usage:

<p>Here, the term <ld:term>localization directives</ld:term> 
refers to the markup that allows the developers to insert 
information for the localization team directly into the source 
files.</p>

<dt ld:term="yes">Localization directive</dt>
<dd>A piece of metadata providing localization-related information 
in a document instance.</dd>

3.4. Notes to Localizers

A method should exist for authors to communicate information to localizers about a particular item. There should be two such types of information:
a) notes that must be read before the localizer attempts to localize;
b) notes that provide optional background information.

The method should allow localization tools to automatically identify and isolate the specific data to which the note refers, and automatically distinguish between the two different types of note.

To help the translator achieve a correct translation, authors may need to provide information about the text that they have written. For example, the author may want to:

tell the translator how to translate part of the content
expand on the meaning or contextual usage of a particular element, such as what a variable refers to or how a string will be used on the UI
clarify ambiguity and show relationships between items sufficiently to allow correct translation (e.g. in many languages it is impossible to translate the word "enabled" in isolation without knowing the gender, number and case of the thing it refers to.)
explain why text is not translated, point to text re-use, or describe the use of conditional text
indicate why a piece of text is emphasized (important, sarcastic, etc.)
etc.

This can help translators avoid mistakes or avoid spending time searching for information.

Two types of developer's note are needed:

An alert - An alert contains information that the translator must read before translating a piece of text. The translation environment must bring this type of note to the attention of the translator before they begin to translate. (For example, an instruction to the translator to leave parts of the text in the source language.)
A description - A description provides useful background information that the translator will refer to only if they wish. (For example, a clarification of ambiguity in the source text). The translation tool would still make this available to the translator, but would not force them to read it before attempting a translation. The translator may only receive an indication that such a note exists and have to take action to view the text.

Examples of possible usage:

<data name="mnu_file">
 <ld:note priority="important">
  The word 'file' refers to the noun, not the verb
 </ld:note>
 <value>&amp;File</value>
</data>

<data name="welcome"
 ld:note="XMLF = XML Foundry">
 <value>Welcome to the XMLF project.</value>
</data>

<varDef name="onoff"
 ld:note-important="Values used in 'PowerSave mode: (on/off)'">
 <value id="onoff_1">ON</value>
 <value id="onoff_2">OFF</value>
</varDef>

3.5. Allowed Characters

It should be possible to restrict, for a given item, the kinds of characters that can be used in the translation. The expression of the allowed character class should be flexible enough to allow enumeration as well as the specification of ranges.

For example, a firmware interface may support only a limited range of characters because of font limitations; the full set of Japanese Kanji characters may therefore be unavailable, and the translation should be limited to ASCII, Hiragana and Katakana characters.

Example of possible usage:

<firmwareStringTable ld:charclass="ascii">
 ...
</firmwareStringTable>

<firmwareStringTable ld:charclass="U+00??,U+3100-312F">
 <!-- Characters allowed: ASCII and Bopomofo -->
 ...
</firmwareStringTable>
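
A QA utility could verify such a restriction along the lines of the Python sketch below. The syntax it parses is an assumption based on the second example only: a comma-separated list of U+XXXX code points, U+XXXX-YYYY ranges, and ? wildcards standing for any hexadecimal digit. The function names are hypothetical.

```python
def parse_charclass(spec):
    """Parse an assumed ld:charclass value into a list of (low, high) ranges."""
    ranges = []
    for token in spec.split(","):
        token = token.strip().upper()
        if token.startswith("U+"):
            token = token[2:]
        if "-" in token:
            low, high = token.split("-", 1)
            if high.startswith("U+"):
                high = high[2:]
        elif "?" in token:
            # Each ? stands for any hex digit: 00?? covers 0000 to 00FF.
            low, high = token.replace("?", "0"), token.replace("?", "F")
        else:
            low = high = token
        ranges.append((int(low, 16), int(high, 16)))
    return ranges


def disallowed_chars(text, ranges):
    """Return the characters of text that fall outside every allowed range."""
    return [ch for ch in text
            if not any(low <= ord(ch) <= high for low, high in ranges)]


allowed = parse_charclass("U+00??,U+3100-312F")
print(disallowed_chars("Power \u3105", allowed))    # Bopomofo letter is allowed
print(disallowed_chars("\u96fb\u6e90 ON", allowed))  # two Kanji are rejected
```

Note that, read literally, the wildcard U+00?? covers U+0000 to U+00FF, which is wider than ASCII.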

3.6. Unique Identifier

It should be possible to attach a unique identifier to any localizable item - be it text, structure or unparsed entity. This identifier, ideally, should be completely unique across all documents.

In order to most effectively re-use translated text where content is re-used (either across update versions or across deliverables) it is necessary to have a totally unique and eternally persistent identifier associated with the element. This identifier allows the translation tool to correctly associate source and translated text units with each other prior to examination for changes, and track an item from one version or location to the next. After one is sure that this is the same item, the content can be examined for changes, and if no change has taken place the potential for re-use of the previous translation is very high.

This approach can be referred to as change analysis. The potential for re-use of translations is very appealing in terms of productivity and cost savings for product launch. Change analysis constitutes an extremely powerful productivity tool for translation when compared to the typical source matching techniques implemented in translation memory systems, which simply look for similar source text in the database without being able to tell whether the context of its use is the same. This change analysis technique has been possible with UI messages in the past, but the introduction of structured XML documents will allow for its use in documents also.

Where text entities will be re-used across products, or where a localizer is dealing with these ids must be totally unique.

Example of possible usage:

<head>
 <meta http-equiv="Content-Type"
       content="text/html; charset=windows-1252"/>
 <meta http-equiv="Content-Language"
       content="en-us"/>
 <title ld:id="51C46563-B626-4da8-9DC1-C9278484C6D6"
 >XLIFF 1.1 Specification</title>
</head>
<body>
 <p id="825CDF9C-6F8E-4251-BBC0-66E0F504B3D3">This document 
presents the official specification for XLIFF 1.1</p>
...
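
The change analysis described above can be sketched in Python as follows. This is a minimal illustration, not a prescribed algorithm: the function name, the dictionary-based data model, the third identifier, and the sample French translations are all hypothetical.

```python
def change_analysis(old_source, new_source, old_translation):
    """Classify the items of new_source by comparing them, id by id,
    with the previous version of the source.

    Returns (reused, changed, new):
      reused  - {id: previous translation} for items whose source is unchanged
      changed - ids present in both versions but with different source text
      new     - ids that did not exist in the previous version
    """
    reused, changed, new = {}, [], []
    for uid, text in new_source.items():
        if uid not in old_source:
            new.append(uid)
        elif old_source[uid] == text and uid in old_translation:
            reused[uid] = old_translation[uid]
        else:
            changed.append(uid)
    return reused, changed, new


TITLE_ID = "51C46563-B626-4da8-9DC1-C9278484C6D6"
PARA_ID = "825CDF9C-6F8E-4251-BBC0-66E0F504B3D3"
NEW_ID = "hypothetical-new-id"

old_source = {
    TITLE_ID: "XLIFF 1.1 Specification",
    PARA_ID: "This document presents the official specification for XLIFF 1.1",
}
old_translation = {
    TITLE_ID: "Spécification XLIFF 1.1",
    PARA_ID: "Ce document présente la spécification officielle de XLIFF 1.1",
}
new_source = {
    TITLE_ID: "XLIFF 1.1 Specification",
    PARA_ID: "This document presents the specification for XLIFF 1.1",
    NEW_ID: "Status of this document",
}

reused, changed, new = change_analysis(old_source, new_source, old_translation)
print(reused, changed, new)
```

Because the comparison is keyed on the identifier rather than on the text, the title is re-used with certainty that it is the same item in the same context, which is the difference from plain translation memory matching.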

3.7. Indication of Container Size

Where fixed sizes are used for containers or objects (such as tables, table cells, frames, buffers, screens, images, etc.) a standard method should be used for indicating the dimensions of the container so that localization tools can automatically recognize them.

As many systems use bytes or other units of measurement, the mechanism should allow units other than characters to be specified, and a clear mechanism should be put in place to let tools know which character encoding to use when verifying these size restrictions.

There is a particular case where the size of a container can be in bytes, but the result depends on which character encoding is used.

Examples of possible usage:

<lcdItem id="powermode_shared" menuType="main">
 <text ld:maxbytes="20">PowerSave mode: <var ref="onoff"/></text>
</lcdItem>

<data ld:width-unit="char">
 <row uid="GOPK56-SPR2002-123">
  <PackageTitle ld:maxwidth="20">Cancun Paradise</PackageTitle>
  <Desc ld:maxwidth="255">Five days all-inclusive in Tulum most 
   luxurious hotel</Desc>
 </row>
</data>
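
The Python sketch below illustrates why the encoding matters when the limit is expressed in bytes. The function name and its unit values ("char", "byte") are assumptions made for this illustration, and the Japanese string is an invented sample translation.

```python
def fits(text, limit, unit="char", encoding="utf-8"):
    """Return True if text fits within limit, measured in the given unit.

    "char" counts Unicode code points; "byte" counts bytes in `encoding`.
    """
    if unit == "char":
        size = len(text)
    elif unit == "byte":
        size = len(text.encode(encoding))
    else:
        raise ValueError(f"unsupported unit: {unit!r}")
    return size <= limit


# A character-based limit, as in the second example.
print(fits("Cancun Paradise", 20))

# A byte-based limit: the same 8-character string has different sizes
# depending on the encoding used by the target system.
label = "省電力モード: "
print(len(label.encode("utf-8")))      # 3 bytes per CJK character
print(len(label.encode("shift_jis")))  # 2 bytes per CJK character
print(fits(label, 16, unit="byte", encoding="utf-8"))
print(fits(label, 16, unit="byte", encoding="shift_jis"))
```

The same string passes or fails a 16-byte limit depending only on the encoding, so a byte limit cannot be verified unless the directive (or its context) identifies the encoding.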

3.8. External References

Any reference to an external text content should be accompanied by information about its source. A standard approach should be used to identify the source so that localization tools can automatically retrieve the information about the source.

For example, quotations of user interface messages in documentation text should be implemented in such a way that it is possible to retrieve the actual text from the UI resource database.

Such markup should be included as part of the primary format; the directive mechanism should be used only if equivalent markup does not already exist in the primary format.

Note: The XML Linking Language [XLink] specifies a standard vocabulary filling these requirements.

Example of possible usage:

<para>In the Desktop application: select the
  <ui xlink:href="DeskApp.po#menu_file">File</ui> menu.</para>

3.9. Segmentation

It should be possible to delimit runs of data within a content that will be treated as segments by the localization tools.

As segmentation is an important factor in how much a modified document can recycle from the translation of its previous version, it is important to permit the possibility to set pre-defined segments in the source document directly. This allows the text to be segmented independently of which translation tool is used.

Examples of possible usage:

<ld:seg id="123">Version published on Tue. June 4th.</ld:seg> Done with the permission of the author.

<resData name="stringtable1">
 <item id="100">Error found:\n<ld:segbreak/>See Log for details.</item>
 <item id="200">Error found:\n<ld:segbreak/>File %1s invalid.</item>
</resData>

3.10. Language of the Content

The main language (or languages of a truly multilingual document) must be declared at the beginning of any document, using industry standard approaches. Such declarations should also apply to any external parsed entities that are stored separately. Any content in another language within a document should be labeled appropriately. In addition, it must be possible to declare a single document as being composed of multilingual parts of equal standing (i.e. the document entity does not represent a single language).

A number of rendering practices will vary according to the locale of the text (i.e. the language and market region). Examples include text expansion, hyphenation, wrapping rules, color usage, fonts, spell checking, line height and inter-line spacing, quotation marks and other punctuation, etc. For the appropriate presentation to be applied automatically to documents in different languages it is essential to know the language of the text.

It would also be useful to indicate the locale of the document as a whole to facilitate both processing and identification of translated documents during localization and content management.

It should also be possible to indicate the language of the XML content for any element or range of text where the language differs from that of the document as a whole. Note that this includes graphics, audio and other unparsed entities, which may need labeling for or treatment specific to a given locale.

Note: The XML namespace defines a standard attribute filling these requirements: xml:lang [XML Lang].

Example of possible usage:

<myDoc xml:lang="en">
 <para>This paragraph is in English.</para>
 <para xml:lang="fr">Ce paragraphe est en français.</para>
</myDoc>
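
Because xml:lang is inherited by child elements, a tool has to resolve the effective language of each piece of text. The Python sketch below shows one way to do so with the standard library; the function name is hypothetical and only element text (not tail text or attribute values) is considered.

```python
import xml.etree.ElementTree as ET

# ElementTree reports xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def effective_languages(xml_text):
    """Return (tag, text, language) for every element that has text,
    applying xml:lang inheritance from parent to child."""
    found = []

    def walk(elem, inherited):
        lang = elem.get(XML_LANG, inherited)
        text = (elem.text or "").strip()
        if text:
            found.append((elem.tag, text, lang))
        for child in elem:
            walk(child, lang)

    walk(ET.fromstring(xml_text), None)
    return found


doc = """<myDoc xml:lang="en">
 <para>This paragraph is in English.</para>
 <para xml:lang="fr">Ce paragraphe est en français.</para>
</myDoc>"""

langs = effective_languages(doc)
for tag, text, lang in langs:
    print(lang, tag, text)
```

The first paragraph inherits "en" from the root element, while the second overrides it with "fr".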

3.11. White Space Handling

It must be possible to specify whether a given element allows white spaces to be collapsed during translation, and the XML markup must appropriately handle spaces for non-Latin scripts (such as Thai, Japanese, Korean, and Chinese).

Knowing whether the white space in a given element is collapsible or not is important for proper matching when using translation memory tools.

Note: The XML namespace defines a standard attribute filling these requirements: xml:space [XML Space].

Example of possible usage:

<resData name="form123" xml:space="preserve">
 <item id="100"> Given Name:</item>
 <item id="101"> Family Name:</item>
 <item id="102">Email Address:</item>
</resData>
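
The effect of xml:space on translation memory matching can be sketched in Python as follows. The function name and the collapsing rule (runs of white space reduced to one space, then trimmed) are assumptions for illustration; real tools may normalize differently, in particular for scripts that do not separate words with spaces.

```python
import re
import xml.etree.ElementTree as ET

# ElementTree reports xml:space under the reserved XML namespace.
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"


def match_keys(xml_text):
    """Return {id: text to use for matching} for every <item> element.

    White space is collapsed unless xml:space="preserve" is in effect,
    taking inheritance from ancestor elements into account.
    """
    keys = {}

    def walk(elem, inherited):
        mode = elem.get(XML_SPACE, inherited)
        if elem.tag == "item":
            text = elem.text or ""
            if mode != "preserve":
                text = re.sub(r"\s+", " ", text).strip()
            keys[elem.get("id")] = text
        for child in elem:
            walk(child, mode)

    walk(ET.fromstring(xml_text), "default")
    return keys


preserved = match_keys(
    '<resData name="form123" xml:space="preserve">'
    '<item id="100">   Given Name:</item></resData>'
)
collapsed = match_keys(
    '<resData name="form123">'
    '<item id="100">   Given Name:</item></resData>'
)
print(repr(preserved["100"]), repr(collapsed["100"]))
```

In the first case the leading spaces are part of the content and must survive translation; in the second they can be ignored when looking up a match.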

3.12. Identification of Changes

It may be useful to have a common mechanism to mark up changes (e.g. deletions and insertions) that have occurred in data between two versions of the same document.

Translation tools can benefit enormously from having a way to identify the areas of XML documents where the translatable text has changed. Such a mechanism can improve the handling of updates from one revision to the next and affect the cost of the localization process.

Examples of possible usage:

<data name="str_error1">
 <value>Cannot find requested host (%1).</value>
</data>
<data name="mnu_file" ld:new="yes">
 <value>&amp;File</value>
</data>
<data name="str_resetnow" ld:change="yes">
 <value>Do you want to reset the connection now?</value>
</data>

<dt>Localization directive<ld:deleted>s</ld:deleted></dt>
<dd>A <ld:deleted>chunk</ld:deleted><ld:inserted>piece</ld:inserted> of metadata providing localization-related information 
in a document instance.</dd>

3.13. Identification of Datatypes

It may be useful to have a standard mechanism to indicate that the content of an element is of a special nature and requires specific processing.

There are many cases, especially in resource-type documents, where the content of an element is in a format that must be processed with a secondary parser. Examples include resources with variables or escaped characters, chunks of HTML code, scripts, etc.

XML clearly allows other XML vocabularies to be embedded using namespaces, and such a mechanism should be encouraged. However, for various reasons, this may not be how the developers of the resources designed their architecture, and it is rather frequent to see escaped XML tags, CDATA sections, or other types of formats mixed in and treated as plain text in an XML document. Providing a way to identify these occurrences can allow localization tools to be more efficient when preparing the file for translation.

Examples of possible usage:

<resData>
 <str id="str123">String in plain text</str>
 <str id="err100" ld:datatype="text/java">The user {0} is disconnected.</str>
</resData>

<resData>
 <str id="123" ld:datatype="text/html"
  >Press Cancel to stop the process.</str>
 <str id="str100" ld:datatype="text/c"
  >Error %d:\nToo many nodes to fit in memory.</str>
 <str id="126" ld:datatype="text/html"
  ><![CDATA[Select Exit to quit the application.]]></str>
</resData>
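
As a rough illustration of what a secondary parser selected by datatype might do, the Python sketch below picks a pattern according to the ld:datatype value and lists the inline codes found in a string. The datatype values come from the examples above; the patterns are deliberately simplified assumptions and do not cover the full Java MessageFormat or C printf syntax.

```python
import re

# Simplified, assumed patterns for two of the datatype values used above.
INLINE_CODES = {
    "text/java": re.compile(r"\{\d+\}"),       # {0}, {1}, ...
    "text/c": re.compile(r"%\d*[sd]|\\n"),     # %d, %1s, and a literal \n
}


def inline_codes(text, datatype=None):
    """Return the inline codes found in text for the given datatype.

    Text without a datatype, or with an unknown one, is treated as plain.
    """
    pattern = INLINE_CODES.get(datatype)
    return pattern.findall(text) if pattern else []


java_codes = inline_codes("The user {0} is disconnected.", "text/java")
c_codes = inline_codes("Error %d:\\nToo many nodes to fit in memory.", "text/c")
plain_codes = inline_codes("String in plain text")
print(java_codes, c_codes, plain_codes)
```

A localization tool could use such a list to protect the codes during translation and to check afterwards that none were lost or altered.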

A. References

[ISO]

International Organization for Standardization Web site.

[ITS Req]

ITS Requirements Working Draft. ITS Group, June 2001.

[LISA]

Localisation Industry Standards Association Web site.

[OASIS]

Organization for the Advancement of Structured Information Standards Web site.

[Unicode]

Unicode Consortium Web site.

[W3C]

World Wide Web Consortium Web site.

[XLink]

XML Linking Language (XLink) Version 1.0, W3C (World Wide Web Consortium), Jun 2001.

[XML Lang]

Language Identification section in Extensible Markup Language (XML) 1.0 Second Edition, W3C (World Wide Web Consortium), Oct 2000.

[XML Space]

White Space Handling section in Extensible Markup Language (XML) 1.0 Second Edition, W3C (World Wide Web Consortium), Oct 2000.
