
Subject: Fwd: New Last Call: 'Tags for Identifying Languages' to BCP
From: David Faure <faure@kde.org>
To: office@lists.oasis-open.org
Date: Thu, 9 Dec 2004 15:37:43 +0100
This might be relevant for us since we use fo:language to specify the language
of a run of text. Not for switching to it yet, of course, better keep following XSL
for now, but just in case any of you has input on the IETF draft.

On this topic, I just noted that our fo:language is validated with [A-Za-z]{1,8} 
(languageCode definition)
This basically means it's "an RFC3066 language code" but without country code.
Shouldn't we allow things like fr_CA? (or is that fr-CA? I'm confused by the RFC
talking about a hyphen, I thought it was an underscore).
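
To make the question concrete, here is a minimal Python sketch that applies the quoted languageCode pattern to a few candidate values. It assumes the validator matches the pattern against the whole attribute value; the helper name is hypothetical and the schema itself is not reproduced here.

```python
import re

# The languageCode pattern quoted above. Assumption: the validator
# requires the pattern to match the entire attribute value.
LANGUAGE_CODE = re.compile(r"[A-Za-z]{1,8}")


def is_valid_language_code(value: str) -> bool:
    """Return True if the whole value matches the quoted pattern."""
    return LANGUAGE_CODE.fullmatch(value) is not None


for candidate in ("fr", "en", "fr-CA", "fr_CA"):
    print(candidate, is_valid_language_code(candidate))
```

Under that assumption, neither the hyphenated nor the underscore form passes, because the pattern admits letters only.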

----------  Forwarded Message  ----------

Subject: New Last Call: 'Tags for Identifying Languages' to BCP
Date: Thu, 9 Dec 2004 09:56 am
From: The IESG <iesg-secretary@ietf.org>
To: IETF-Announce <ietf-announce@ietf.org>

The IESG has been considering

- 'Tags for Identifying Languages'
   <draft-phillips-langtags-08.txt> as a BCP

There have been considerable changes to the document since the
initial last call, and the IESG would like the community to consider
the changes.  In addition, the authors have prepared text describing
why this mechanism is needed as a replacement for the existing
procedure; it is included below.

The IESG plans to make a decision in the next few weeks, and solicits
final comments on this action.  Please send any comments to the
iesg@ietf.org or ietf@ietf.org mailing lists by 2005-01-05.

The file can be obtained via
http://www.ietf.org/internet-drafts/draft-phillips-langtags-08.txt

Author's discussion of drivers for this work:

Reasons for Enhancing RFC 3066

RFC 3066 and its predecessor, RFC 1766, define language tags for use on the
Internet. Language tags are necessary for many applications, ranging from
cataloging content to computer processing of text. The RFC 3066 standard for
language tags has been widely adopted in various protocols and text formats,
including HTML, XML, and CLDR, as the best means of identifying languages and
language preferences.

This specification proposes enhancements to RFC 3066. Because revisions to
 RFC 3066 have such broad implications, it is important to
 understand the reasons for modifying the structure of language tags and the
 design implications of the proposed replacement.

Problems

This specification, the proposed successor to RFC 3066, addresses a number of
issues that implementers of language tags have faced in recent years:

    * Stability of the underlying ISO standards
    * Accessibility of the underlying ISO standards for implementers
    * Ambiguity of the tags defined by these ISO standards
    * Difficulty with registrations and their acceptance
    * Identification of script where necessary
    * Extensibility

The stability, accessibility, and ambiguity issues are crucial. Currently,
because of changes in underlying ISO standards, a valid RFC 3066 language tag
may become invalid (or have its meaning change) at a later date. With much of
the world's computing infrastructure dependent on language tags, this is
 simply unacceptable: it invalidates content that may have an extensive
 shelf-life. In this specification, once a language tag is valid, it remains
 valid forever.

RFC 3066 Language Tags: A brief survey

Tags defined by RFC 3066 take two forms. Most tags are formed using an ISO
639-1 (two-letter) or ISO 639-2 (three-letter) language code, optionally
 followed by an ISO 3166 country code. Tags formed in this manner are not
 individually registered and anyone can use such a combination of codes to
 identify their language preferences or the language of some piece of
 content. Because this system allows a broad range of tags to be formed by
 reference to the underlying standards, these tags are referred to as
 generative in nature. The generative system is very powerful and allows
 content authors and others to form and use very expressive tags without the
 need to engage in a long and arduous registration process. Examples of such
 tags are:

    * en-US (English as used in the United States)
    * fr-CA (French as used in Canada)
    * de-CH (German as used in Switzerland)
    * ja (Japanese)
    * ale-CA (Aleut as used in Canada)
    * ale-BE (Aleut as used in Belgium)

While it is possible to generate tags that do not identify any likely
real-world content, such as Aleut as used in Belgium, tags of this nature do
 not represent a serious problem. Consider the case of a database that can
 identify people by national origin and by hair color. It is not a problem
 that one could compose a query for blond Mongolians, even though no results
 would ever be returned.

There are problems with the RFC 3066 definition of generative tags,
however. The ISO 639 and ISO 3166 standards are not freely available and
 evolve over time. For example, ISO 3166 has withdrawn tags in the past and,
 worse, then reassigned them to a different country altogether. As a result,
 it is difficult for implementers to obtain a correct list of codes and then
 ensure interoperability with other implementations of language tags.

The other way to form an RFC 3066 tag is via registration with IANA. Tags
registered with IANA identify a specific language, dialect or variation.
 Unlike the generative tags, the registered values cannot be combined with
 other standard subtags to form additional tags that are more descriptive.
 Examples of such tags are:

    * no-nyn (Nynorsk variation of Norwegian,
              deprecated: use 'nn' instead)
    * cel-gaulish
    * i-klingon (deprecated: use 'tlh' instead)
    * etc.

Registration, besides being a long and arduous process, also presents a
 variety of problems for implementers. Although the tags are freely
 available, most implementations do not support these tags because they do
 not fit neatly into the generative system. Special logic is required to
 handle them, especially when performing language negotiation or fallback. In
 addition, many of the tags are now deprecated: because IANA registration has
 historically been less opaque and time-consuming than registering a language
 with the ISO 639 MA/RA, tags were often registered with IANA first.
 Eventually ISO 639 does catch up and assign the
 language a code, resulting in overlapping tag choices. Implementations must
 also deal with the implications of multiple valid tags identifying what is
 essentially the same content.

But most problematic is the lack of a relationship to the generative
 mechanism. Since each variation of a tag must be separately registered,
 language variations with a broad range of valid uses require an enormous
 number of registrations. For example, there are 8 registrations to deal with
 minor spelling reforms in the German language and these registrations cover
 just three countries where German is commonly spoken--and no countries where
 it is not the major language. Variations in languages with a broader
 diffusion (such as Chinese) may require 20 or more registrations to gain
 full coverage, sometimes of important distinctions.

Solving the Problems

This specification addresses each of these issues with a simple, elegant
 design that is compatible with existing language tags and implementations.

This compatibility exists on several levels. All language tags, both
 generative and registered, that were valid under RFC 3066 are still valid
 under this specification. In addition, and very importantly, language tags
 that are newly defined by this specification are compatible with the ABNF
 syntax, matching, parsing, and other mechanisms defined by RFC 3066.

Thus for an implementation of RFC 3066, all of the new tags defined by this
specification are still in the form of valid registered tags, and will simply
 be dealt with in whatever fashion the implementation used to handle future
 registrations, those that were added to the registry after the
 implementation was created. In other words, tags formed under this
 specification that are unfamiliar to RFC 3066 implementations will be
 treated by those implementations as if they were registered tags from a
 future version of the 3066 registry.

Subtags and the Registry

The largest change in the specification is that it modifies the structure of
the language tag registry. Instead of having to obtain lists of codes from
 five separate external standards (not all of which are easily available),
 the IANA registry will maintain a comprehensive list of valid subtags that
 can be used in the generative mechanism in a machine-parseable text format.
 This registry will continue to track the existing core standards and will
 start with the current list of valid codes. As future codes are assigned,
 the IANA registry will be updated to reflect the changes.

Having a separate registry allows IANA language tags to resolve ambiguity and
stability problems with the underlying standards. Language tags formed today
will be guaranteed to maintain their validity and meaning essentially
 forever, something that is not true today.

In addition, switching to a subtag registry changes the nature of
 registrations themselves. Instead of registering complete tags and therefore
 potentially having to register a very large number of them (complicating
 life for implementers and discouraging support for the registry), a single
 subtag can be generatively combined to form many useful tags.

For example, one registered tag today is 'zh-Hans', which represents "Chinese
written in the Simplified Chinese script". Only this tag is valid under RFC
3066. Useful tags such as 'zh-Hans-SG' (SG=Singapore) or 'zh-Hans-CN' are not
valid. By switching to a registry in which 'Hans' is a registered subtag, any
 of these valid and useful tags can be formed generatively.
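
A minimal Python sketch of the difference between registering whole tags and registering subtags follows. The two registries are tiny hypothetical excerpts made up for illustration, and the function name is an assumption, not part of the draft.

```python
from itertools import product

# Hypothetical excerpts, for illustration only.
WHOLE_TAG_REGISTRY = {"zh-Hans"}  # RFC 3066 style: complete tags are registered

SUBTAG_REGISTRY = {  # proposed style: individual subtags are registered
    "language": {"zh"},
    "script": {"Hans"},
    "region": {"CN", "SG"},
}


def generate_tags(registry):
    """Yield every language[-script][-region] combination of the subtags."""
    scripts = [None, *sorted(registry["script"])]
    regions = [None, *sorted(registry["region"])]
    languages = sorted(registry["language"])
    for lang, script, region in product(languages, scripts, regions):
        yield "-".join(part for part in (lang, script, region) if part)


generated = set(generate_tags(SUBTAG_REGISTRY))
print("zh-Hans-SG" in WHOLE_TAG_REGISTRY)  # only the exact registered tag is known
print("zh-Hans-SG" in generated)           # one 'Hans' entry covers every region
```

In this sketch a single 'Hans' entry yields 'zh-Hans', 'zh-Hans-CN', and 'zh-Hans-SG' without any further registration.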

In addition, the subtag registry will encourage implementers to support
registered items, since the subtags will fit the generative mechanism and
exception handling code will no longer be necessary.

To prevent the IANA language registry filling up with deprecated entries,
 rules have also been introduced to curb harmful registrations that should be
 handled by the various ISO maintenance and registration authorities (such as
 ISO 639).

The new structure and registry allow implementations to determine much more
about tags, even in the absence of registry information. This is important
because at any given point in time there will be a mixture of implementations
that have different snapshots of the registry. The new structure allows these
implementations to interoperate effectively. In particular, the category
 of all subtags (as language, region, script, etc.) can be determined without
 reference to the particular version of the registry snapshot held by the
implementation. This allows for much more robust implementations, and greater
compatibility over time.
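
As a rough illustration of classifying subtags without a registry, the Python sketch below labels each subtag from its position and shape alone. The shape rules used here (a leading language subtag, three-letter extlang, four-letter script, two-letter or three-digit region, longer variants, single-character singletons) are an assumption based on the draft's syntax and are not spelled out in this message; grandfathered tags such as 'i-klingon' are not handled. The function name is hypothetical.

```python
import re


def classify_subtags(tag: str):
    """Label each subtag by position and shape only (no registry lookup)."""
    labels = []
    mode = None            # "extension" or "privateuse" once a singleton is seen
    past_extlang = False   # True once a script, region, or variant has appeared
    for position, sub in enumerate(tag.split("-")):
        if mode == "privateuse":
            kind = "privateuse"
        elif re.fullmatch(r"[A-Za-z0-9]", sub):
            mode = "privateuse" if sub.lower() == "x" else "extension"
            kind = "singleton"
        elif mode == "extension":
            kind = "extension"
        elif position == 0:
            kind = "language"
        elif re.fullmatch(r"[A-Za-z]{3}", sub) and not past_extlang:
            kind = "extlang"
        elif re.fullmatch(r"[A-Za-z]{4}", sub):
            kind = "script"
        elif re.fullmatch(r"[A-Za-z]{2}|[0-9]{3}", sub):
            kind = "region"
        elif re.fullmatch(r"[A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3}", sub):
            kind = "variant"
        else:
            kind = "unknown"
        if kind in ("script", "region", "variant"):
            past_extlang = True
        labels.append((sub, kind))
    return labels


print(classify_subtags("zh-Hans-SG"))
```

An implementation holding an older registry snapshot could still tell that an unfamiliar four-letter subtag is a script, even if it does not know which one.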

In addition, this specification also makes it possible, for the first time,
 to effectively test whether an implementation conforms to the specification.
 The problem with RFC 3066 is that to determine the status of an
 implementation produced at a given point, one has to reconstruct the
 historical contents of each of the ISO standards and the historical contents
 of the registry. This is a time-consuming and error-prone process. The new
 registry provides a complete, easily parseable file that gives the
 precise contents of valid tags for any point in time.

Additional Subtag Sources

This specification introduces two additional international standards as
 sources for language tags.

ISO 15924 represents script codes. (The example above of 'Hans' is from ISO
15924.) Writing system variations are often crucial to communicate,
 especially when selecting content using language negotiation. Addition of
 this standard will allow these distinctions to be formed generatively,
 rather than via individual registration.

UN M.49 represents region and country codes. The UN M.49 standard is used by
ISO 3166 to determine what a country is. The UN M.49 codes are used by this
specification in two ways. First, if ISO 3166 reassigns a country code
 formerly associated with one country to another country (as it did in 2001
 with the 'CS' code, formerly Czechoslovakia and now assigned to Serbia and
 Montenegro), then the UN M.49 code can be placed in the registry to preserve
 stability. Secondly, the UN M.49 standard defines regional codes for areas
 such as "Central and South America" which can be useful in forming language
 tags for larger regions.

Future-Proofing: Private Use and Extensions

Because of the widespread use of language tags, it is potentially disruptive
 to have periodic revisions of the core specification, despite demonstrated
 need. This specification addresses this problem by fully specifying the
 valid syntax of language tags, while providing for future, unforeseen,
 requirements. One of these mechanisms is the extlang subtag, which allows
 for future extensions of ISO 639, in particular ISO 639-3.

Private use subtags are another of these mechanisms. In RFC 3066, any tag
that was not registered or wholly made up of generative subtags must be
completely tagged as private use. Recipients of such a tag are not allowed to
infer any information from such a tag, except by private agreement. Thus if
 any private-use information needed to be included in the tag, the entire tag
 had to be private use, making the entire tag uninterpretable to other
 implementations.

This specification allows for private use subtags in a particular, prescribed
manner. Consider the IANA registered tag 'sl-nedis', which represents the
Natisone dialect of Slovenian. The subtag 'sl' is a valid ISO 639-1 code for
Slovenian. Prior to its registration with IANA, if users wished to tag
 content as being in the Natisone dialect, they had two choices for language
 tags: 'sl' and 'x-sl-nedis' (or similar). The first tag does not meet the
 need of distinguishing the text from other varieties of Slovenian, while the
 second one does not convey the relationship to Slovenian to outside
 processors (a human might look at the tag and infer Slovenian, but the 'sl'
 subtag doesn't necessarily represent that language).

Under this specification, if a new dialect of Slovenian were needed (let's
 call it the 'xyzzy' dialect), a tag such as 'sl-x-xyzzy' can be used. In
 fact, a quite comprehensive amount of information can be communicated:
'sl-Latn-IT-x-xyzzy' would represent Slovenian written using the Latin script
 as used in Italy with some additional private distinguishing information
 (which implementations of this specification can match algorithmically).
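
The practical effect can be sketched in a few lines of Python. The helper below is hypothetical: it assumes that the singleton 'x' introduces the private-use portion, and it separates the subtags a general-purpose processor may interpret from those that require private agreement.

```python
def split_private_use(tag: str):
    """Split a tag into (interpretable subtags, private-use subtags)."""
    subtags = tag.split("-")
    lowered = [sub.lower() for sub in subtags]
    if "x" in lowered:
        marker = lowered.index("x")
        return subtags[:marker], subtags[marker + 1:]
    return subtags, []


for example in ("sl", "x-sl-nedis", "sl-x-xyzzy", "sl-Latn-IT-x-xyzzy"):
    print(example, split_private_use(example))
```

With the old-style 'x-sl-nedis' nothing is left for an outside processor to interpret, whereas 'sl-Latn-IT-x-xyzzy' still exposes the language, script, and region.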

Note that RFC 3066 private use tags are still permitted and have the same
information content and treatment as they did previously.

The extension mechanism also provides a way for independent RFCs to define
extensions to language tags. These extensions have a very constrained,
well-defined structure to prevent extensions from interfering with
implementations of this specification (or RFC 3066).

Matching and Language Negotiation

Content tagging is only one of the applications for language tags. The other
major applications are querying for matches and content negotiation.
 RFC 3066 defines "language ranges" for use in content negotiation and
 querying and describes a very simple matching algorithm. This specification
 maintains compatibility with this language negotiation scheme, while
 providing additional information on the implementation of language matching.

Well-Formed vs. Validating

Existing language tag processors already fall into two categories. There are
language tag processors that check if language tags have the proper,
well-formed, syntax, but which do not validate their content, and there are
language tag processors that in addition validate and reject unrecognized
 tags. Each of these categories is appropriate to different implementations.
 For example, to process incoming tags that may have been formed under a
 future registry, an implementation may restrict itself to only checking
well-formedness. Another implementation that allows users to generate tags
 may fully validate.
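
The two categories can be sketched in Python as follows. The well-formedness pattern follows the RFC 3066 syntax (a primary subtag of one to eight letters followed by subtags of one to eight letters or digits). The set of known subtags is a hypothetical registry snapshot, and the validity check is deliberately simplified: it ignores subtag order and category, which a real validating processor would also check.

```python
import re

# RFC 3066 syntax: 1*8ALPHA followed by any number of "-" 1*8(ALPHA / DIGIT).
WELL_FORMED = re.compile(r"[A-Za-z]{1,8}(-[A-Za-z0-9]{1,8})*")

# Hypothetical registry snapshot, lowercased for case-insensitive lookup.
KNOWN_SUBTAGS = {"en", "sl", "zh", "latn", "hans", "cn", "it", "us"}


def is_well_formed(tag: str) -> bool:
    """Syntax check only; accepts subtags this implementation has never seen."""
    return WELL_FORMED.fullmatch(tag) is not None


def is_valid(tag: str) -> bool:
    """Syntax check plus lookup of each subtag before any private-use part."""
    if not is_well_formed(tag):
        return False
    for sub in tag.lower().split("-"):
        if sub == "x":  # everything after 'x' is private use
            break
        if sub not in KNOWN_SUBTAGS:
            return False
    return True


for example in ("sl-Latn-IT", "tlh-Latn", "sl--IT"):
    print(example, is_well_formed(example), is_valid(example))
```

A processor of incoming content might stop at `is_well_formed`, while a tool that lets users compose tags might insist on `is_valid`.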

This specification clearly distinguishes these two possible classes of
conformance, and provides an explicit, testable definition of each one.

Impact of the New Design on Existing Implementations

One concern that is crucial to acceptance of the new language tag design is
 how it works with existing implementations of RFC 3066 and how existing
implementations will interact with implementations of the newer language
 tags.

It is important to recognize that all language tags that were valid under the
existing RFC 3066 will remain valid, with their meanings intact, under this
specification. In fact, this specification stabilizes these meanings so that
existing implementations can be carried forward for as long as is
 necessary. Content, regardless of its format, will remain valid, essentially
 forever.

As content and systems begin to make use of the new language tags by adopting
the additional fields defined by this specification, there will be an impact
 on software and systems that expect only the older tags. The design of this
 specification was carefully created so that all of the new values that can
 be assigned fit the pattern for registered language tags under RFC 3066.
 Thus while existing implementations will not recognize the meaning in the
 tags, they will be able to process them as if they were
 unrecognized-but-well-formed registered tags.

In addition, although this specification acknowledges the possibility of
alternate or advanced matching and negotiation strategies, it maintains the
existing matching algorithm (by removing subtags from the right side of a
language tag until a match is obtained), simply providing more detail on
 usage.
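
The matching described here, removing subtags from the right until a match is found, can be sketched in Python as below. The function name and the list of available tags are assumptions for illustration, and comparison is done case-insensitively.

```python
def lookup(tag: str, available):
    """Return the closest available tag by truncating from the right, or None."""
    by_lower = {candidate.lower(): candidate for candidate in available}
    subtags = tag.lower().split("-")
    while subtags:
        candidate = "-".join(subtags)
        if candidate in by_lower:
            return by_lower[candidate]
        subtags.pop()  # drop the rightmost subtag and try again
    return None


available = ["en", "fr", "zh", "zh-Hans"]
print(lookup("zh-Hans-SG", available))
print(lookup("fr-CA", available))
print(lookup("de-CH", available))
```

Because the new subtags are appended in a fixed order, the same truncation works for tags such as 'zh-Hans-SG' that an older implementation has never seen.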

Summary

The authors of this specification have worked for the past year with a wide
range of experts in the language tagging community to build consensus on a
design for language tags that meets the needs and requirements of the user
community. Language tags form a basic building block for natural language
support in computer systems and content. The revision proposed in this
specification addresses the needs of this community of users with a minimal
impact on existing content and implementations, while providing a stable
 basis for future development, expansion, and improvement.

_______________________________________________
IETF-Announce mailing list
IETF-Announce@ietf.org
https://www1.ietf.org/mailman/listinfo/ietf-announce

-------------------------------------------------------

-- 
David Faure, faure@kde.org, sponsored by Trolltech to work on KDE,
Konqueror (http://www.konqueror.org), and KOffice (http://www.koffice.org).
Follow-Ups:
- Re: [office] Fwd: New Last Call: 'Tags for Identifying Languages' to BCP
  - From: Michael Brauer <Michael.Brauer@Sun.COM>