Subject: Re: [office] Fwd: New Last Call: 'Tags for Identifying Languages' to BCP
From: Michael Brauer <Michael.Brauer@Sun.COM>
To: David Faure <faure@kde.org>
Date: Mon, 13 Dec 2004 12:32:05 +0100

Hi David,

David Faure wrote:
> This might be relevant for us since we use fo:language to specify the language
> of a run of text. Not for switching to it yet, of course, better keep following XSL
> for now, but just in case any of you has input on the IETF draft.
> 
> On this topic, I just noted that our fo:language is validated with [A-Za-z]{1,8} 
> (languageCode definition)
> This basically means it's "an RFC3066 language code" but without country code.

Yes.

> Shouldn't we allow things like fr_CA? (or is that fr-CA ? I'm confused by the RFC
> talking about a hyphen, I thought it was an underscore).

My understanding of RFC3066 is that it uses a hyphen. The type
specifications for "language", "languageCode" and "countryCode" have
been derived directly from RFC3066 and XSL-FO.

RFC3066 specifies a language as

>The syntax of this tag in ABNF [RFC 2234] is:
>
>    Language-Tag = Primary-subtag *( "-" Subtag )
>
>    Primary-subtag = 1*8ALPHA
>
>    Subtag = 1*8(ALPHA / DIGIT)
>
>   The productions ALPHA and DIGIT are imported from RFC 2234; they
>   denote respectively the characters A to Z in upper or lower case and
>   the digits from 0 to 9.  The character "-" is HYPHEN-MINUS (ABNF:
>   %x2D).

This definition is what we use for the type "language".
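
As a minimal sketch of the difference between the two types, the ABNF
above can be transcribed into a regular expression and compared with the
[A-Za-z]{1,8} pattern David quotes for languageCode. The Python names
below (LANGUAGE, LANGUAGE_CODE, matches) are illustrative only and are
not taken from the schema.

```python
import re

# RFC 3066: Language-Tag = Primary-subtag *( "-" Subtag )
LANGUAGE = re.compile(r"[A-Za-z]{1,8}(?:-[A-Za-z0-9]{1,8})*")
# Pattern quoted in this thread for the languageCode type.
LANGUAGE_CODE = re.compile(r"[A-Za-z]{1,8}")


def matches(pattern, value):
    """Return True if the whole string matches the pattern."""
    return pattern.fullmatch(value) is not None


for tag in ("fr", "fr-CA", "fr_CA"):
    print(tag, matches(LANGUAGE, tag), matches(LANGUAGE_CODE, tag))
```

With these patterns, fr-CA is accepted only by the "language" pattern,
and fr_CA is accepted by neither.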

In XSL, the datatype used for the language attribute is summarized as

> A language-specifier in conformance with [RFC3066].

and

> The language may be the language component of any RFC 3066 code (these
> are derived from the ISO 639 language codes).

As I understand it, that is the "Primary-subtag" of RFC 3066, which must
not contain a hyphen (or underscore).

However, my interpretation of XSL-FO may be wrong.
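
Under that reading, a full RFC 3066 tag such as fr-CA would be split at
the first hyphen, with only the first part going into fo:language. A
rough sketch follows; the function names are made up for illustration,
and where the remaining subtags would be stored is left open here.

```python
def primary_subtag(tag):
    """Return the part of an RFC 3066 tag before the first hyphen."""
    return tag.split("-", 1)[0]


def remaining_subtags(tag):
    """Return the subtags that follow the primary subtag, as a list."""
    return tag.split("-")[1:]


print(primary_subtag("fr-CA"))     # fr
print(remaining_subtags("fr-CA"))  # ['CA']
print(primary_subtag("ja"))        # ja
```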

Michael

> 
> ----------  Forwarded Message  ----------
> 
> Subject: New Last Call: 'Tags for Identifying Languages' to BCP
> Date: Thu, 9 Dec 2004 09:56 am
> From: The IESG <iesg-secretary@ietf.org>
> To: IETF-Announce <ietf-announce@ietf.org>
> 
> The IESG has been considering
> 
> - 'Tags for Identifying Languages'
>    <draft-phillips-langtags-08.txt> as a BCP
> 
> There have been considerable changes to the document since the
> initial last call, and the IESG would like the community to consider
> the changes.  In addition, the authors have prepared text describing
> why this mechanism is needed as a replacement for the existing
> procedure; it is included below.
> 
> The IESG plans to make a decision in the next few weeks, and solicits
> final comments on this action.  Please send any comments to the
> iesg@ietf.org or ietf@ietf.org mailing lists by 2005-01-05.
> 
> The file can be obtained via
> http://www.ietf.org/internet-drafts/draft-phillips-langtags-08.txt
> 
> Author's discussion of drivers for this work:
> 
> Reasons for Enhancing RFC 3066
> 
> RFC 3066 and its predecessor, RFC 1766, define language tags for use on the
> Internet. Language tags are necessary for many applications, ranging from
> cataloging content to computer processing of text. The RFC 3066 standard for
> language tags has been widely adopted in various protocols and text formats,
> including HTML, XML, and CLDR, as the best means of identifying languages and
> language preferences.
> 
> This specification proposes enhancements to RFC 3066. Because revisions to
>  RFC 3066 have such broad implications, it is important to
>  understand the reasons for modifying the structure of language tags and the
>  design implications of the proposed replacement.
> 
> Problems
> 
> This specification, the proposed successor to RFC 3066, addresses a number of
> issues that implementers of language tags have faced in recent years:
> 
>     * Stability of the underlying ISO standards
>     * Accessibility of the underlying ISO standards for implementers
>     * Ambiguity of the tags defined by these ISO standards
>     * Difficulty with registrations and their acceptance
>     * Identification of script where necessary
>     * Extensibility
> 
> The stability, accessibility, and ambiguity issues are crucial. Currently,
> because of changes in underlying ISO standards, a valid RFC 3066 language tag
> may become invalid (or have its meaning change) at a later date. With much of
> the world's computing infrastructure dependent on language tags, this is
>  simply unacceptable: it invalidates content that may have an extensive
>  shelf-life. In this specification, once a language tag is valid, it remains
>  valid forever.
> 
> RFC 3066 Language Tags: A brief survey
> 
> Tags defined by RFC 3066 take two forms. Most tags are formed using an ISO
> 639-1 (two-letter) or ISO 639-2 (three-letter) language code, optionally
>  followed by an ISO 3166 country code. Tags formed in this manner are not
>  individually registered and anyone can use such a combination of codes to
>  identify their language preferences or the language of some piece of
>  content. Because this system allows a broad range of tags to be formed by
>  reference to the underlying standards, these tags are referred to as
>  generative in nature. The generative system is very powerful and allows
>  content authors and others to form and use very expressive tags without the
>  need to engage in a long and arduous registration process. Examples of such
>  tags are:
> 
>     * en-US (English as used in the United States)
>     * fr-CA (French as used in Canada)
>     * de-CH (German as used in Switzerland)
>     * ja (Japanese)
>     * ale-CA (Aleut as used in Canada)
>     * ale-BE (Aleut as used in Belgium)
> 
> While it is possible to generate tags that do not identify any likely
> real-world content, such as Aleut as used in Belgium, tags of this nature do
>  not represent a serious problem. Consider the case of a database that can
>  identify people by national origin and by hair color. It is not a problem
>  that one could compose a query for blond Mongolians, even though no results
>  would ever be returned.
> 
> There are problems with the RFC 3066 definition of generative tags,
> however. The ISO 639 and ISO 3166 standards are not freely available and
>  evolve over time. For example, ISO 3166 has withdrawn tags in the past and,
>  worse, then reassigned them to a different country altogether. As a result,
>  it is difficult for implementers to obtain a correct list of codes and then
>  ensure
> interoperability with other implementations of language tags.
> 
> The other way to form an RFC 3066 tag is via registration with IANA. Tags
> registered with IANA identify a specific language, dialect or variation.
>  Unlike the generative tags, the registered values cannot be combined with
>  other standard subtags to form additional tags that are more descriptive.
>  Examples of such tags are:
> 
>     * no-nyn (Nynorsk variation of Norwegian,
>               deprecated: use 'nn' instead)
>     * cel-gaulish
>     * i-klingon (deprecated: use 'tlh' instead)
>     * etc.
> 
> Registration, besides being a long and arduous process, also presents a
>  variety of problems for implementers. Although the tags are freely
>  available, most implementations do not support these tags because they do
>  not fit neatly into the generative system. Special logic is required to
>  handle them, especially when performing language negotiation or fallback. In
>  addition, many of the tags are deprecated because IANA registration has
>  historically been less opaque and time-consuming than registering a
>  language with the ISO 639 MA/RA. Eventually ISO 639 does catch up and assign the
>  language a code, resulting in overlapping tag choices. Implementations must
>  also deal with the implications of multiple valid tags identifying what is
>  essentially the same content.
> 
> But most problematic is the lack of a relationship to the generative
>  mechanism. Since each variation of a tag must be separately registered,
>  language variations with a broad range of valid uses require an enormous
>  number of registrations. For example, there are 8 registrations to deal with
>  minor spelling reforms in the German language and these registrations cover
>  just three countries where German is commonly spoken--and no countries where
>  it is not the major language. Variations in languages with a broader
>  diffusion (such as Chinese) may require 20 or more registrations to gain
>  full coverage, sometimes of important distinctions.
> 
> Solving the Problems
> 
> This specification addresses each of these issues with a simple, elegant
>  design that is compatible with existing language tags and implementations.
> 
> This compatibility exists on several levels. All language tags, both
>  generative and registered, that were valid under RFC 3066 are still valid
>  under this specification. In addition, and very importantly, language tags
>  that are newly defined by this specification are compatible with the ABNF
>  syntax, matching, parsing, and other mechanisms defined by RFC 3066.
> 
> Thus for an implementation of RFC 3066, all of the new tags defined by this
> specification are still in the form of valid registered tags, and will simply
>  be dealt with in whatever fashion the implementation used to handle future
>  registrations, those that were added to the registry after the
>  implementation was created. In other words, tags formed under this
>  specification that are unfamiliar to RFC 3066 implementations will be
>  treated by those implementations as if they were registered tags from a
>  future version of the 3066 registry.
> 
> Subtags and the Registry
> 
> The largest change in the specification is that it modifies the structure of
> the language tag registry. Instead of having to obtain lists of codes from
>  five separate external standards (not all of which are easily available),
>  the IANA registry will maintain a comprehensive list of valid subtags that
>  can be used in the generative mechanism in a machine-parseable text format.
>  This registry will continue to track the existing core standards and will
>  start with the current list of valid codes. As future codes are assigned,
>  the IANA registry will be updated to reflect the changes.
> 
> Having a separate registry allows IANA language tags to resolve ambiguity and
> stability problems with the underlying standards. Language tags formed today
> will be guaranteed to maintain their validity and meaning essentially
>  forever, something that is not true today.
> 
> In addition, switching to a subtag registry changes the nature of
>  registrations themselves. Instead of registering complete tags and therefore
>  potentially having to register a very large number of them (complicating
>  life for implementers and discouraging support for the registry), a single
>  subtag can be generatively combined to form many useful tags.
> 
> For example, one registered tag today is 'zh-Hans', which represents "Chinese
> written in the Simplified Chinese script". Only this tag is valid under RFC
> 3066. Useful tags such as 'zh-Hans-SG' (SG=Singapore) or 'zh-Hans-CN' are not
> valid. By switching to a registry in which 'Hans' is a registered subtag, any
>  of these valid and useful tags can be formed generatively.
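
To illustrate the generative idea in the paragraph above, here is a
small sketch that composes tags from separate subtag lists instead of
registering each complete tag. The lists are a made-up, tiny stand-in
for the registry; 'Hant' (traditional Han script in ISO 15924) is my
addition and does not appear in the announcement.

```python
from itertools import product

# Hypothetical subtag lists; a real registry would be far larger.
LANGUAGES = ["zh"]
SCRIPTS = ["Hans", "Hant"]
REGIONS = ["CN", "SG"]


def generate_tags(languages, scripts, regions):
    """Yield language-script and language-script-region combinations."""
    for lang, script in product(languages, scripts):
        yield f"{lang}-{script}"
        for region in regions:
            yield f"{lang}-{script}-{region}"


tags = list(generate_tags(LANGUAGES, SCRIPTS, REGIONS))
print(tags)
```

Five subtag entries yield six tags here. Under whole-tag registration,
each of those six would need its own registry entry.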
> 
> In addition, the subtag registry will encourage implementers to support
> registered items, since the subtags will fit the generative mechanism and
> exception handling code will no longer be necessary.
> 
> To prevent the IANA language registry filling up with deprecated entries,
>  rules have also been introduced to curb harmful registrations that should be
>  handled by the various ISO maintenance and registration authorities (such as
>  ISO 639).
> 
> The new structure and registry allow implementations to determine much more
> about tags, even in the absence of registry information. This is important
> because at any given point in time there will be a mixture of implementations
> that have different snapshots of the registry. The new structure allows these
> implementations to interoperate effectively. In particular, the category
>  of all subtags (as language, region, script, etc.) can be determined without
>  reference to the particular version of the registry snapshot held by the
> implementation. This allows for much more robust implementations, and greater
> compatibility over time.
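
A rough sketch of what "category without the registry" could look like
in practice. The announcement does not give the rules, so the length
and character tests below are an assumption, taken from the syntax as I
recall it from the later published RFC 4646 (script = 4 letters, region
= 2 letters or 3 digits, variant = 5 to 8 alphanumerics or 4 starting
with a digit); draft -08 may differ. Extended language subtags and
extension singletons are deliberately left out, and classify_subtags is
a made-up name.

```python
def classify_subtags(tag):
    """Label each subtag by position and shape alone, with no registry lookup.

    Simplified: covers language, script, region, variant and private use.
    The shape rules are assumptions; see the note above this block.
    """
    subtags = tag.split("-")
    labels = [(subtags[0], "language")]
    private = False
    for sub in subtags[1:]:
        if private:
            labels.append((sub, "private use"))
        elif sub.lower() == "x":
            private = True
            labels.append((sub, "private-use marker"))
        elif len(sub) == 4 and sub.isalpha():
            labels.append((sub, "script"))
        elif (len(sub) == 2 and sub.isalpha()) or (len(sub) == 3 and sub.isdigit()):
            labels.append((sub, "region"))
        elif sub.isalnum() and (5 <= len(sub) <= 8 or (len(sub) == 4 and sub[0].isdigit())):
            labels.append((sub, "variant"))
        else:
            labels.append((sub, "unknown"))
    return labels


print(classify_subtags("zh-Hans-SG"))
print(classify_subtags("en-US"))
```

The point is only that 'Hans' can be recognized as a script and 'SG' as
a region from their shape, even by an implementation whose registry
snapshot predates them.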
> 
> In addition, this specification also makes it possible, for the first time,
>  to effectively test whether an implementation conforms to the specification.
>  The problem with RFC 3066 is that to determine the status of an
>  implementation produced at a given point, one has to reconstruct the
>  historical contents of each of the ISO standards and the historical contents
>  of the registry. This is a time-consuming and error-prone process. The new
>  registry provides a complete, easily parseable file which provides the
>  precise contents of valid tags for any point in time.
> 
> Additional Subtag Sources
> 
> This specification introduces two additional international standards as
>  sources for language tags.
> 
> ISO 15924 represents script codes. (The example above of 'Hans' is from ISO
> 15924.) Writing system variations are often crucial to communicate,
>  especially when selecting content using language negotiation. Addition of
>  this standard will allow these distinctions to be formed generatively,
>  rather than via individual registration.
> 
> UN M.49 represents region and country codes. The UN M.49 standard is used by
> ISO 3166 to determine what a country is. The UN M.49 codes are used by this
> specification in two ways. First, if ISO 3166 reassigns a country code
>  formerly associated with one country to another country (as it did in 2001
>  with the 'CS' code, formerly Czechoslovakia and now assigned to Serbia and
>  Montenegro), then the UN M.49 code can be placed in the registry to preserve
>  stability. Secondly, the UN M.49 standard defines regional codes for areas
>  such as "Central and South America" which can be useful in forming language
>  tags for larger regions.
> 
> Future-Proofing: Private Use and Extensions
> 
> Because of the widespread use of language tags, it is potentially disruptive
>  to have periodic revisions of the core specification, despite demonstrated
>  need. This specification addresses this problem by fully specifying the
>  valid syntax of language tags, while providing for future, unforeseen,
>  requirements. One of these mechanisms is the extlang subtag, which allows
>  for future extensions of ISO 639, in particular, ISO 639-3.
> 
> Private use subtags are another of these mechanisms. In RFC 3066, any tag
> that was not registered or wholly made up of generative subtags must be
> completely tagged as private use. Recipients of such a tag are not allowed to
> infer any information from such a tag, except by private agreement. Thus if
>  any private-use information needed to be included in the tag, the entire tag
>  had to be private use, making the entire tag uninterpretable to other
>  implementations.
> 
> This specification allows for private use subtags in a particular, prescribed
> manner. Consider the IANA registered tag 'sl-nedis', which represents the
> Natisone dialect of Slovenian. The subtag 'sl' is a valid ISO 639-1 code for
> Slovenian. Prior to its registration with IANA, if users wished to tag
>  content as being in the Natisone dialect, they had two choices for language
>  tags: 'sl' and 'x-sl-nedis' (or similar). The first tag does not meet the
>  need of distinguishing the text from other varieties of Slovenian, while the
>  second one does not convey the relationship to Slovenian to outside
>  processors (a human might look at the tag and infer Slovenian, but the 'sl'
>  subtag doesn't necessarily represent that language).
> 
> Under this specification, if a new dialect of Slovenian were needed (let's
>  call it the 'xyzzy' dialect), a tag such as 'sl-x-xyzzy' can be used. In
>  fact, a quite comprehensive amount of information can be communicated:
> 'sl-Latn-IT-x-xyzzy' would represent Slovenian written using the Latin script
>  as used in Italy with some additional private distinguishing information
>  (which implementations of this specification can match algorithmically).
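
A minimal sketch of the distinction drawn above: with the private-use
part confined to the end of the tag, a processor can separate it from
the publicly interpretable prefix. The function name is hypothetical,
and treating a standalone 'x' subtag as the marker is my reading of the
examples in the announcement.

```python
def split_private_use(tag):
    """Split a tag into (public part, list of private-use subtags)."""
    subtags = tag.split("-")
    lowered = [s.lower() for s in subtags]
    if "x" not in lowered:
        return tag, []
    idx = lowered.index("x")
    return "-".join(subtags[:idx]), subtags[idx + 1:]


print(split_private_use("sl-Latn-IT-x-xyzzy"))  # ('sl-Latn-IT', ['xyzzy'])
print(split_private_use("x-sl-nedis"))          # ('', ['sl', 'nedis'])
```

For the old-style tag 'x-sl-nedis' the public part is empty, which is
the "entire tag uninterpretable" case described earlier.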
> 
> Note that RFC 3066 private use tags are still permitted and have the same
> information content and treatment as they did previously.
> 
> The extension mechanism also provides a way for independent RFCs to define
> extensions to language tags. These extensions have a very constrained,
> well-defined structure to prevent extensions from interfering with
> implementations of this specification (or RFC 3066).
> 
> Matching and Language Negotiation
> 
> Content tagging is only one of the applications for language tags. The other
> major applications are querying for matches and content negotiation.
>  RFC 3066 defines "language ranges" for use in content negotiation and
>  querying and describes a very simple matching algorithm. This specification
>  maintains compatibility with this language negotiation scheme, while
>  providing additional information on the implementation of language matching.
> 
> Well-Formed vs. Validating
> 
> Existing language tag processors already fall into two categories. There are
> language tag processors that check if language tags have the proper,
> well-formed, syntax, but which do not validate their content, and there are
> language tag processors that in addition validate and reject unrecognized
>  tags. Each of these categories is appropriate to different implementations.
>  For example, to process incoming tags that may have been formed under a
>  future registry, an implementation may restrict itself to only checking
> well-formedness. Another implementation that allows users to generate tags
>  may fully validate.
> 
> This specification clearly distinguishes these two possible classes of
> conformance, and provides an explicit, testable definition of each one.
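
The two classes of processor could be sketched roughly as follows. This
is an illustration only: the well-formedness check reuses the RFC 3066
ABNF quoted earlier in this mail, KNOWN_SUBTAGS is a made-up stand-in
for a registry snapshot, and a real validating processor would also
check that each subtag appears in a permitted position.

```python
import re

# RFC 3066 syntax, used here as the well-formedness rule.
WELL_FORMED = re.compile(r"[A-Za-z]{1,8}(?:-[A-Za-z0-9]{1,8})*")

# Hypothetical, tiny registry snapshot (lower-cased subtags).
KNOWN_SUBTAGS = frozenset({"en", "fr", "de", "us", "ca", "ch"})


def is_well_formed(tag):
    """Syntax check only; the content of the subtags is not examined."""
    return WELL_FORMED.fullmatch(tag) is not None


def is_valid(tag, known=KNOWN_SUBTAGS):
    """Well-formed, and every subtag appears in the given snapshot."""
    return is_well_formed(tag) and all(s.lower() in known for s in tag.split("-"))


for tag in ("fr-CA", "zh-Hans-SG", "fr_CA"):
    print(tag, is_well_formed(tag), is_valid(tag))
```

Here 'zh-Hans-SG' is well-formed but fails validation against the small
snapshot, which is the situation of an older implementation receiving a
tag formed under a newer registry.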
> 
> Impact of the New Design on Existing Implementations
> 
> One concern that is crucial to acceptance of the new language tag design is
>  how it works with existing implementations of RFC 3066 and how existing
> implementations will interact with implementations of the newer language
>  tags.
> 
> It is important to recognize that all language tags that were valid under the
> existing RFC 3066 will remain valid, with their meanings intact, under this
> specification. In fact, this specification stabilizes these meanings so that
> existing implementations can be continued forward for as long as is
>  necessary. Content, regardless of its format, will remain valid, essentially
>  forever.
> 
> As content and systems begin to make use of the new language tags by adopting
> the additional fields defined by this specification, there will be an impact
>  on software and systems that expect only the older tags. The design of this
>  specification was carefully created so that all of the new values that can
>  be assigned fit the pattern for registered language tags under RFC 3066.
>  Thus while existing implementations will not recognize the meaning in the
>  tags, they will be able to process them as if they were
>  unrecognized-but-well-formed registered tags.
> 
> In addition, although this specification acknowledges the possibility of
> alternate or advanced matching and negotiation strategies, it maintains the
> existing matching algorithm (by removing subtags from the right side of a
> language tag until a match is obtained), simply providing more detail on
>  usage.
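
The matching strategy mentioned above (removing subtags from the right
until a match is obtained) can be sketched as below. The function name
and the case-insensitive comparison are my assumptions; the paragraph
itself only describes the truncation.

```python
def lookup(requested, available):
    """Return the first available tag found by truncating from the right."""
    available_lower = {t.lower(): t for t in available}
    subtags = requested.lower().split("-")
    while subtags:
        candidate = "-".join(subtags)
        if candidate in available_lower:
            return available_lower[candidate]
        subtags.pop()  # drop the rightmost subtag and try again
    return None


print(lookup("zh-Hans-SG", ["en", "zh-Hans", "zh"]))  # zh-Hans
print(lookup("fr-CA", ["en", "fr"]))                  # fr
print(lookup("ja", ["en", "fr"]))                     # None
```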
> 
> Summary
> 
> The authors of this specification have worked for the past year with a wide
> range of experts in the language tagging community to build consensus on a
> design for language tags that meets the needs and requirements of the user
> community. Language tags form a basic building block for natural language
> support in computer systems and content. The revision proposed in this
> specification addresses the needs of this community of users with a minimal
> impact on existing content and implementations, while providing a stable
>  basis for future development, expansion, and improvement.
> 
> _______________________________________________
> IETF-Announce mailing list
> IETF-Announce@ietf.org
> https://www1.ietf.org/mailman/listinfo/ietf-announce
> 
> -------------------------------------------------------
> 


-- 
Michael Brauer                                Phone:  +49 40 23646 500
Technical Architect Software Engineering      Fax:    +49 40 23646 550
StarOffice Development
Sun Microsystems GmbH
Sachsenfeld 4
D-20097 Hamburg, Germany                e-mail: michael.brauer@sun.com