Subject: Re: [office] Fwd: New Last Call: 'Tags for Identifying Languages' to BCP
Hi David,

David Faure wrote:
> This might be relevant for us since we use fo:language to specify the language
> of a run of text. Not for switching to it yet, of course, better keep following XSL
> for now, but just in case any of you has input on the IETF draft.
>
> On this topic, I just noted that our fo:language is validated with [A-Za-z]{1,8}
> (languageCode definition)
> This basically means it's "an RFC3066 language code" but without country code.

Yes.

> Shouldn't we allow things like fr_CA? (or is that fr-CA ? I'm confused by the RFC
> talking about a hyphen, I thought it was an underscore).

My understanding of RFC 3066 is that it uses a hyphen.

The type specifications for "language", "languageCode" and "countryCode" have been derived directly from RFC 3066 and XSL-FO. RFC 3066 specifies a language as

> The syntax of this tag in ABNF [RFC 2234] is:
>
>    Language-Tag = Primary-subtag *( "-" Subtag )
>
>    Primary-subtag = 1*8ALPHA
>
>    Subtag = 1*8(ALPHA / DIGIT)
>
> The productions ALPHA and DIGIT are imported from RFC 2234; they
> denote respectively the characters A to Z in upper or lower case and
> the digits from 0 to 9. The character "-" is HYPHEN-MINUS (ABNF:
> %x2D).

This definition is what we use for the type "language".

In XSL, the datatype used for the language attribute is summarized as

> A language-specifier in conformance with [RFC3066].

and

> The language may be the language component of any RFC 3066 code (these
> are derived from the ISO 639 language codes).

As I understand it, that is the "primary-subtag" of RFC 3066, which must not contain a hyphen (or underscore). However, my interpretation of XSL-FO may be wrong.
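
To make the difference between the two types concrete, here is a minimal sketch in Python. It assumes the "language" type follows the RFC 3066 ABNF quoted above and the "languageCode" type is the [A-Za-z]{1,8} pattern David mentioned. The function names are made up for illustration and are not part of any schema or library.

```python
import re

# Language-Tag = Primary-subtag *( "-" Subtag ), as in the quoted ABNF:
# 1 to 8 letters, then zero or more "-" separated subtags of 1 to 8
# letters or digits.
LANGUAGE_TAG = re.compile(r"[A-Za-z]{1,8}(?:-[A-Za-z0-9]{1,8})*")

# The languageCode pattern from the thread: the primary subtag only.
LANGUAGE_CODE = re.compile(r"[A-Za-z]{1,8}")


def is_language_tag(value: str) -> bool:
    """True if value is syntactically a full RFC 3066 language tag."""
    return LANGUAGE_TAG.fullmatch(value) is not None


def is_language_code(value: str) -> bool:
    """True if value is a bare primary subtag (no hyphen, no country)."""
    return LANGUAGE_CODE.fullmatch(value) is not None


if __name__ == "__main__":
    for candidate in ("fr", "fr-CA", "fr_CA"):
        print(candidate, is_language_tag(candidate), is_language_code(candidate))
```

Under these assumptions "fr-CA" is a valid "language" but not a valid "languageCode", and "fr_CA" is neither, because the underscore is not allowed by the ABNF. This checks syntax only; it does not check that the subtags are assigned codes.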
Michael

>
> ---------- Forwarded Message ----------
>
> Subject: New Last Call: 'Tags for Identifying Languages' to BCP
> Date: Thu, 9 Dec 2004 09:56 am
> From: The IESG <iesg-secretary@ietf.org>
> To: IETF-Announce <ietf-announce@ietf.org>
>
> The IESG has been considering
>
> - 'Tags for Identifying Languages'
> <draft-phillips-langtags-08.txt> as a BCP
>
> There have been considerable changes to the document since the
> initial last call, and the IESG would like the community to consider
> the changes. In addition, the authors have prepared text describing
> why this mechanism is needed as a replacement for the existing
> procedure; it is included below.
>
> The IESG plans to make a decision in the next few weeks, and solicits
> final comments on this action. Please send any comments to the
> iesg@ietf.org or ietf@ietf.org mailing lists by 2005-01-05.
>
> The file can be obtained via
> http://www.ietf.org/internet-drafts/draft-phillips-langtags-08.txt
>
> Author's discussion of drivers for this work:
>
> Reasons for Enhancing RFC 3066
>
> RFC 3066 and its predecessor, RFC 1766, define language tags for use on the
> Internet. Language tags are necessary for many applications, ranging from
> cataloging content to computer processing of text. The RFC 3066 standard for
> language tags has been widely adopted in various protocols and text formats,
> including HTML, XML, and CLDR, as the best means of identifying languages and
> language preferences.
>
> This specification proposes enhancements to RFC 3066. Because revisions to
> RFC 3066 therefore have such broad implications, it is important to
> understand the reasons for modifying the structure of language tags and the
> design implications of the proposed replacement.
>
> Problems
>
> This specification, the proposed successor to RFC 3066, addresses a number of
> issues that implementers of language tags have faced in recent years:
>
> * Stability of the underlying ISO standards
> * Accessibility of the underlying ISO standards for implementers
> * Ambiguity of the tags defined by these ISO standards
> * Difficulty with registrations and their acceptance
> * Identification of script where necessary
> * Extensibility
>
> The stability, accessibility, and ambiguity issues are crucial. Currently,
> because of changes in underlying ISO standards, a valid RFC 3066 language tag
> may become invalid (or have its meaning change) at a later date. With much of
> the world's computing infrastructure dependent on language tags, this is
> simply unacceptable: it invalidates content that may have an extensive
> shelf-life. In this specification, once a language tag is valid, it remains
> valid forever.
>
> RFC 3066 Language Tags: A brief survey
>
> Tags defined by RFC 3066 take two forms. Most tags are formed using an ISO
> 639-1 (two-letter) or ISO 639-2 (three letter) language tag, optionally
> followed by an ISO 3166 country code. Tags formed in this manner are not
> individually registered and anyone can use such a combination of codes to
> identify their language preferences or the language of some piece of
> content. Because this system allows a broad range of tags to be formed by
> reference to the underlying standards, these tags are referred to as
> generative in nature. The generative system is very powerful and allows
> content authors and others to form and use very expressive tags without the
> need to engage in a long and arduous registration process.
> Examples of such tags are:
>
> * en-US (English as used in the United States)
> * fr-CA (French as used in Canada)
> * de-CH (German as used in Switzerland)
> * ja (Japanese)
> * ale-CA (Aleut as used in Canada)
> * ale-BE (Aleut as used in Belgium)
>
> While it is possible to generate tags that do not identify any likely
> real-world content, such as Aleut as used in Belgium, tags of this nature do
> not represent a serious problem. Consider the case of a database that can
> identify people by national origin and by hair color. It is not a problem
> that one could compose a query for blond Mongolians, even though no results
> would ever be returned.
>
> There are problems with the RFC 3066 definition of generative tags,
> however. The ISO 639 and ISO 3166 standards are not freely available and
> evolve over time. For example, ISO 3166 has withdrawn tags in the past and,
> worse, then reassigned them to a different country altogether. As a result,
> it is difficult for implementers to obtain a correct list of codes and then
> ensure interoperability with other implementations of language tags.
>
> The other way to form an RFC 3066 tag is via registration with IANA. Tags
> registered with IANA identify a specific language, dialect or variation.
> Unlike the generative tags, the registered values cannot be combined with
> other standard subtags to form additional tags that are more descriptive.
> Examples of such tags are:
>
> * no-nyn (Nynorsk variation of Norwegian,
>   deprecated: use 'nn' instead)
> * cel-gaulish
> * i-klingon (deprecated: use 'tlh' instead)
> * etc.
>
> Registration, besides being a long and arduous process, also presents a
> variety of problems for implementers. Although the tags are freely
> available, most implementations do not support these tags because they do
> not fit neatly into the generative system. Special logic is required to
> handle them, especially when performing language negotiation or fallback.
> In addition, many of the tags are deprecated because the registration process
> is less opaque and time-consuming than registering a language with ISO 639
> MA/RA has historically been. Eventually ISO 639 does catch up and assign the
> language a code, resulting in overlapping tag choices. Implementations must
> also deal with the implications of multiple valid tags identifying what is
> essentially the same content.
>
> But most problematic is the lack of a relationship to the generative
> mechanism. Since each variation of a tag must be separately registered,
> language variations with a broad range of valid uses require an enormous
> number of registrations. For example, there are 8 registrations to deal with
> minor spelling reforms in the German language and these registrations cover
> just three countries where German is commonly spoken--and no countries where
> it is not the major language. Variations in languages with a broader
> diffusion (such as Chinese) may require 20 or more registrations to gain
> full coverage, sometimes of important distinctions.
>
> Solving the Problems
>
> This specification addresses each of these issues with a simple, elegant
> design that is compatible with existing language tags and implementations.
>
> This compatibility exists on several levels. All language tags, both
> generative and registered, that were valid under RFC 3066 are still valid
> under this specification. In addition, and very importantly, language tags
> that are newly defined by this specification are compatible with the ABNF
> syntax, matching, parsing, and other mechanisms defined by RFC 3066.
>
> Thus for an implementation of RFC 3066, all of the new tags defined by this
> specification are still in the form of valid registered tags, and will simply
> be dealt with in whatever fashion the implementation used to handle future
> registrations, those that were added to the registry after the
> implementation was created.
> In other words, tags formed under this
> specification that are unfamiliar to RFC 3066 implementations will be
> treated by those implementations as if they were registered tags from a
> future version of the 3066 registry.
>
> Subtags and the Registry
>
> The largest change in the specification is that it modifies the structure of
> the language tag registry. Instead of having to obtain lists of codes from
> five separate external standards (not all of which are easily available),
> the IANA registry will maintain a comprehensive list of valid subtags that
> can be used in the generative mechanism in a machine-parseable text format.
> This registry will continue to track the existing core standards and will
> start with the current list of valid codes. As future codes are assigned,
> the IANA registry will be updated to reflect the changes.
>
> Having a separate registry allows IANA language tags to resolve ambiguity and
> stability problems with the underlying standards. Language tags formed today
> will be guaranteed to maintain their validity and meaning essentially
> forever, something that is not true today.
>
> In addition, switching to a subtag registry changes the nature of
> registrations themselves. Instead of registering complete tags and therefore
> potentially having to register a very large number of them (complicating
> life for implementers and discouraging support for the registry), a single
> subtag can be generatively combined to form many useful tags.
>
> For example, one registered tag today is 'zh-Hans', which represents "Chinese
> written in the Simplified Chinese script". Only this tag is valid under RFC
> 3066. Useful tags such as 'zh-Hans-SG' (SG=Singapore) or 'zh-Hans-CN' are not
> valid. By switching to a registry in which 'Hans' is a registered subtag, any
> of these valid and useful tags can be formed generatively.
>
> In addition, the subtag registry will encourage implementers to support
> registered items, since the subtags will fit the generative mechanism and
> exception handling code will no longer be necessary.
>
> To prevent the IANA language registry filling up with deprecated entries,
> rules have also been introduced to curb harmful registrations that should be
> handled by the various ISO maintenance and registration authorities (such as
> ISO 639).
>
> The new structure and registry allow implementations to determine much more
> about tags, even in the absence of registry information. This is important
> because at any given point in time there will be a mixture of implementations
> that have different snapshots of the registry. The new structure allows these
> implementations to interoperate effectively. In particular, the category
> of all subtags (as language, region, script, etc.) can be determined without
> reference to the particular version of the registry snapshot by the
> implementation. This allows for much more robust implementations, and greater
> compatibility over time.
>
> In addition, this specification also makes it possible, for the first time,
> to effectively test whether an implementation conforms to the specification.
> The problem with RFC 3066 is that to determine the status of an
> implementation produced at a given point, one has to reconstruct the
> historical contents of each of the ISO standards and the historical contents
> of the registry. This is a time-consuming and error-prone process. The new
> registry provides a complete, easily parseable file which provides the
> precise contents of valid tags for any point in time.
>
> Additional Subtag Sources
>
> This specification introduces two additional international standards as
> sources for language tags.
>
> ISO 15924 represents script codes. (The example above of 'Hans' is from ISO
> 15924.)
> Writing system variations are often crucial to communicate,
> especially when selecting content using language negotiation. Addition of
> this standard will allow these distinctions to be formed generatively,
> rather than via individual registration.
>
> UN M.49 represents region and country codes. The UN M.49 standard is used by
> ISO 3166 to determine what a country is. The UN M.49 codes are used by this
> specification in two ways. First, if ISO 3166 reassigns a country code
> formerly associated with one country to another country (as it did in 2001
> with the 'CS' code, formerly Czechoslovakia and now assigned to Serbia and
> Montenegro), then the UN M.49 code can be placed in the registry to preserve
> stability. Secondly, the UN M.49 standard defines regional codes for areas
> such as "Central and South America" which can be useful in forming language
> tags for larger regions.
>
> Future-Proofing: Private Use and Extensions
>
> Because of the widespread use of language tags, it is potentially disruptive
> to have periodic revisions of the core specification, despite demonstrated
> need. This specification addresses this problem by fully specifying the
> valid syntax of language tags, while providing for future, unforeseen,
> requirements. One of these mechanisms is the extlang subtags, which allow
> for future extensions of ISO 639, in particular, ISO 639-3.
>
> Private use subtags are another of these mechanisms. In RFC 3066, any tag
> that was not registered or wholly made up of generative subtags must be
> completely tagged as private use. Recipients of such a tag are not allowed to
> infer any information from such a tag, except by private agreement. Thus if
> any private-use information needed to be included in the tag, the entire tag
> had to be private use, making the entire tag uninterpretable to other
> implementations.
>
> This specification allows for private use subtags in a particular, prescribed
> manner.
> Consider the IANA registered tag 'sl-nedis', which represents the
> Natisone dialect of Slovenian. The subtag 'sl' is a valid ISO 639-1 code for
> Slovenian. Prior to its registration with IANA, if users wished to tag
> content as being in the Natisone dialect, they had two choices for language
> tags: 'sl' and 'x-sl-nedis' (or similar). The first tag does not meet the
> need of distinguishing the text from other varieties of Slovenian, while the
> second one does not convey the relationship to Slovenian to outside
> processors (a human might look at the tag and infer Slovenian, but the 'sl'
> subtag doesn't necessarily represent that language).
>
> Under this specification, if a new dialect of Slovenian were needed (let's
> call it the 'xyzzy' dialect), a tag such as 'sl-x-xyzzy' can be used. In
> fact, a quite comprehensive amount of information can be communicated:
> 'sl-Latn-IT-x-xyzzy' would represent Slovenian written using the Latin script
> as used in Italy with some additional private distinguishing information
> (which implementations of this specification can match algorithmically).
>
> Note that RFC 3066 private use tags are still permitted and have the same
> information content and treatment as they did previously.
>
> The extension mechanism also provides a way for independent RFCs to define
> extensions to language tags. These extensions have a very constrained,
> well-defined structure to prevent extensions from interfering with
> implementations of this specification (or RFC 3066).
>
> Matching and Language Negotiation
>
> Content tagging is only one of the applications for language tags. The other
> major applications are querying for matches and content negotiation.
> RFC 3066 defines "language ranges" for use in content negotiation and
> querying and describes a very simple matching algorithm.
> This specification maintains compatibility with this language negotiation
> scheme, while providing additional information on the implementation of
> language matching.
>
> Well-Formed vs. Validating
>
> Existing language tag processors already fall into two categories. There are
> language tag processors that check if language tags have the proper,
> well-formed, syntax, but which do not validate their content, and there are
> language tag processors that in addition validate and reject unrecognized
> tags. Each of these categories is appropriate to different implementations.
> For example, to process incoming tags that may have been formed under a
> future registry, an implementation may restrict itself to only checking
> well-formedness. Another implementation that allows users to generate tags
> may fully validate.
>
> This specification clearly distinguishes these two possible classes of
> conformance, and provides an explicit, testable definition of each one.
>
> Impact of the New Design on Existing Implementations
>
> One concern that is crucial to acceptance of the new language tag design is
> how it works with existing implementations of RFC 3066 and how existing
> implementations will interact with implementations of the newer language
> tags.
>
> It is important to recognize that all language tags that were valid under the
> existing RFC 3066 will remain valid, with their meanings intact, under this
> specification. In fact, this specification stabilizes these meanings so that
> existing implementations can be continued forward for as long as
> necessary. Content, regardless of its format, will remain valid, essentially
> forever.
>
> As content and systems begin to make use of the new language tags by adopting
> the additional fields defined by this specification, there will be an impact
> on software and systems that expect only the older tags.
> The design of this specification was carefully created so that all of the
> new values that can be assigned fit the pattern for registered language tags
> under RFC 3066. Thus while existing implementations will not recognize the
> meaning in the tags, they will be able to process them as if they were
> unrecognized-but-well-formed registered tags.
>
> In addition, although this specification acknowledges the possibility of
> alternate or advanced matching and negotiation strategies, it maintains the
> existing matching algorithm (by removing subtags from the right side of a
> language tag until a match is obtained), simply providing more detail on
> usage.
>
> Summary
>
> The authors of this specification have worked for the past year with a wide
> range of experts in the language tagging community to build consensus on a
> design for language tags that meets the needs and requirements of the user
> community. Language tags form a basic building block for natural language
> support in computer systems and content. The revision proposed in this
> specification addresses the needs of this community of users with a minimal
> impact on existing content and implementations, while providing a stable
> basis for future development, expansion, and improvement.
>
> _______________________________________________
> IETF-Announce mailing list
> IETF-Announce@ietf.org
> https://www1.ietf.org/mailman/listinfo/ietf-announce
>
> -------------------------------------------------------
>

--
Michael Brauer                              Phone:  +49 40 23646 500
Technical Architect Software Engineering    Fax:    +49 40 23646 550
StarOffice Development
Sun Microsystems GmbH
Sachsenfeld 4
D-20097 Hamburg, Germany                    e-mail: michael.brauer@sun.com
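
For reference, the matching rule the forwarded text describes (removing subtags from the right side of a tag until a match is obtained) can be sketched in a few lines of Python. This is only an illustration of that one sentence, not code taken from RFC 3066 or the draft. The function name and the idea of matching against a plain list of supported tags are assumptions. Comparison is case-insensitive, as RFC 3066 requires for tags.

```python
from typing import Iterable, Optional


def fallback_match(requested: str, supported: Iterable[str]) -> Optional[str]:
    """Return the supported tag that best matches the requested one.

    The requested tag is shortened one subtag at a time from the right
    until it equals one of the supported tags. Returns None when even
    the primary subtag has no match.
    """
    # Map the lower-cased form back to the spelling the caller supplied.
    by_lowercase = {tag.lower(): tag for tag in supported}
    subtags = requested.lower().split("-")
    while subtags:
        candidate = "-".join(subtags)
        if candidate in by_lowercase:
            return by_lowercase[candidate]
        subtags.pop()  # drop the rightmost subtag and try again
    return None


if __name__ == "__main__":
    print(fallback_match("sl-Latn-IT-x-xyzzy", ["en-US", "sl"]))
    print(fallback_match("zh-Hans-SG", ["zh", "zh-Hans"]))
    print(fallback_match("fr-CA", ["en"]))
```

With this sketch, a processor that only knows 'sl' still finds a match for 'sl-Latn-IT-x-xyzzy', which is the backward-compatibility behaviour the text argues for.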