Subject: Re: natural language
All,

Yuta makes the following proposal for labeling language-specific entries in the OASIS registry/repository and asks Lisa what items she thinks should be so labeled.

>From Yuta Yoshida, March 31 2000
>Metadata of metadata
>  MetaDataLang: String
>    - iso639 + iso3166, specified in RFC1766.
>      ex. en-US, ja-JP
>  MetaDataEncoding : String
>    - character set name or alias assigned in IANA.
>      ex. ISO_8859-1, EUC-JP
>
>For the component item(metadata of registered item)
>  ItemLang: String
>  ItemEncoding: String
>
>I don't know where the first one(metadata for metadata) should go
>into. They don't seem to fit into any of 'related data' discussions.
>Lisa, going back to your first mail, where did you want to put
>the language information into?

Lisa and I discussed this for a bit this morning and have a proposal to make. We think there are three things to consider: the two items Yuta defines above and "multiplicity", where multiplicity is the number of occurrences of an XML element.

We think that a "language" specification should always be optional; otherwise we'd get into too many arguments over the differences, e.g. en-GB versus en-US. But there should always be a default encoding. We choose ISO 8859-1, i.e. Latin-1, which contains the characters commonly used in most western languages, as the default.

We took a quick glance through the specification and see at least the following places where it makes sense (to us) to consider adding language, encoding, or multiplicity declarations to the DTDs.

1) <data-element>.<data-element-concept>.<definition-text>
   <data-element-dictionary>.<data-element-concept>.<definition-text>

   Add multiplicity (right now only one <definition-text> is allowed). Allow ItemLang and ItemEncoding specifications as attributes on each occurrence of <definition-text>, or possibly as components of <definition-text>. We don't have a good feeling for when to use elements and when to use attributes, so we will let someone else make that call.

2) <data-element>.<name-context>.<name-context-label>
   <data-element>.<name-context>.<designation-name>
   <data-element-dictionary>.<name-context>.<name-context-label>
   <data-element-dictionary>.<name-context>.<designation-name>

   Multiplicity is already present for <name-context>+. We assume that <name-context-label> and <designation-name> will likely come from the same language and encoding, so we propose that ItemLang and ItemEncoding be added as attributes on each occurrence of <name-context>.

3) <uri-reference>

   This is a tough one. In many cases the <uri-reference> will be a URL or URN, so the encoding rules have already been determined. In other cases the <uri-reference> will just default to whatever encoding scheme is used for the submission package itself. We don't have a proposal right now. Maybe it's best to do nothing until we understand how the default encoding for a submission package is determined and understand the intent of a <uri-reference> better.

4) <data-element>.<representation>.<character-set-name>
   <data-element-dictionary>.<representation>.<character-set-name>

   It's our understanding that <character-set-name> identifies the default encoding to be used when accessing the identified item in the repository. It is expressed in the default encoding for the submission itself, which is already determined elsewhere. So no additions or modifications are needed here.

PROPOSAL

1) In the file "data-element.dtd", in the definition of the element <data-element-concept>, replace "definition-text" by "definition-text+". Add the following attribute specification (language is optional; encoding carries a default value, so it takes no #IMPLIED keyword):

   <!ATTLIST definition-text
       language CDATA #IMPLIED
       encoding CDATA "ISO 8859-1" >

   Note to Editor: Substitute better XML as appropriate.

2) In the file "data-element.dtd", after the definition of the element <name-context>, add the following attribute specification:

   <!ATTLIST name-context
       language CDATA #IMPLIED
       encoding CDATA "ISO 8859-1" >

   Note to Editor: Substitute better XML as appropriate.

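To make the two attribute proposals concrete, here is a minimal sketch of what a labeled submission fragment might look like and how a consumer could read the labels. The element names come from the message; the nesting, the text content, the attribute values, and the helper name `labels` are all invented for illustration and are not taken from the specification. Python's ElementTree does not apply DTD attribute defaults, so the sketch applies the proposed default encoding by hand.

```python
import xml.etree.ElementTree as ET

# Default encoding proposed in the message.
DEFAULT_ENCODING = "ISO 8859-1"

# Hypothetical, simplified fragment: only the element and attribute names
# are from the proposal; everything else is made up for this example.
SAMPLE = """
<data-element>
  <data-element-concept>
    <definition-text language="en-US">Date the order was placed.</definition-text>
    <definition-text language="ja-JP" encoding="EUC-JP">(Japanese text)</definition-text>
    <definition-text>Definition with no language label.</definition-text>
  </data-element-concept>
  <name-context language="en">
    <name-context-label>Purchasing</name-context-label>
    <designation-name>OrderDate</designation-name>
  </name-context>
</data-element>
""".strip()


def labels(xml_text):
    """Return (tag, language, encoding) for each element that carries labels.

    language is None when the optional attribute is absent; encoding falls
    back to the proposed default when the attribute is absent.
    """
    root = ET.fromstring(xml_text)
    found = []
    for elem in root.iter():
        if elem.tag in ("definition-text", "name-context"):
            language = elem.get("language")
            encoding = elem.get("encoding", DEFAULT_ENCODING)
            found.append((elem.tag, language, encoding))
    return found


if __name__ == "__main__":
    for row in labels(SAMPLE):
        print(row)
```

The fragment shows the intended effect of "definition-text+": several definitions of the same concept, each optionally labeled with its own language, while <name-context> carries one label for both of its children.
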
Semantic Rules:

a) language is in the format "xx" or "xx-yy", where xx is a language identifier from ISO 639 and yy is a country identifier from ISO 3166, both as specified in RFC1766.

b) encoding is a reference to an encoding scheme from ISO/IEC 10646, or some other appropriate international standard, e.g. ISO 8859-x.

c) Note: ISO 8859-1 only allows 191 visible characters and blanks, so 8-bit control characters may not port properly unless a more inclusive encoding is specified.

**************************************************************
Len Gallagher                     LGallagher@nist.gov
NIST                              Work: 301-975-3251
Bldg 820  Room 562                Home: 301-424-1928
Gaithersburg, MD 20899-8970 USA   Fax:  301-948-6213
**************************************************************
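
Semantic rules (a) and (b) could be checked mechanically along the following lines. This is only a sketch under stated assumptions: the function names are hypothetical, the language check tests the "xx" or "xx-yy" shape without looking the codes up in ISO 639 or ISO 3166, and Python's codec registry is used as a stand-in for whatever encoding registry the repository would actually consult.

```python
import codecs
import re

# Shape of rule (a): two letters, optionally followed by "-" and two letters.
# This does not verify that the codes exist in ISO 639 or ISO 3166.
LANG_SHAPE = re.compile(r"[A-Za-z]{2}(-[A-Za-z]{2})?")


def language_ok(tag):
    """True if tag has the 'xx' or 'xx-yy' shape described in rule (a)."""
    return LANG_SHAPE.fullmatch(tag) is not None


def encoding_known(name):
    """True if Python's codec registry recognizes the encoding name.

    Assumption: the codec registry is only a convenient approximation of
    an authoritative list of encoding names.
    """
    try:
        codecs.lookup(name)
        return True
    except LookupError:
        return False


if __name__ == "__main__":
    for tag in ("en", "en-US", "ja-JP", "english", "en_US"):
        print(tag, language_ok(tag))
    for name in ("ISO-8859-1", "EUC-JP", "no-such-charset"):
        print(name, encoding_known(name))
```
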