
regrep message


Subject: Re: natural language


Yuta makes the following proposal for labeling language specific entries in
the OASIS registry/repository and asks Lisa what items she thinks should be
so labeled.

>From Yuta Yoshida, March 31 2000
>Metadata of metadata
>  MetaDataLang: String
>	- iso639 + iso3166, specified in RFC1766.
>	ex. en-US, ja-JP
>  MetaDataEncoding : String
>	- character set name or alias assigned by IANA.
>	ex. ISO_8859-1, EUC-JP
>For the component item(metadata of registered item)
>  ItemLang: String
>  ItemEncoding: String
>I don't know where the first one(metadata for metadata) should go
>into. They don't seem to fit into any of 'related data' discussions.
>Lisa, going back to your first mail, where did you want to put
>the language information into?

Lisa and I discussed this for a bit this morning and have a proposal to
make. We think there are three things to consider, the two items Yuta
defines above and "multiplicity", where multiplicity is the number of
occurrences of an XML element. 

We think that a "language" specification should always be optional, else
we'd get into too many arguments over the differences, e.g. en-GB versus
en-US. But there should always be a default encoding.  We choose ISO
8859-1, i.e. Latin-1, which consists of all characters commonly used in
most western languages, as the default.
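
To illustrate why a per-item encoding may still be needed on top of a
Latin-1 default, here is a minimal Python sketch. The helper name
encodable and the sample strings are assumptions made for illustration;
they are not part of the proposal.

```python
def encodable(text, encoding="latin-1"):
    """Return True if every character of text can be encoded in encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Hypothetical sample strings, written with escapes to stay ASCII-safe.
WESTERN = "caf\u00e9"              # "cafe" with e-acute
JAPANESE = "\u65e5\u672c\u8a9e"    # three kanji

print(encodable(WESTERN))             # Latin-1 covers this text
print(encodable(JAPANESE))            # Latin-1 does not cover this text
print(encodable(JAPANESE, "euc_jp"))  # an explicit encoding would
```

Under this sketch, western text fits the default, while a Japanese
entry would need an explicit encoding such as EUC-JP.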

We took a quick glance through the specification and see at least the
following places where it makes sense (to us) to consider adding language,
encoding, or multiplicity declarations to the DTDs.

 1) <data-element>.<data-element-concept>.<definition-text>

    Add multiplicity (right now only one <definition-text> is allowed).
    Allow ItemLang and ItemEncoding specifications as attributes on each 
    occurrence of <definition-text>, or possibly as components of 
    <definition-text>.  We don't have a good feeling for when to use
    elements and when to use attributes so will let someone else make 
    that call.

 2) <data-element>.<name-context>.<name-context-label>

    Multiplicity is already present for <name-context>+.
    We assume that <name-context-label> and <designation-name> will 
    likely come from the same language and encoding, so propose that
    ItemLang and ItemEncoding be added as attributes on each
    occurrence of <name-context>.

 3) <uri-reference>

    This is a tough one.  In many cases the <uri-reference> will be 
    a URL or URN, thus the encoding rules have already been determined.
    In other cases the <uri-reference> will just default to whatever
    encoding scheme is used for the submission package itself.  We 
    don't have a proposal right now - maybe it's best to do nothing
    until we understand how the default encoding for a submission package
    is determined and understand the intent of a <uri-reference> better.

 4) <data-element>.<representation>.<character-set-name>

    It's our understanding that <character-set-name> identifies the 
    default encoding to be used when accessing the identified item
    in the repository.  It is expressed in the default encoding for
    the submission itself - already determined elsewhere.  So no
    additions or modifications are needed here.


1) In the file "data-element.dtd", in the definition of the element 
   <data-element-concept>, replace "definition-text" by "definition-text+".

   Add the following attribute specification:

   <!ATTLIST definition-text
           language     CDATA    #IMPLIED
           encoding     CDATA    "ISO 8859-1">

   Note to Editor: Substitute better XML as appropriate.

2) In the file "data-element.dtd", after the definition of the element 
   <name-context>, add the following attribute specification:

   <!ATTLIST name-context
           language     CDATA    #IMPLIED
           encoding     CDATA    "ISO 8859-1">

   Note to Editor: Substitute better XML as appropriate.
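
As a rough sketch of how a consumer might read the proposed attributes,
the following Python fragment parses a hypothetical
<data-element-concept> with two <definition-text> occurrences. The
sample content and the function name definitions are assumptions. The
default encoding is applied in code instead of relying on a DTD-aware
parser.

```python
import xml.etree.ElementTree as ET

DEFAULT_ENCODING = "ISO 8859-1"  # default proposed in this message

# Hypothetical instance: one <definition-text> per language.
SAMPLE = """\
<data-element-concept>
  <definition-text language="en-US">Postal code of the delivery address.</definition-text>
  <definition-text language="ja-JP" encoding="EUC-JP">Haisou-saki no yuubin bangou.</definition-text>
</data-element-concept>
"""


def definitions(xml_text):
    """Return a (language, encoding, text) tuple for each <definition-text>."""
    root = ET.fromstring(xml_text)
    result = []
    for node in root.findall("definition-text"):
        language = node.get("language")                    # optional: may be None
        encoding = node.get("encoding", DEFAULT_ENCODING)  # fall back to the default
        result.append((language, encoding, node.text))
    return result


for entry in definitions(SAMPLE):
    print(entry)
```

The same lookup would apply to <name-context>, with the attributes read
from that element instead.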

Semantic Rules:

   a) language is in the format "xx" or "xx-yy" where xx is a language 
      identifier from ISO 639 and yy is a country identifier from 
      ISO 3166, both as specified in RFC1766.

   b) encoding is a reference to an encoding scheme from ISO/IEC 10646, 
      or some other appropriate international standard, e.g. ISO 8859-x.

   c) Note: ISO 8859-1 only allows 191 visible characters and blanks,
      so 8-bit control characters may not port properly unless a more 
      inclusive encoding is specified.
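
The semantic rules above could be checked along the following lines.
This is a minimal Python sketch under stated assumptions: the function
names are hypothetical, the language check tests only the "xx" or
"xx-yy" shape (not membership in ISO 639 or ISO 3166), and the encoding
check accepts only names that Python's codec registry happens to know,
which is not the same set as the IANA registry.

```python
import codecs
import re
import unicodedata

# Rule (a): two-letter language code, optionally followed by "-" and a
# two-letter country code. Shape only; the codes are not looked up.
LANGUAGE_RE = re.compile(r"[A-Za-z]{2}(-[A-Za-z]{2})?")


def is_valid_language(tag):
    return LANGUAGE_RE.fullmatch(tag) is not None


def is_known_encoding(name):
    # Rule (b), approximated: accept names known to Python's codecs.
    try:
        codecs.lookup(name)
        return True
    except LookupError:
        return False


def count_latin1_graphic():
    # Rule (c): count ISO 8859-1 code points that are not control
    # characters (Unicode general category "Cc").
    return sum(1 for b in range(256) if unicodedata.category(chr(b)) != "Cc")


print(is_valid_language("en-US"), is_valid_language("english"))
print(is_known_encoding("EUC-JP"))
print(count_latin1_graphic())
```

The count covers 95 characters in the range 0x20-0x7E plus 96 in the
range 0xA0-0xFF, which matches the figure of 191 given in rule (c).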

Len Gallagher                             LGallagher@nist.gov
NIST                                      Work: 301-975-3251
Bldg 820  Room 562                        Home: 301-424-1928
Gaithersburg, MD 20899-8970 USA           Fax: 301-948-6213
