regrep message

Subject: Re: natural language
From: Terry Allen <tallen@sonic.net>
To: regrep@lists.oasis-open.org
Date: Sat, 1 Apr 2000 12:45:23 -0800

Lisa wrote:
| In thinking about the  information that a submitter would want a user to
| know, the issue of natural languages has come up.
| 
| Two questions: 
| 
| Should we specify an element that contains the natural language that the
| metadata is written in?

This has started a long thread.  The answer is no, because we already
have an xml:lang attribute for just this purpose, consistent with
other XML applications, on data-element-dictionary and d.e.d-set.
If we need it on other elements, please specify which ones.

| Should we specify an element that contains the natural language that the
| registered-item (data element) is written in?   I'm assuming that
| XML-related  and SGML-related documents can be written in natural languages
| other than English?

Your assumption is correct; but a data-element itself isn't written
in a natural language, only its name and description and maybe
a few other bits.  The assumption I made was that an entire d.e.d.
would use the same natural language consistently.

| We could use language-code as defined by ISO639 and a subcode of
| country-code as defined by ISO3166.  (This is IMS' usage.)

It's also the XML 1.0 usage.

Yutaka replied:
|  > Should we specify an element that contains the natural language that the
|  > metadata is written in?
|  
|  I'm not sure what you want to do with that information, but probably,
|  I guess the problem here is not that it lacks the information about
|  natural language, but that it lacks the information about the
|  encoding in which the element is.

That is given at the level of the XML instance, not at the element
level, so we don't have to worry about it.

|  However, it brings up another question: is it likely that
|  one element is written in one language and the other element is
|  in another language?

I'm assuming not.  We have enough to worry about already.

|  > Should we specify an element that contains the natural language that the
|  > registered-item (data element) is written in?   I'm assuming that
|  > XML-related  and SGML-related documents can be written in natural languages
|  > other than English?
|  
|  Yes, I think so. Also, in this case, it's very likely that one
|  registered item is written in one language and the other item is
|  written in the other language even if they are in one 'packaged'
|  object.

As we haven't any experience yet, it is not reasonable to assert that
language mixing is "very likely".  And as we're dealing with 
registered items that are data element dictionaries it is confusing,
to say the least, to drag in new terminology ('packaged' object).

That said, we probably need xml:lang on classification scheme and
on the related-data element I'll be inserting in the revision I'm
working on now.

Lisa:
| Regarding the natural language of the metadata:   Is there a possibility
| that an organization (an SO) would want to register an object, and provide
| the metadata to that object in a language other than English?  If I, as a
| user, then wish to read the metadata about the object, wouldn't it be
| useful to know the language that I am looking at (assuming that I can't
| determine the language used merely by looking at it?).
| 
| I'm willing to accept that I may be the only one who sees this as useful
| information.

No, it's useful information, and already provided for at a coarse
level of granularity.  

Robin adds:
| What's happening here, I observe, is that we are discussing
| meta-data for metadata... (just one of the reasons I think the
| term/concept of "metadata" is highly problematic).  Because
| this is a recursive problem, most major DTDs (and indeed,
| even the XML specification) use a lang="" attribute as
| "global" (applicable to the smallest factoid encoded in
| a sub-string, down to the character-level).  As in TEI:
| 
|   The "lang" attribute indicates the language, writing system, and
|   character set associated with a given element and all its contents.
|   If it is not specified, the value is inherited from that of the
|   immediately enclosing element.
| 
| XML (in xml:lang) http://www.w3.org/TR/REC-xml#sec-lang-tag
| got this "wrong", I think, but the committee decision
| reflects the nature of the problem:
|
|    "2.12 Language Identification: In document processing, it is
|    often useful to identify the natural or formal language in 
|    which the content is written. A special attribute named
|    xml:lang may be inserted in documents to specify the 
|    language used in the contents and attribute values of any 
|    element in an XML document... The intent declared with xml:lang 
|    is considered to apply to all attributes and content of the 
|    element where it is specified, unless overridden with an instance 
|    of xml:lang on another element within that content."
| 
| This is inadequate for most cases of multi-lingual development and
| implementation because it assumes that the value of the "xml:lang"
| attribute specifies in all cases the (one) language of several
| different information items:
| 
| * the element-type name
| * the attribute names
| * the values of the other attributes (in the start-tag)
| * the PCDATA content in the element, and
| * the language of information in subelements, unless overridden
| 
| ... which sort of makes it impossible to gloss a German text
| using English language in the XML markup language.

No.  The XML spec specifically deals only with content, not the
language, if any, in which the element-type names, attribute names,
and attribute values are written.  Those of course could be documented
in the DTD, but they are not information items necessary to supply in
the instance.  As for subelements, overriding is already specified by
the XML spec.  I just don't see an issue here.
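
The override rule can be checked with a minimal Python sketch (not part
of any regrep tooling): an element's own xml:lang wins, otherwise the
value in force on its parent is inherited.  The instance is hypothetical;
data-element-dictionary and definition-text are element names from the
DTD under discussion, but the name element and the xml:lang on
definition-text are assumed purely for illustration.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def effective_langs(elem, inherited=None, out=None):
    """Return (tag, language) pairs for elem and its descendants, in document order."""
    if out is None:
        out = []
    # The element's own xml:lang overrides whatever was inherited.
    lang = elem.get(XML_LANG, inherited)
    out.append((elem.tag, lang))
    for child in elem:
        effective_langs(child, lang, out)
    return out

# Hypothetical instance: a German dictionary with one English definition.
doc = ET.fromstring(
    '<data-element-dictionary xml:lang="de">'
    '<name>Kundennummer</name>'
    '<definition-text xml:lang="en">Customer number</definition-text>'
    '</data-element-dictionary>'
)
langs = effective_langs(doc)
```

Here name inherits "de" from the dictionary, while definition-text
overrides it with "en".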

| So whatever we specify as being governed by a language
| descriptor, it should be singular, or factorable.
| 
| I do agree that any complete description of an artifact in any
| descriptive notation should include an indication of the natural 
| (or other) language(s) used.  The ability to filter on this
| information is often critical, as I noted in a post this morning:
| 
|   "I found some slides on transaction ACIDity
|    [http://www.insa-lyon.fr/People/LISI/laurini/disic/feder/sld043.htm].
|    Unfortunately, they're in French and I don't speak French..."
| 
| The designation of language encoding for machine purposes is even
| more critical, as we all know.  Here, it's necessary to isolate
| language from script (Hebrew can be written in Arabic), and other
| aspects of writing systems.

So you need to know the encoding (for Hebrew written in Arabic script),
but that comes at the document level.

| The principal question may be: which "factoids" in the RR specification
| should be supported by a language descriptor?

That's the only question.

| On "localization": I suggest having a look at TEI "Writing System
| Declaration" at
| 
| http://etext.lib.virginia.edu/bin/tei-tocs?div=DIV1&id=WD

The TEI WSD is a wonderful thing, but I think that as we've 
decided to use XML we don't need its complexity.

Yutaka replied:
|  > The designation of language encoding for machine purposes is even
|  > more critical, as we all know.  Here, it's necessary to isolate
|  > language from script (Hebrew can be written in Arabic), and other
|  > aspects of writing systems.
|  
|  Sorry, I don't understand what you said. Could you explain a little more?
|  What I meant by encoding was 'encoding scheme', such as iso8859-1,
|  eucjp, gb2312, etc. In that sense, for a computational purpose,
|  it doesn't matter what script is used. Hebrew is 8859-8 and Arabic is
|  8859-6, so we can process the content correctly if we knew those
|  encodings.

No, just because you know the encoding is ISO 8859-8 doesn't mean
you know that the content is Hebrew.  You may be able to render the
content correctly (in Hebrew script) but not process it correctly
for all purposes (sorting by language, for example).
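
A small Python sketch of the point, using only the standard library.
The byte string is an illustrative choice, not taken from any registry
data: under ISO 8859-8 it decodes to a word that is spelled identically
in Hebrew and in Yiddish, so the encoding identifies the script but
leaves the language open.

```python
import unicodedata

# shin, lamed, vav, final mem as encoded in ISO 8859-8
raw = bytes([0xF9, 0xEC, 0xE5, 0xED])
text = raw.decode("iso8859_8")

# The encoding is enough to recover the script of every character ...
scripts = {unicodedata.name(ch).split()[0] for ch in text}

# ... but nothing in the bytes or the encoding name says whether the word
# is Hebrew ("shalom") or Yiddish ("sholem"); that needs a language label
# such as xml:lang.
```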

Robin replied:
| This is probably off topic.  I'm talking about natural language
| processing based upon linguistic features of written text.  When
| a word/phrase is transliterated or borrowed from one
| language into another (as when a Hebrew word is written in
| Arabic script), the word/phrase in the new context has
| linguistic properties that cannot be deduced from the encoding
| or script.  While simple display might work (direction of
| character flow, kerning, etc.), other processing would
| fail (correct word wrap, spell checking, thesaurus, and
| so forth).  In brief: a script or encoding does not always
| tell you what language the text is "in".  This is why (mere)
| "localization" does not work, of itself, in a
| multilingual setting.  Multilingualism necessitates true
| linguistic knowledge, where "internationalization"
| often (believes it) does not.

Exactly; but the combination of encoding and language (which
we *already have in the DTD*) is enough.  There are wrinkles
in the use of Unicode (it would be impossible to sort documents
according to what script they used if you knew only that they
were in UTF-8 without peeking inside them), but there is no
problem here that users cannot avoid by labelling the language
and encoding they use and refraining from writing languages in
other than their customary scripts.

Lisa:
| So what is the answer?  Can someone make a proposal?

Yes, forget about it until and unless some problem actually arises ...

Len writes:
| Yuta makes the following proposal for labeling language specific entries in
| the OASIS registry/repository and asks Lisa what items she thinks should be
| so labeled.
| 
| >From Yuta Yoshida, March 31 2000
| >Metadata of metadata
| >  MetaDataLang: String
| >	- iso639 + iso3166, specified in RFC1766.
| >	ex. en-US, ja-JP
| >  MetaDataEncoding : String
| >	- character set name or alias assigned by IANA.
| >	ex. ISO_8859-1, EUC-JP
| >
| >For the component item(metadata of registered item)
| >  ItemLang: String
| >  ItemEncoding: String

It is completely unnecessary to speak of metadata of metadata.
In XML terms all we need to know is the language and encoding
of the document.

| >I don't know where the first one(metadata for metadata) should go
| >into. They don't seem to fit into any of 'related data' discussions.
| >Lisa, going back to your first mail, where did you want to put
| >the language information into?
| 
| Lisa and I discussed this for a bit this morning and have a proposal to
| make. We think there are three things to consider, the two items Yuta
| defines above and "multiplicity", where multiplicity is the number of
| occurrences of an XML element. 
| 
| We think that a "language" specification should always be optional, else
| we'd get into too many arguments over the differences, e.g. en-UK versus
| en-US. But there should always be a default encoding.  We choose ISO
| 8859-1, i.e. Latin-1, which consists of all characters commonly used in
| most western languages, as the default.

XML specifies Unicode, though with some shifty language that lets you
weasel out of using it.  8859-1 is entirely inadequate even for
European languages.  We don't have to get involved in the matter:
implementors who are creating interfaces for humans to use can
restrict the encodings used; those accepting already packaged
submissions can insist on an encoding declaration and refuse
submissions in encodings (and languages) they don't want to support.
But those are implementation issues for individual RAs.
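
To illustrate why ISO 8859-1 (Latin-1) falls short as a default, a
minimal Python sketch.  The sample strings are illustrative choices
(only "Amour" appears elsewhere in this message), and the helper
encodable is a hypothetical name.

```python
def encodable(text, encoding):
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

# One French, one Polish, and one Greek sample, keyed by language code.
samples = {"fr": "Amour", "pl": "Łódź", "el": "Αθήνα"}

latin1_ok = {lang: encodable(s, "iso-8859-1") for lang, s in samples.items()}
utf8_ok = {lang: encodable(s, "utf-8") for lang, s in samples.items()}
```

Only the French sample survives Latin-1; all three can be written in
UTF-8.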

| We took a quick glance through the specification and see at least the
| following places where it makes sense (to us) to consider adding language,
| encoding, or multiplicity declarations to the DTD's.
| 
|  1) <data-element>.<data-element-concept>.<definition-text>
|     <data-element-dictionary>.<data-element-concept>.<definition-text>
| 
|     Add multiplicity (right now only one <definition-text> is allowed).
|     Allow ItemLang and ItemEncoding specifications as attributes on each 
|     occurrence of <definition-text>, or possibly as components of 
|     <definition-text>.  We don't have a good feeling for when to use
|     elements and when to use attributes so will let someone else make 
|     that call.

We can add xml:lang at this level of granularity, but I think someone
needs to construct a scenario to justify allowing this kind of
complexity.  

As for multiplicity, allowing elements to occur more often is the
wrong way to go, because it does not permit sufficient control of
content.  DocBook has a pretty good answer in <phrase>:

<sect1>
<title><!-- only one title allowed per section -->
<phrase lang="en">Love</phrase>
<phrase lang="fr">Amour</phrase>
</title> ...

which Sun actually uses in its documentation.  This way you
can set your style sheet to display the language you want,
all of them, any choice of them, and not debauch your 
multiplicity for <title>.  We can add in <phrase>, but as
I say, we need a good reason to complicate things right now.
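
What a style sheet would do with such a title can be sketched in
Python.  This is only an illustration under stated assumptions: the
function name pick_phrases is hypothetical, and the title is the one
from the DocBook example above, closed so that it parses on its own.

```python
import xml.etree.ElementTree as ET

title = ET.fromstring(
    '<title>'
    '<phrase lang="en">Love</phrase>'
    '<phrase lang="fr">Amour</phrase>'
    '</title>'
)

def pick_phrases(title_elem, wanted=None):
    """Return the text of each <phrase> whose lang is in wanted (all if wanted is None)."""
    return [
        p.text
        for p in title_elem.findall("phrase")
        if wanted is None or p.get("lang") in wanted
    ]

french_only = pick_phrases(title, {"fr"})
everything = pick_phrases(title)
```

The title still occurs exactly once; the choice of language is made
at display time.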

|  2) <data-element>.<name-context>.<name-context-label>
|     <data-element>.<name-context>.<designation-name>
|     <data-element-dictionary>.<name-context>.<name-context-label>
|     <data-element-dictionary>.<name-context>.<designation-name>
| 
|     Multiplicity is already present for <name-context>+.
|     We assume that <name-context-label> and <designation-name> will 
|     likely come from the same language and encoding, so propose that
|     ItemLang and ItemEncoding be added as attributes on each
|     occurrence of <name-context>.

We could do that, to support multilingual dictionaries, but we
might need another layer of markup to handle the case where the
name context is a programming language for which people use
multiple natural languages.  Again, scenario needed.

|  3) <uri-reference>
| 
|     This is a tough one.  In many cases the <uri-reference> will be 
|     a URL or URN, thus the encoding rules have already been determined.

The encoding is given at the instance level.

|     In other cases the <uri-reference> will just default to whatever
|     encoding scheme is used for the submission package itself.  We 

In ALL cases, and it will not default to something connected with
the submission package but to the encoding of the instance.

|     don't have a proposal right now - maybe it's best to do nothing
|     until we understand how the default encoding for a submission package
|     is determined and understand the intent of a <uri-reference> better.

You have not distinguished between the URI (which we need not deal
with) and the content of the uri-reference element - I can see it
might be quite useful even now to allow xml:lang on that element,
in case one is pointing to something like a classification scheme
in another language.

|  4) <data-element>.<representation>.<character-set-name>
|     <data-element-dictionary>.<representation>.<character-set-name>
| 
|     It's our understanding that <character-set-name> identifies the 
|     default encoding to be used when accessing the identified item
|     in the repository.  It is expressed in the default encoding for
|     the submission itself - already determined elsewhere.  So no
|     additions or modifications are needed here.

Character-set-name is in the DTD because it's in 11179, and it may not 
even be useful for many registered items.
 
| PROPOSAL
| 
| 1) In the file "data-element.dtd", in the definition of the element 
|    <data-element-concept>, replace "definition-text" by "definition-text+".
| 
|    Add the following attribute specification:
| 
|    <!ATTLIST definition-text
|            language     CDATA    #IMPLIED
|            encoding     CDATA    "ISO 8859-1"
|    >

Amended, the proposal would be simply to add xml:lang where
requested; unless there is overwhelming interest in proceeding
without a supporting scenario, let's leave the level of granularity
where it is.  As soon as you come up with a problem (and IMS is
more likely to than XML.org, I'll bet), let us know.

[rest of proposal snipped]

I'll get back to my backlog of regrep mail (and continue revision
of the spec) tomorrow after the basketball is over ...

best regards, Terry
Follow-Ups:
- Language/Encoding tags for Definitions and Names
  - From: Len Gallagher <LGallagher@nist.gov>