regrep message

Subject: Re: natural language
From: Yutaka Yoshida <Yutaka.Yoshida@eng.sun.com>
To: regrep@lists.oasis-open.org
Date: Thu, 30 Mar 2000 14:27:50 -0800 (PST)

Thanks for the quick reply, Robin.

Yes, I think you're right - natural language processing
seems a little off topic, and I feel it's beyond the
scope of regrep. However, I think having locale and encoding
info is very helpful because it could be used not only
for natural language processing, but also for other text
manipulation such as collation, sorting, and formatting.
So, how about having the following elements:

Metadata of metadata
  MetaDataLang: String
	- ISO 639 language code + ISO 3166 country code, as specified in RFC 1766.
	ex. en-US, ja-JP
  MetaDataEncoding: String
	- character set name or alias registered with IANA.
	ex. ISO_8859-1, EUC-JP

For the component item (metadata of the registered item)
  ItemLang: String
  ItemEncoding: String
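
To make the proposal concrete, here is a minimal sketch in Python of
how a registry might check these two kinds of values. The function
names (is_valid_lang, is_known_encoding, check_metadata) are
hypothetical and not part of any regrep draft. The language check
follows the RFC 1766 tag grammar only; it does not verify that the
codes are actually listed in ISO 639 or ISO 3166. The encoding check
uses Python's own codec table as a stand-in for the IANA registry, so
it is only an approximation.

```python
import codecs
import re

# RFC 1766 grammar: a primary tag of 1-8 letters, followed by zero or
# more "-"-separated subtags of 1-8 letters each (e.g. en-US, ja-JP).
_LANG_TAG = re.compile(r"[A-Za-z]{1,8}(-[A-Za-z]{1,8})*")


def is_valid_lang(tag):
    """Return True if tag is syntactically a valid RFC 1766 language tag."""
    return _LANG_TAG.fullmatch(tag) is not None


def is_known_encoding(name):
    """Return True if Python has a codec registered under this name.

    Assumption: Python's codec aliases approximate, but are not the
    same as, the IANA character set registry.
    """
    try:
        codecs.lookup(name)
    except LookupError:
        return False
    return True


def check_metadata(lang, encoding):
    """Return a list of problems found in a (lang, encoding) pair."""
    problems = []
    if not is_valid_lang(lang):
        problems.append("bad language tag: %r" % lang)
    if not is_known_encoding(encoding):
        problems.append("unknown encoding: %r" % encoding)
    return problems


if __name__ == "__main__":
    # The example values from the proposal above.
    print(check_metadata("en-US", "ISO_8859-1"))
    print(check_metadata("ja-JP", "EUC-JP"))
    # An underscore is not allowed in an RFC 1766 tag.
    print(check_metadata("en_US", "no-such-charset"))
```

The same check would apply unchanged to the ItemLang and ItemEncoding
pair, since both pairs carry the same kind of values.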

I don't know where the first one (metadata for metadata) should go.
It doesn't seem to fit into any of the 'related data' discussions.
Lisa, going back to your first mail, where did you want to put
the language information?

yuta

 > From: lisa.carnahan@nist.gov
 > Date: Thu, 30 Mar 2000 15:35:55 -0500
 > 
 > So what is the answer?  Can someone make a proposal?
 > 
 > --lisa
 > 
 > At 01:43 PM 03/30/2000 -0600, you wrote:
 > >
 > >
 > >On Thu, 30 Mar 2000, Yutaka Yoshida wrote:
 > >
 > >> 
 > >>  > Date: Thu, 30 Mar 2000 12:49:53 -0600 (CST)
 > >>  > From: Robin Cover <robin@isogen.com>
 > >>  > 
 > >>  > 
 > >>  > The designation of language encoding for machine purposes is even
 > >>  > more critical, as we all know.  Here, it's necessary to isolate
 > >>  > language from script (Hebrew can be written in Arabic), and other
 > >>  > aspects of writing systems.
 > >>  
 > >>  Sorry, I don't understand what you said. Could you explain a little more?
 > >>  What I meant by encoding was 'encoding scheme', such as iso8859-1,
 > >>  eucjp, gb2312, etc. In that sense, for a computational purpose,
 > >>  it doesn't matter what script is used. Hebrew is 8859-8 and Arabic is
 > >>  8859-6, so we can process the content correctly if we knew those
 > >>  encodings.
 > >> 
 > >>  regards,
 > >>  yuta
 > >> 
 > >This is probably off topic.  I'm talking about natural language
 > >processing based upon linguistic features of written text.  When
 > >a word/phrase is transliterated or borrowed from one
 > >language into another (as when a Hebrew word is written in
 > >Arabic script), the word/phrase in the new context has
 > >linguistic properties that cannot be deduced from the encoding
 > >or script.  While simple display might work (direction of
 > >character flow, kerning, etc.), other processing would
 > >fail (correct word wrap, spell checking, thesaurus, and
 > >so forth).  In brief: a script or encoding does not always
 > >tell you what language the text is "in".  This is why (mere)
 > >"localization" does not work, of itself, in a
 > >multilingual setting.  Multilingualism necessitates true
 > >linguistic knowledge, where "internationalization"
 > >often (believes it) does not.
 > >
 > >Robin Cover
 > >
 > >
Follow-Ups:
- Re: natural language
  - From: Len Gallagher <LGallagher@nist.gov>