
Subject: Re: [uima] Type System Base Model v.4


Thanks, Adam.

I hadn’t considered the question you raise, so I do appreciate it.

I’d be inclined to require the use of “real” Unicode character offsets for standardization purposes.

I think for Apache UIMA to be compliant, it would just need to accept/handle CASes with Unicode offsets, and whatever happens “under the hood” is not up to us, is it?

Karin


On 4/2/07 10:12 AM, "Adam Lally" <alally@us.ibm.com> wrote:


Karin,

Overall looks good.  I have a comment about the TextAnnotation and TextRegionalReference definitions.

There are some tricky details regarding a standard definition of beginChar and endChar.  What exactly do these values represent: Unicode character offsets?  Note that Apache UIMA does not use Unicode character offsets, but rather "UTF-16 code unit" offsets; that is, each element is a 16-bit value like a char in a Java String.  This has the advantage of allowing constant-time indexing (e.g. for getCoveredText).  Unfortunately, not all Unicode characters can be represented in 16 bits, so sometimes a single Unicode character is split across two 16-bit values.  In those cases, Apache UIMA's offsets differ from true Unicode character offsets.  This is especially inconvenient for anyone using UTF-8 rather than UTF-16 (for example, Perl annotators).
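
To make the distinction concrete, here is a minimal Java sketch of how the two kinds of offsets diverge once a string contains a character outside the 16-bit range. The class and method names (OffsetDemo, codeUnitLength, codePointLength, utf8Length) are hypothetical and are not part of Apache UIMA; only standard java.lang.String calls are used.

```java
import java.nio.charset.StandardCharsets;

class OffsetDemo {
    // Length in UTF-16 code units: this is what String.length() reports.
    static int codeUnitLength(String s) {
        return s.length();
    }

    // Length in Unicode code points ("real" Unicode characters).
    static int codePointLength(String s) {
        return s.codePointCount(0, s.length());
    }

    // Length in bytes when the same text is encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // 'a', then U+1D11E (MUSICAL SYMBOL G CLEF, stored as a surrogate pair), then 'b'.
        String text = "a\uD834\uDD1Eb";

        System.out.println(codeUnitLength(text));   // 4
        System.out.println(codePointLength(text));  // 3
        System.out.println(utf8Length(text));       // 6

        // The offset of 'b' depends on which unit is counted.
        int unitOffset = text.indexOf('b');                    // 3 (UTF-16 code units)
        int pointOffset = text.codePointCount(0, unitOffset);  // 2 (code points)
        System.out.println(unitOffset + " vs " + pointOffset);
    }
}
```

For text that stays within the 16-bit range the two counts agree, so the difference only shows up when supplementary characters are present.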

I'm not sure what the right answer is, but using "real" Unicode character offsets seems like a more appropriate use of the Unicode standard.  Then, for Apache UIMA to be compliant, does it just need to accept XMI CASes with Unicode offsets and convert them to its own internal representation?  Or would the internal representation actually have to change (which would be ugly)?
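
The "convert on the way in" option could look roughly like the following Java sketch, which assumes the document text is available as a Java String when offsets are translated. OffsetConverter and its methods are hypothetical names for illustration, not Apache UIMA API; the conversions rely on the standard String.offsetByCodePoints and String.codePointCount methods.

```java
class OffsetConverter {
    // Convert a Unicode code point offset (as a standard might define beginChar/endChar)
    // into a UTF-16 code unit offset usable with String.substring.
    static int toCodeUnitOffset(String text, int codePointOffset) {
        return text.offsetByCodePoints(0, codePointOffset);
    }

    // Convert a UTF-16 code unit offset back into a Unicode code point offset,
    // for example when writing offsets out in the standard form.
    static int toCodePointOffset(String text, int codeUnitOffset) {
        return text.codePointCount(0, codeUnitOffset);
    }

    // Covered text for a span given in code point offsets.
    static String coveredText(String text, int beginChar, int endChar) {
        int begin = toCodeUnitOffset(text, beginChar);
        int end = toCodeUnitOffset(text, endChar);
        return text.substring(begin, end);
    }

    public static void main(String[] args) {
        // 'a', then U+1D11E (stored as a surrogate pair), then 'b'.
        String text = "a\uD834\uDD1Eb";
        System.out.println(toCodeUnitOffset(text, 2));    // 3
        System.out.println(toCodePointOffset(text, 3));   // 2
        System.out.println(coveredText(text, 2, 3));      // b
    }
}
```

Each conversion scans the text from the start, so it is linear in the offset rather than constant-time. Translating all offsets once at the import/export boundary would leave constant-time indexing on the internal representation untouched.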

Regards,
  -Adam
_____________________________
 Adam Lally
 Advisory Software Engineer
 UIMA Framework Lead Developer
 IBM T.J. Watson Research Center
 Hawthorne, NY, 10532
 Tel: 914-784-7706,  T/L: 863-7706
 alally@us.ibm.com


Karin Verspoor <verspoor@lanl.gov> wrote on 03/30/2007 01:50 PM:

To: "uima@lists.oasis-open.org" <uima@lists.oasis-open.org>
Subject: [uima] Type System Base Model v.4




Please find attached the reviewed document outlining
changes/proposals/action plan for the Type System Base Model.

Comments welcome.

Karin

_______________________________________________________________
Karin Verspoor, Computational Linguist
Knowledge and Information Systems Science team
Computer, Computation & Statistics division
http://public.lanl.gov/verspoor
email: verspoor@lanl.gov   Mail: Los Alamos National Laboratory
phone: 505-667-5086              PO Box 1663, MS B256
fax:   505-667-1126              Los Alamos, NM 87545
_______________________________________________________________


[attachment "TypeSystemBaseModel_v4.doc" deleted by Adam Lally/Watson/IBM]





