Subject: Re: [uima] Type System Base Model v.4
- From: Adam Lally <alally@us.ibm.com>
- To: Karin Verspoor <verspoor@lanl.gov>
- Date: Wed, 11 Apr 2007 13:50:35 -0400
I think I agree with that - as far as
the compliance with the spec goes, Apache UIMA just has to be able to ingest
and produce XMI CASes with true Unicode offsets.
However, I think this means that when
a UIMA Annotator calls the Annotation.getBegin() method, they will not
necessarily get the same value as if they looked at the XMI serialization
and checked the value of the begin attribute of that same annotation. That
might be a little confusing.
Perhaps it's good that the spec now names
the attributes beginChar and endChar rather than begin and end. Apache
UIMA could introduce a new method Annotation.getBeginChar() which could
always return the true Unicode character offset? Maybe that helps.
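A minimal sketch of the conversion such a method would need, assuming the document text is held in a Java String. The class and method names here (OffsetConverter, toCodePointOffset, toCodeUnitOffset) are hypothetical and are not part of Apache UIMA; only the standard String.codePointCount and String.offsetByCodePoints calls are used.

```java
// Hypothetical helper, not an Apache UIMA API: converts between
// UTF-16 code unit offsets and Unicode code point offsets.
class OffsetConverter {

    // UTF-16 code unit offset -> Unicode code point offset.
    // This is what a getBeginChar()-style accessor could return.
    static int toCodePointOffset(String text, int codeUnitOffset) {
        return text.codePointCount(0, codeUnitOffset);
    }

    // Unicode code point offset -> UTF-16 code unit offset.
    // This is the direction needed when reading offsets from a serialized CAS.
    static int toCodeUnitOffset(String text, int codePointOffset) {
        return text.offsetByCodePoints(0, codePointOffset);
    }

    public static void main(String[] args) {
        // U+1D538 lies outside the 16-bit range, so Java stores it
        // as the surrogate pair \uD835\uDD38.
        String text = "a\uD835\uDD38bc";
        int begin = text.indexOf("bc");                      // UTF-16 code unit offset
        System.out.println(begin);                           // prints 3
        System.out.println(toCodePointOffset(text, begin));  // prints 2
        System.out.println(toCodeUnitOffset(text, 2));       // prints 3
    }
}
```

Both conversions scan the text from the start, so they are linear in the offset rather than constant-time; an implementation that converts many offsets would probably want to cache or batch the work.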
Regards,
-Adam
_____________________________
Adam Lally
Advisory Software Engineer
UIMA Framework Lead Developer
IBM T.J. Watson Research Center
Hawthorne, NY, 10532
Tel: 914-784-7706, T/L: 863-7706
alally@us.ibm.com
Karin Verspoor <verspoor@lanl.gov>
04/10/2007 06:03 PM
To: Adam Lally/Watson/IBM@IBMUS
cc: "uima@lists.oasis-open.org" <uima@lists.oasis-open.org>, <carl.madson@sri.com>, <Sophia.ananiadou@manchester.ac.uk>, <j.tsujii@manchester.ac.uk>
Subject: Re: [uima] Type System Base Model v.4
Thanks, Adam.
I hadn’t considered the question you raise, so I do appreciate it.
I’d be inclined to require the use of “real” Unicode character offsets
for standardization purposes.
I think for Apache UIMA to be compliant, it would just need to accept/handle
CASes with Unicode offsets, and whatever happens “under the hood” is
not up to us, is it?
Karin
On 4/2/07 10:12 AM, "Adam Lally" <alally@us.ibm.com> wrote:
Karin,
Overall looks good. I have a comment about the TextAnnotation and
TextRegionalReference definitions.
There are some tricky details regarding a standard definition of beginChar,
endChar. What exactly do these values represent - Unicode character
offsets? Note that Apache UIMA does not use Unicode character offsets,
but rather "UTF-16 code unit" offsets - that is, each element
is a 16-bit value like a char in a Java String. This has the advantage
of allowing constant-time indexing (e.g. for getCoveredText). Unfortunately
not all Unicode characters can be represented in 16 bits, so sometimes
a single Unicode character is split across two 16-bit values. Apache
UIMA doesn't use true Unicode character offsets in those cases. This
is especially inconvenient for anyone using UTF-8 rather than UTF-16 (for
example, Perl annotators).
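To make the mismatch concrete, here is a small sketch in plain Java (no UIMA classes; the class name OffsetMismatchDemo and the sample text are made up for illustration). It shows a string in which the UTF-16 code unit count and the Unicode character count differ, so any offset after the supplementary character differs as well.

```java
// Illustration only: UTF-16 code unit offsets vs. Unicode code point offsets.
class OffsetMismatchDemo {

    // One supplementary character, U+1D538, which Java stores as the
    // surrogate pair \uD835\uDD38, surrounded by ASCII letters.
    static final String TEXT = "a\uD835\uDD38bc";

    // Length in UTF-16 code units (what String.length() and charAt() use).
    static int codeUnitLength(String s) {
        return s.length();
    }

    // Length in Unicode code points ("real" characters).
    static int codePointLength(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        System.out.println(codeUnitLength(TEXT));   // prints 5
        System.out.println(codePointLength(TEXT));  // prints 4

        // "bc" begins at code unit offset 3 but at code point offset 2.
        int beginUnits = TEXT.indexOf("bc");
        int beginChars = TEXT.codePointCount(0, beginUnits);
        System.out.println(beginUnits);             // prints 3
        System.out.println(beginChars);             // prints 2
    }
}
```

A UTF-8 based tool that counts characters would report the second pair of numbers, which is where the two views of the same annotation would disagree.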
I'm not sure what the right answer is, but using "real" Unicode
character offsets seems like it's a more appropriate use of the Unicode
standard. Then, for Apache UIMA to be compliant, does it just need
to accept XMI CASes with Unicode offsets and convert them to its own internal
representation? Or would the internal representation have to actually
change (which would be ugly)?
Regards,
-Adam
_____________________________
Adam Lally
Advisory Software Engineer
UIMA Framework Lead Developer
IBM T.J. Watson Research Center
Hawthorne, NY, 10532
Tel: 914-784-7706, T/L: 863-7706
alally@us.ibm.com
Karin Verspoor <verspoor@lanl.gov>
03/30/2007 01:50 PM
To: "uima@lists.oasis-open.org" <uima@lists.oasis-open.org>
Subject: [uima] Type System Base Model v.4
Please find attached the reviewed document outlining
changes/proposals/action plan for the Type System Base Model.
Comments welcome.
Karin
_______________________________________________________________
Karin Verspoor, Computational Linguist
Knowledge and Information Systems Science team
Computer, Computation & Statistics division
http://public.lanl.gov/verspoor
email: verspoor@lanl.gov    Mail: Los Alamos National Laboratory
phone: 505-667-5086               PO Box 1663, MS B256
fax: 505-667-1126                 Los Alamos, NM 87545
_______________________________________________________________
[attachment "TypeSystemBaseModel_v4.doc" deleted by Adam Lally/Watson/IBM]