uima message

Subject: Re: [uima] [Type System Base Model subgroup]

From: Adam Lally <alally@us.ibm.com>
To: Karin Verspoor <verspoor@lanl.gov>
Date: Wed, 21 Feb 2007 14:42:28 -0500

This particular implementation issue relating to indexing of non-annotation types will be fixed in version 2.1 to be released soon from Apache.

The intention of Annotation is to describe a region of the subject of analysis. I think the whitepaper may be a little unclear on whether Annotations must have a regional reference, but in the UML diagram in figure 7 it is suggested that annotations do have a RegionalReference, where the definition of RegionalReference is open, so that it can be more than just begin, end offsets.

Other types of objects in the CAS may be "subordinate" objects to which annotations can refer, or higher level structures that refer to multiple annotations (for example a single "Entity" object representing a particular person, which refers to multiple annotation objects, one for each mention of that person).
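To make the "Entity" example concrete, here is a minimal sketch in Python. It is not UIMA's API: the class names `Annotation` and `Entity`, their fields, and the sample text are all hypothetical, chosen only to show one higher-level object referring to several span annotations.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Annotation:
    """Describes a region of the subject of analysis (hypothetical toy type)."""
    begin: int
    end: int  # exclusive

    def covered_text(self, sofa: str) -> str:
        return sofa[self.begin:self.end]


@dataclass
class Entity:
    """Higher-level object: not an annotation itself, but refers to several."""
    name: str
    mentions: List[Annotation] = field(default_factory=list)


sofa = "Ada Lovelace wrote notes. Lovelace published them in 1843."

# One Entity for the person, one Annotation per mention of that person.
first = sofa.index("Ada Lovelace")
second = sofa.index("Lovelace", first + len("Ada Lovelace"))
entity = Entity("Ada Lovelace")
entity.mentions.append(Annotation(first, first + len("Ada Lovelace")))
entity.mentions.append(Annotation(second, second + len("Lovelace")))

mention_texts = [m.covered_text(sofa) for m in entity.mentions]
print(mention_texts)
```

In this sketch only the mentions carry offsets; the Entity reaches the text indirectly through them.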

Regards,
-Adam
_____________________________
Adam Lally
Advisory Software Engineer
UIMA Framework Lead Developer
IBM T.J. Watson Research Center
Hawthorne, NY, 10532
Tel: 914-784-7706, T/L: 863-7706
alally@us.ibm.com

Karin Verspoor <verspoor@lanl.gov>

02/21/2007 02:04 PM

To	David Ferrucci/Watson/IBM@IBMUS, Scott Songlin Piao <scott.piao@manchester.ac.uk>
cc	"uima@lists.oasis-open.org" <uima@lists.oasis-open.org>
Subject	Re: [uima] [Type System Base Model subgroup]

I think an Annotation that does not refer to a span is essentially meta-data associated with the artifact. In that sense, it is like any other object in the CAS (including “normal” Annotations). But what other types of things would be in the CAS?

Note that the spec currently does not require Annotations to have spans.

In the existing implementation of UIMA (which is not necessarily the same as what is described in the spec), a problem with creating a type that extends from “Top” and not from “Annotation” is that objects of that type are not automatically indexed when addToIndexes() is called, and are therefore not easily accessible from the CAS. This means that to keep track of objects like Keyterms that are not associated with a span, you have to define a special index that is explicitly referenced by an Analysis Engine. It would be cleaner not to have to treat non-span-associated meta-data (artifact-level markup) differently from span-associated meta-data (span-level markup), as is currently required.
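The "cleaner" behaviour asked for here can be sketched with a toy model. This is not how the UIMA index repository is implemented; `ToyCas`, `add_to_indexes`, `select`, and the type classes are hypothetical names used only to show span-level and artifact-level objects being indexed and retrieved the same way.

```python
from collections import defaultdict


class Top:
    """Root of the toy type hierarchy (stands in for a root type like Top)."""


class Annotation(Top):
    """Span-level markup."""
    def __init__(self, begin, end):
        self.begin, self.end = begin, end


class Keyterm(Top):
    """Artifact-level markup: no span at all."""
    def __init__(self, term):
        self.term = term


class ToyCas:
    def __init__(self):
        self._by_type = defaultdict(list)

    def add_to_indexes(self, fs):
        # Index the object under its own type and every supertype up to Top,
        # regardless of whether it carries offsets.
        for cls in type(fs).__mro__:
            if issubclass(cls, Top):
                self._by_type[cls].append(fs)

    def select(self, cls):
        return list(self._by_type[cls])


cas = ToyCas()
cas.add_to_indexes(Annotation(0, 5))
cas.add_to_indexes(Keyterm("type system"))

terms = [k.term for k in cas.select(Keyterm)]
print(terms, len(cas.select(Top)))
```

With a type-keyed index like this, an analysis engine would not need a specially declared index just to find Keyterm objects again.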

Karin

On 2/21/07 11:47 AM, "David Ferrucci" <ferrucci@us.ibm.com> wrote:

we have traditionally reserved Annotation to be an object in the CAS that does refer to a span (or more generally, has a regional reference, which can refer to an arbitrarily defined region -- not just a span)

our practice has been that if an object in the CAS does not label/"annotate" a span (or more generally a region), we typically do not make it an Annotation.

Objects that describe extracted elements, rather than explicit regions of a sofa, for example, may refer to them through arbitrary features, but we have not thought of these as "annotations"

I am not sure what a more general Annotation type would mean if it does not explicitly refer to some region in the artifact which it 'annotates' - it seems that at that level of description, it would be indistinguishable from any other object in the CAS.

referring to discontinuous regions is a requirement that has come up from time to time, and I believe the current draft does not discuss it. I think it is a worthy requirement needing further discussion.
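One possible shape for a discontinuous regional reference, sketched in Python. The draft does not define this; the `DiscontinuousRegion` class, its `spans` field, and the sample sentence are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class DiscontinuousRegion:
    """A region made of several (begin, end) spans; end is exclusive."""
    spans: List[Tuple[int, int]]

    def covered_text(self, sofa: str, joiner: str = " ") -> str:
        # Read the pieces in document order, whatever order they were added in.
        return joiner.join(sofa[b:e] for b, e in sorted(self.spans))


sofa = "He looked the word up."
begin_looked = sofa.index("looked")
begin_up = sofa.index("up")

# The phrasal verb "looked ... up" occupies two separate stretches of text.
region = DiscontinuousRegion([
    (begin_up, begin_up + len("up")),
    (begin_looked, begin_looked + len("looked")),
])
phrasal_verb = region.covered_text(sofa)
print(phrasal_verb)
```

A single begin/end pair is then just the special case of a region with one span.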

-dave

------------------------------------------------------------------------
David A. Ferrucci, PhD
Senior Manager, Semantic Analysis & Integration
Chief Architect, UIMA
IBM T.J. Watson Research Center
19 Skyline Drive, Hawthorne, NY 10532
Tel: 914-784-7847, 8/863-7847
ferrucci@us.ibm.com
------------------------------------------------------------------------
http://www.ibm.com/research/uima

"Scott Songlin Piao" <scott.piao@manchester.ac.uk>

02/21/2007 05:40 AM

To	"Karin Verspoor" <verspoor@lanl.gov>, "thomas.hampp@de.ibm.com" <thomas.hampp@de.ibm.com>, "kano@is.s.u-tokyo.ac.jp" <kano@is.s.u-tokyo.ac.jp>, "pascal.coupet@temis.com" <pascal.coupet@temis.com>
cc	"uima@lists.oasis-open.org" <uima@lists.oasis-open.org>
Subject	RE: [uima] [Type System Base Model subgroup]

>.. there will be Annotations that don't correspond to explicit
>individual text spans, e.g. document keyterms that are extracted through
>statistical analysis of the document as a whole (where one may not wish to
>annotate each individual instance of that keyterm in the document)

This is true, and I am experiencing this problem myself now. I am wrapping a term extractor (named Termine) into UIMA, and have found it problematic to assign offsets to the terms extracted from a document (or documents). Because the terms are identified using co-occurrence statistical information, the terms extracted do not necessarily correspond to every identical text string/text span occurring in the document(s). If the terms include discontinuous multiword terms, things can be worse.

So I think the top Annotation type should not have offset features, and a subtype can be defined including offsets.
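The proposal can be sketched as a small type hierarchy in Python. This is only an illustration of the idea, not the spec's type system: `SpanAnnotation`, `Keyterm`, the `sofa_id` and `score` fields, and the sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    """Top annotation type: metadata about an artifact, with no offsets."""
    sofa_id: str


@dataclass
class SpanAnnotation(Annotation):
    """Subtype that adds offsets for annotations tied to a text span."""
    begin: int
    end: int


@dataclass
class Keyterm(Annotation):
    """Document-level term from statistical analysis; carries no offsets."""
    term: str
    score: float


keyterm = Keyterm(sofa_id="doc1", term="type system", score=0.87)
token = SpanAnnotation(sofa_id="doc1", begin=0, end=4)

# Both are Annotations, but only the span subtype has begin/end features.
print(isinstance(keyterm, Annotation), hasattr(keyterm, "begin"))
print(isinstance(token, Annotation), hasattr(token, "begin"))
```

A term extractor could then emit Keyterm objects directly, without inventing offsets for terms that have no single corresponding span.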

Cheers

Scott
-----------------------------------
Dr. Scott Piao
NaCTeM & School of computer Science
University of Manchester
UK

-----Original Message-----
From: Karin Verspoor [mailto:verspoor@lanl.gov]
Sent: 21 February 2007 01:16
To: thomas.hampp@de.ibm.com; kano@is.s.u-tokyo.ac.jp; pascal.coupet@temis.com
Cc: uima@lists.oasis-open.org
Subject: [uima] [Type System Base Model subgroup]

Following in Adam's footsteps, I'd like to kick off the discussion for the
Type System Base Model subgroup.

According to
http://www.oasis-open.org/apps/org/workgroup/uima/download.php/22325/UIMA%20TC%20Sub-Groups_v2.pdf,
the members are:

Thomas
Kano
Pascal
Karin (me)

We are due to report on March 2, which doesn't give us much time for
discussion or preparing our report. So let's get started! Below you will
find some initial thoughts I have had, though I'm sure there are more things
to discuss and in fact I haven't quite finished getting my thoughts down.
I'll ask that everyone who is listed above respond to these and bring up new
issues/discussion points by Thursday 2/22. We can then use these to develop
the action plan for the spec element.

As a reminder, we are tasked with the following:

===================================
Sub-Groups reports should include:
===================================

1. Goals of spec element. What is it trying to achieve in terms of
interoperability?
2. Overall Critique of section. High-level summary of findings. How good/bad
is it in meeting goals. What's the damage? Looks good, just needs some
wordsmithing, has some serious conceptual issues, etc.

3. "Votable" issues. Crisp decisions the TC should vote on required to
harden/complete spec element.
4. Open-Issues. Issues that need extended discussion to resolve
5. List of compliance points. (What aspects of this spec element "can" or
"must" be adhered to in order to be "compliant"?)
6. Action Plan. Very important -- List of tasks required to bring spec
element to completion.

===================================
Goals of spec element
===================================

Section 5.3 of the specification aims to define the set of predefined types
that are assumed to be available to any UIMA-compliant analytic or system.

The spec adopts the primitive types defined by Ecore, covering String,
Boolean, Byte, Short, Int, Long, Float and Double. The main primitive Java
type missing from this list is Char, but this is not defined by Ecore and
can be handled using Int.
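A minimal sketch of handling a character via Int, in Python. It assumes the integer holds a Unicode code point; the spec text discussed here does not say whether code points or UTF-16 code units are intended, and the helper names are hypothetical.

```python
def char_to_int(ch: str) -> int:
    """Encode a single character as an integer (assumed: its code point)."""
    if len(ch) != 1:
        raise ValueError("expected exactly one character")
    return ord(ch)


def int_to_char(code: int) -> str:
    """Decode an integer produced by char_to_int back into a character."""
    return chr(code)


code = char_to_int("A")
round_trip = int_to_char(char_to_int("\u00e9"))  # e with acute accent
print(code, round_trip)
```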

===================================
Critique of section
===================================

I believe that this is the first place in the spec where Annotations are
really defined. However, some details are left unspecified and I think the
section as a whole could be reorganized to clarify the data model for
Annotations -- to formalize what is discussed in 5.1.3. Note that really,
5.1.3 and the whole of 5.3 need to be worked on together.

C_1. For instance, in the intro to section 5, it is stated that "we defined
a type Annotation to represent objects that have regional references (e.g.,
offsets) into the value of an attribute of another object." However, the
section itself doesn't specify explicitly, right at the front, that
Annotations should have begin/end features to indicate those offsets (they
are used in the example, and this does come in 5.3.4.1), and it doesn't
clearly specify what those offsets should be relative to (see C_2 and OI_1).
C_1a. I don't think it is appropriate for all Annotations to have
begin/end features, and I do prefer the approach of adding these in via
subtyping. This is because there will be Annotations that don't correspond
to explicit individual text spans, e.g. document keyterms that are extracted
through statistical analysis of the document as a whole (where one may not
wish to annotate each individual instance of that keyterm in the document)
or other meta-data that is associated with a document through holistic
analysis of the document.

C_2. In the case where an Annotation points into another Annotation (e.g.
the Clause within a Quotation example in 5.3.3.1) via a LocalSofaReference,
are those offsets relative to the Local Sofa (Quotation), or the document as
a whole? These are two different choices, and which is selected for the
standard should be spelled out. If the offsets are relative to the span of
the other Annotation, there will be some "offset management" that needs to
be done to map offsets back to document spans. And what if the "source"
Annotation doesn't have document spans specified? [See also OI_1, which is
germane here.]
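The "offset management" mentioned in C_2 could look like the following sketch, assuming offsets are relative to the containing Annotation and the container itself has document offsets. The `Span` class, the `to_document_offsets` helper, and the sample sentence are hypothetical and not taken from the spec.

```python
from dataclasses import dataclass


@dataclass
class Span:
    begin: int
    end: int  # exclusive


def to_document_offsets(local: Span, container: Span) -> Span:
    """Map a span given relative to `container` back to document offsets."""
    length = container.end - container.begin
    if not (0 <= local.begin <= local.end <= length):
        raise ValueError("local span does not fit inside its container")
    return Span(container.begin + local.begin, container.begin + local.end)


document = 'She said, "I will go if it rains," and left.'

# The Quotation is the local sofa; its own offsets are document-relative.
quotation = Span(document.index('"') + 1, document.rindex('"'))
quote_text = document[quotation.begin:quotation.end]

# The Clause is located relative to the Quotation, not the document.
clause_begin = quote_text.index("if it rains")
clause_local = Span(clause_begin, clause_begin + len("if it rains"))

clause_doc = to_document_offsets(clause_local, quotation)
print(document[clause_doc.begin:clause_doc.end])
```

If the containing Annotation has no document span, this mapping has nothing to add to, which is exactly the case the last question in C_2 raises.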

C_3. Examples should use features introduced elsewhere, e.g. "beginChar"
and "endChar" rather than "begin" and "end".

===================================
Votable issues
===================================

(I think we should identify these after we discuss the other sections)

===================================
Open issues
===================================

OI_1. I believe that there is some confusion in the document about the use
of the concept of a "sofa" (Subject of Analysis). In Section 5.1.3, sofas
are defined as the content an annotation refers to. This is also how it is
used in 5.3.2, where Annotations have sofa features. In other text, e.g.
5.3.4.1, and in the current UIMA SDK Reference manual, a sofa refers
generically to a view of a document (e.g. a textual transcription of an audio
stream, an English translation of a French document, or an ASCII text
version of a PDF document). Then 5.3.4.2 explicitly introduces the concept
of a "View". Perhaps what is intended is that an Annotation refers to a span
within a specific view. Does this then imply that a sofaObject in a
SofaReference isn't just identified through a single identifier (id="1" in
the Quotation example), but through a compound of View name and sofaObject
ID? But this would require individual SofaReferences to know which Views
they are a part of, rather than the View pointing to the relevant collection
of CAS objects. As you can tell, I'm not sure I actually have a problem
with what is currently proposed in the spec -- it may be the cleanest way to
handle these different facets (in particular because it allows an object in
one View to reference objects in other Views). But we should work through
it, and clarify the descriptions throughout the spec to consistently use the
terminology "sofa" and "view", and to make sure that offsets are handled
appropriately per discussion above. The confusion is perhaps highlighted by
the discussion in section 5.3.4.3 in which an AnchoredView is tied to a
specific SofaReference -- again, is it the sofa or the view that captures
the metadata of a particular perspective of a document?

OI_2. There is an open issue explicitly called out in the text of 5.3.4.1:
whether to define a "RegionalReference" type or to subtype Annotation with
different regional reference mechanisms. After discussion, the authors of
the section suggest including an abstract RegionalReference in the base
type system, but not mandate it. This is probably a sensible approach. If
we all agree with that, then we'll have to rewrite the section somewhat to
clarify this decision. Of course, if the type system mirrored the Java
class/interface model and thereby supported a kind of multiple inheritance,
some of the disadvantages of subtyping would go away... But I guess that's
too big of a change.

OI_3. We should discuss Footnote 4 on p. 39.

OI_4. We should also discuss Source Document Information, in 5.3.4.4. I
have thoughts here, but not enough time right now to spell them out :).

OI_5. Is there anything missing in this section that we'd like to see added?
Other things that fall into the category of "suggested other types we found
useful"?

===================================
Compliance points
===================================

There are currently three candidate compliance points relevant for the
section:

5.1.3: A UIMA component/framework may be “annotation model compliant” if it
uses this definition by the UIMA Type-System base model.

5.3.1: A compliant UIMA component/framework may be required to understand
this set of primitive types, and may be required to treat EObject as the
superclass of all classes.

5.3.3: A UIMA component/framework that is "annotation model compliant" may
be required to adhere to the constraint that all Annotation objects must
have a sofa slot that holds a reference to either a LocalSofaReference or a
RemoteSofaReference.

Comments:

Apart from the obvious language change from "may be required" to "is
required" and the removal of the attribution "candidate", there are some
other issues here.

CP_1. I do think that EObject should be assumed to be the superclass of all
classes. It is important to have a single common superclass for ease of
programming.

CP_2. The 5.1.3 compliance point should be more explicit about what "this
definition" precisely refers to.

_______________________________________________________________
Karin Verspoor, Computational Linguist
Knowledge and Information Systems Science team
Computer, Computation & Statistics division
http://public.lanl.gov/verspoor
email: verspoor@lanl.gov Mail: Los Alamos National Laboratory
phone: 505-667-5086 PO Box 1663, MS B256
fax: 505-667-1126 Los Alamos, NM 87545
_______________________________________________________________

Follow-Ups:
- RE: [uima] [Type System Base Model subgroup]
  - From: "Pascal Coupet" <pascal.coupet@temis.com>

References:
- Re: [uima] [Type System Base Model subgroup]
  - From: Karin Verspoor <verspoor@lanl.gov>