Subject: RE: [uima] [Type System Base Model subgroup]
From: "Scott Songlin Piao" <scott.piao@manchester.ac.uk>
To: "Karin Verspoor" <verspoor@lanl.gov> , "thomas.hampp@de.ibm.com"<thomas.hampp@de.ibm.com> , "kano@is.s.u-tokyo.ac.jp"<kano@is.s.u-tokyo.ac.jp> , "pascal.coupet@temis.com"<pascal.coupet@temis.com>
Date: Wed, 21 Feb 2007 10:40:58 +0000

>.. there will be Annotations that don't correspond to explicit 
>individual text spans, e.g. document keyterms that are extracted through 
>statistical analysis of the document as a whole (where one may not wish to 
>annotate each individual instance of that keyterm in the document)

This is true, and I am experiencing this problem myself now. I am wrapping a term extractor (named Termine) for UIMA, and have found it problematic to assign offsets to the terms extracted from a document (or documents). Because the terms are identified using statistical co-occurrence information, an extracted term does not necessarily correspond to every identical text string/text span occurring in the document(s). If the terms include discontinuous multiword terms, things get worse.

So I think the top Annotation type should not have offset features, and a subtype can be defined including offsets.

Cheers

Scott
-----------------------------------
Dr. Scott Piao
NaCTeM & School of Computer Science
University of Manchester
UK




-----Original Message-----
From: Karin Verspoor [mailto:verspoor@lanl.gov] 
Sent: 21 February 2007 01:16
To: thomas.hampp@de.ibm.com; kano@is.s.u-tokyo.ac.jp; pascal.coupet@temis.com
Cc: uima@lists.oasis-open.org
Subject: [uima] [Type System Base Model subgroup]

Following in Adam's footsteps, I'd like to kick off the discussion for the
Type System Base Model subgroup.

According to
http://www.oasis-open.org/apps/org/workgroup/uima/download.php/22325/UIMA%20TC%20Sub-Groups_v2.pdf,
the members are:

Thomas
Kano
Pascal
Karin (me)

We are due to report on March 2, which doesn't give us much time for
discussion or preparing our report.  So let's get started! Below you will
find some initial thoughts I have had, though I'm sure there are more things
to discuss and in fact I haven't quite finished getting my thoughts down.
I'll ask that everyone who is listed above respond to these and bring up new
issues/discussion points by Thursday 2/22.  We can then use these to develop
the action plan for the spec element.

As a reminder, we are tasked with the following:

===================================
Sub-Groups reports should include:
===================================

1. Goals of spec element. What is it trying to achieve in terms of
interoperability? 
2. Overall Critique of section. High-level summary of findings. How good/bad
is it in meeting goals. What's the damage? Looks good, just needs some
wordsmithing, has some serious conceptual issues, etc.

3. "Votable" issues. Crisp decisions the TC should vote on required to
harden/complete spec element.
4. Open-Issues. Issues that need extended discussion to resolve
5. List of compliance points. (What aspects of this spec element "can" or
"must" be adhered to in order to be "compliant")
6. Action Plan. Very important -- List of tasks required to bring spec
element to completion.



===================================
Goals of spec element
===================================

Section 5.3 of the specification aims to define the set of predefined types
that are assumed to be available to any UIMA-compliant analytic or system.

The spec adopts the primitive types defined by Ecore, covering String,
Boolean, Byte, Short, Int, Long, Float and Double.  The main primitive Java
type missing from this list is Char, but this is not defined by Ecore and
can be handled using Int.
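
To illustrate the last point: a Java char widens to int without loss, so an Int-typed feature can carry it.  The following is a minimal sketch; the class and method names are hypothetical and not part of the spec.

```java
// Hypothetical helper (not from the spec): carries a Java char in an
// Int-typed feature value.
final class CharAsInt {
    private CharAsInt() {}

    // Widening char -> int is lossless (char is an unsigned 16-bit value).
    static int encode(char c) {
        return c;
    }

    // Narrowing back is only safe for values in the char range 0..65535.
    static char decode(int value) {
        if (value < Character.MIN_VALUE || value > Character.MAX_VALUE) {
            throw new IllegalArgumentException("not a char value: " + value);
        }
        return (char) value;
    }
}
```

The range check in decode is the only extra work: any Int outside the 16-bit range was never a valid char to begin with.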

===================================
Critique of section
===================================

I believe that this is the first place in the spec where Annotations are
really defined.  However, some details are left unspecified and I think the
section as a whole could be reorganized to clarify the data model for
Annotations -- to formalize what is discussed in 5.1.3.  Note that really,
5.1.3 and the whole of 5.3 need to be worked on together.

C_1. For instance, the intro to section 5 states that "we defined a type
Annotation to represent objects that have regional references (e.g.,
offsets) into the value of an attribute of another object."  However, the
section itself doesn't specify up front that Annotations should have
begin/end features to indicate those offsets; they are used in the example,
but only introduced in 5.3.4.1.  It also doesn't clearly specify what those
offsets should be relative to (see C_2 and OI_1).
  C_1a. I don't think it is appropriate for all Annotations to have
begin/end features, and I prefer the approach of adding them via
subtyping.  This is because there will be Annotations that don't correspond
to explicit individual text spans, e.g. document keyterms that are extracted
through statistical analysis of the document as a whole (where one may not
wish to annotate each individual instance of that keyterm in the document)
or other meta-data that is associated with a document through holistic
analysis of the document.
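
To make the subtyping option concrete, here is a minimal Java sketch of a base Annotation without offsets, with begin/end added in a subtype.  Only Annotation and the begin/end features come from the spec discussion; SpanAnnotation, KeytermAnnotation, sofaId, term and score are hypothetical names, and the inclusive/exclusive offset convention is an assumption.

```java
// Hypothetical type names; only Annotation and the begin/end features
// are taken from the spec discussion.
class Annotation {
    final String sofaId; // assumed: identifies the subject of analysis

    Annotation(String sofaId) {
        this.sofaId = sofaId;
    }
}

// Subtype for annotations anchored to one contiguous text span.
class SpanAnnotation extends Annotation {
    final int begin; // assumed inclusive character offset
    final int end;   // assumed exclusive character offset

    SpanAnnotation(String sofaId, int begin, int end) {
        super(sofaId);
        if (begin < 0 || end < begin) {
            throw new IllegalArgumentException("invalid span: " + begin + ".." + end);
        }
        this.begin = begin;
        this.end = end;
    }

    // Text covered by this span in the given sofa text.
    String coveredText(String sofaText) {
        return sofaText.substring(begin, end);
    }
}

// Document-level annotation with no offsets, e.g. a keyterm found by
// statistical analysis of the whole document.
class KeytermAnnotation extends Annotation {
    final String term;
    final double score;

    KeytermAnnotation(String sofaId, String term, double score) {
        super(sofaId);
        this.term = term;
        this.score = score;
    }
}
```

With this shape, a keyterm extractor never has to invent offsets, while span-based analytics still get begin/end from the subtype.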

C_2. In the case where an Annotation points into another Annotation (e.g.
the Clause within a Quotation example in 5.3.3.1) via a LocalSofaReference,
are those offsets relative to the Local Sofa (Quotation), or the document as
a whole?  These are two different choices, and which is selected for the
standard should be spelled out.  If the offsets are relative to the span of
the other Annotation, there will be some "offset management" that needs to
be done to map offsets back to document spans.  And what if the "source"
Annotation doesn't have document spans specified?  [See also OI_1, which is
germane here.]
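
The "offset management" mentioned in C_2 can be sketched as follows, assuming the first choice (offsets relative to the enclosing Annotation).  Span and toDocument are hypothetical names, and the inclusive/exclusive convention is an assumption; the spec does not define this mapping.

```java
// Hypothetical sketch: a character span with begin inclusive, end exclusive.
final class Span {
    final int begin;
    final int end;

    Span(int begin, int end) {
        if (begin < 0 || end < begin) {
            throw new IllegalArgumentException("invalid span: " + begin + ".." + end);
        }
        this.begin = begin;
        this.end = end;
    }

    // Maps a span expressed relative to 'container' (e.g. a Clause inside a
    // Quotation) back to document offsets by shifting it by container.begin.
    static Span toDocument(Span local, Span container) {
        int begin = container.begin + local.begin;
        int end = container.begin + local.end;
        if (end > container.end) {
            throw new IllegalArgumentException("local span exceeds its container");
        }
        return new Span(begin, end);
    }
}
```

Note that the mapping needs the container's own document span; if the "source" Annotation has none, there is nothing to shift by, which is exactly the gap raised above.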

C_3.  Examples should use features introduced elsewhere, e.g. "beginChar"
and "endChar" rather than "begin" and "end".

===================================
Votable issues 
===================================

(I think we should identify these after we discuss the other sections)

===================================
Open issues 
===================================

OI_1. I believe that there is some confusion in the document about the use
of the concept of a "sofa" (Subject of Analysis).  In Section 5.1.3, sofas
are defined as the content an annotation refers to.  This is also how it is
used in 5.3.2, where Annotations have sofa features.  In other text, e.g.
5.3.4.1, and in the current UIMA SDK Reference manual, a sofa refers
generically to a view of a document (e.g. a textual transcription of an audio
stream, an English translation of a French document, or an ASCII text
version of a PDF document).  Then 5.3.4.2 explicitly introduces the concept
of a "View". Perhaps what is intended is that an Annotation refers to a span
within a specific view.  Does this then imply that a sofaObject in a
SofaReference isn't just identified through a single identifier (id="1" in
the Quotation example), but through a compound of View name and sofaObject
ID?  But this would require individual SofaReferences to know which Views
they are a part of, rather than the View pointing to the relevant collection
of CAS objects.  As you can tell, I'm not sure I actually have a problem
with what is currently proposed in the spec -- it may be the cleanest way to
handle these different facets (in particular because it allows an object in
one View to reference objects in other Views).  But we should work through
it, and clarify the descriptions throughout the spec to consistently use the
terminology "sofa" and "view", and to make sure that offsets are handled
appropriately per discussion above.  The confusion is perhaps highlighted by
the discussion in section 5.3.4.3 in which an AnchoredView is tied to a
specific SofaReference -- again, is it the sofa or the view that captures
the metadata of a particular perspective of a document?

OI_2.  There is an open issue explicitly called out in the text of 5.3.4.1:
whether to define a "RegionalReference" type or to subtype Annotation with
different regional reference mechanisms.  After discussion, the authors of
the section suggest including an abstract RegionalReference in the base
type system, but not mandate it.  This is probably a sensible approach.  If
we all agree with that, then we'll have to rewrite the section somewhat to
clarify this decision. Of course, if the type system mirrored the Java
class/interface model and thereby supported a kind of multiple inheritance,
some of the disadvantages of subtyping would go away...  But I guess that's
too big of a change.
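
For illustration only, this is roughly what the Java class/interface analogy would look like: RegionalReference as an interface that Annotation subtypes can implement alongside their single superclass.  RegionalReference, Annotation, beginChar and endChar are names from the spec discussion; describeRegion, label, TextAnnotation and AudioAnnotation are hypothetical, and nothing here reflects what the type system currently supports.

```java
// Hypothetical sketch: regional reference as an interface, not a base class.
interface RegionalReference {
    // Assumed method: a readable description of the referenced region.
    String describeRegion();
}

class Annotation {
    final String label; // assumed feature, for illustration only

    Annotation(String label) {
        this.label = label;
    }
}

// Text region addressed by character offsets.
class TextAnnotation extends Annotation implements RegionalReference {
    final int beginChar;
    final int endChar;

    TextAnnotation(String label, int beginChar, int endChar) {
        super(label);
        this.beginChar = beginChar;
        this.endChar = endChar;
    }

    @Override
    public String describeRegion() {
        return "chars " + beginChar + "-" + endChar;
    }
}

// Audio region addressed by time offsets: same interface, different mechanism.
class AudioAnnotation extends Annotation implements RegionalReference {
    final int beginMillis;
    final int endMillis;

    AudioAnnotation(String label, int beginMillis, int endMillis) {
        super(label);
        this.beginMillis = beginMillis;
        this.endMillis = endMillis;
    }

    @Override
    public String describeRegion() {
        return "millis " + beginMillis + "-" + endMillis;
    }
}
```

Each subtype keeps Annotation as its one superclass, yet code that only cares about regions can work against RegionalReference, which is the kind of flexibility single-inheritance subtyping lacks.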

OI_3. We should discuss Footnote 4 on p. 39.

OI_4. We should also discuss Source Document Information, in 5.3.4.4.  I
have thoughts here, but not enough time right now to spell them out :).

OI_5. Is there anything missing in this section that we'd like to see added?
Other things that fall into the category of "suggested other types we found
useful"?

===================================
Compliance points 
===================================

There are currently three candidate compliance points relevant for the
section:

5.1.3: A UIMA component/framework may be "annotation model compliant" if it
uses this definition by the UIMA Type-System base model.

5.3.1: A compliant UIMA component/framework may be required to understand
this set of primitive types, and may be required to treat EObject as the
superclass of all classes.

5.3.3: A UIMA component/framework that is "annotation model compliant" may
be required to adhere to the constraint that all Annotation objects must
have a sofa slot that holds a reference to either a LocalSofaReference or a
RemoteSofaReference.
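
As a rough sketch of what checking the 5.3.3 constraint could look like in Java: the type names Annotation, SofaReference, LocalSofaReference and RemoteSofaReference come from the spec, but the field names, constructors and the AnnotationModelCheck helper are hypothetical.

```java
// Hypothetical class shapes; only the type names come from the spec.
abstract class SofaReference {}

class LocalSofaReference extends SofaReference {
    final String sofaObjectId; // assumed: id of an object in the same CAS

    LocalSofaReference(String sofaObjectId) {
        this.sofaObjectId = sofaObjectId;
    }
}

class RemoteSofaReference extends SofaReference {
    final String sofaUri; // assumed: location of an external sofa

    RemoteSofaReference(String sofaUri) {
        this.sofaUri = sofaUri;
    }
}

class Annotation {
    final SofaReference sofa; // the "sofa slot" named in 5.3.3

    Annotation(SofaReference sofa) {
        this.sofa = sofa;
    }
}

final class AnnotationModelCheck {
    private AnnotationModelCheck() {}

    // True if the sofa slot holds a local or a remote sofa reference;
    // a null slot fails because instanceof is false for null.
    static boolean hasValidSofaSlot(Annotation a) {
        return a.sofa instanceof LocalSofaReference
            || a.sofa instanceof RemoteSofaReference;
    }
}
```

A check of this kind is only as precise as the compliance point itself, which is one reason the wording should move from "may be required" to "is required".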


Comments:

Apart from the obvious language change from "may be required" to "is
required" and the removal of the attribution "candidate", there are some
other issues here.

CP_1. I do think that EObject should be assumed to be the superclass of all
classes.  It is important to have a single common superclass for ease of
programming.

CP_2. The 5.1.3 compliance point should be more explicit about what "this
definition" precisely refers to.






_______________________________________________________________
Karin Verspoor, Computational Linguist
Knowledge and Information Systems Science team
Computer, Computation & Statistics division
http://public.lanl.gov/verspoor
email: verspoor@lanl.gov   Mail: Los Alamos National Laboratory
phone: 505-667-5086              PO Box 1663, MS B256
fax:   505-667-1126              Los Alamos, NM 87545
_______________________________________________________________