Subject: Re: [uima] [Type System Base Model subgroup]
We have traditionally reserved Annotation for objects in the CAS that refer to a span (or, more generally, that have a regional reference, which can refer to an arbitrarily defined region -- not just a span).
Our practice has been that if an object in the CAS does not label or "annotate" a span (or, more generally, a region), we typically do not make it an Annotation.
Objects that describe extracted elements, rather than explicit regions of a sofa, for example, may refer to them through arbitrary features, but we have not thought of these as "annotations".
I am not sure what a more general Annotation type would mean if it does not explicitly refer to some region in the artifact which it 'annotates'. At that level of description, it seems it would be indistinguishable from any other object in the CAS.
Referring to discontinuous regions is a requirement that has come up from time to time, and I believe the current draft does not discuss it. I think it is a worthy requirement needing further discussion.
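To make the discontinuous-region requirement concrete, here is a minimal sketch of a regional reference that holds several spans instead of one. The class names (Span, MultiSpanRegion) and the half-open offset convention are assumptions for illustration only; nothing like this is defined in the current draft.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical types for illustration; not part of the UIMA draft.
final class Span {
    final int begin; // inclusive character offset
    final int end;   // exclusive character offset

    Span(int begin, int end) {
        if (begin < 0 || end < begin) {
            throw new IllegalArgumentException("invalid span: " + begin + ".." + end);
        }
        this.begin = begin;
        this.end = end;
    }
}

// A region made of several spans that need not be adjacent.
final class MultiSpanRegion {
    private final List<Span> spans;

    MultiSpanRegion(Span... spans) {
        this.spans = new ArrayList<>(Arrays.asList(spans));
    }

    // Returns the covered pieces, in the order given, separated by one space.
    String coveredText(String sofaText) {
        StringBuilder sb = new StringBuilder();
        boolean first = true;
        for (Span s : spans) {
            if (!first) {
                sb.append(' ');
            }
            sb.append(sofaText, s.begin, s.end);
            first = false;
        }
        return sb.toString();
    }
}

class MultiSpanRegionDemo {
    public static void main(String[] args) {
        String text = "switch the light off";
        // The multiword term "switch ... off" is split by intervening words.
        MultiSpanRegion region = new MultiSpanRegion(new Span(0, 6), new Span(17, 20));
        System.out.println(region.coveredText(text)); // prints "switch off"
    }
}
```

A single begin/end pair cannot express this region without also covering "the light", which is why the requirement seems to need its own discussion.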
David A. Ferrucci, PhD
Senior Manager, Semantic Analysis & Integration
Chief Architect, UIMA
IBM T.J. Watson Research Center
19 Skyline Drive, Hawthorne, NY 10532
Tel: 914-784-7847, 8/863-7847
"Scott Songlin Piao" <email@example.com> 02/21/2007 05:40 AM
"Karin Verspoor" <firstname.lastname@example.org>, "email@example.com" <firstname.lastname@example.org>, "email@example.com" <firstname.lastname@example.org>, "email@example.com" <firstname.lastname@example.org>
RE: [uima] [Type System Base Model subgroup]
>.. there will be Annotations that don't correspond to explicit
>individual text spans, e.g. document keyterms that are extracted through
>statistical analysis of the document as a whole (where one may not wish to
>annotate each individual instance of that keyterm in the document)
This is true, and I am experiencing this problem myself now. I am wrapping a term extractor (named Termine) for UIMA, and have found it problematic to assign offsets to the terms extracted from a document (or documents). Because the terms are identified using statistical co-occurrence information, the extracted terms do not necessarily correspond to every identical text string or text span occurring in the document(s). If the terms include discontinuous multiword terms, things can be worse.
So I think the top Annotation type should not have offset features, and a subtype can be defined including offsets.
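A rough sketch of what this suggestion might look like, with a top type that carries no offsets and a subtype that adds them. The names BaseAnnotation and SpanAnnotation are hypothetical and chosen only for this example; they are not the types in the draft.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical type names; this is not the UIMA base type system.
class BaseAnnotation {
    final String label;

    BaseAnnotation(String label) {
        this.label = label;
    }

    String describe() {
        return label + " (no offsets)";
    }
}

// Subtype that adds offsets for annotations tied to one text span.
class SpanAnnotation extends BaseAnnotation {
    final int begin; // inclusive
    final int end;   // exclusive

    SpanAnnotation(String label, int begin, int end) {
        super(label);
        this.begin = begin;
        this.end = end;
    }

    @Override
    String describe() {
        return label + " [" + begin + "," + end + ")";
    }
}

class AnnotationHierarchyDemo {
    public static void main(String[] args) {
        List<BaseAnnotation> annotations = Arrays.asList(
            new BaseAnnotation("keyterm"),      // document-level term, no span
            new SpanAnnotation("token", 0, 4)); // span-level annotation
        for (BaseAnnotation a : annotations) {
            System.out.println(a.describe());
        }
    }
}
```

Under this layout a statistically extracted keyterm can be stored without inventing offsets for it, while span-based analytics keep working against the subtype.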
Dr. Scott Piao
NaCTeM & School of Computer Science
University of Manchester
From: Karin Verspoor [mailto:email@example.com]
Sent: 21 February 2007 01:16
To: firstname.lastname@example.org; email@example.com; firstname.lastname@example.org
Subject: [uima] [Type System Base Model subgroup]
Following in Adam's footsteps, I'd like to kick off the discussion for the
Type System Base Model subgroup.
TC%20Sub-Groups_v2.pdf, the members are:
We are due to report on March 2, which doesn't give us much time for
discussion or preparing our report. So let's get started! Below you will
find some initial thoughts I have had, though I'm sure there are more things
to discuss and in fact I haven't quite finished getting my thoughts down.
I'll ask that everyone who is listed above respond to these and bring up new
issues/discussion points by Thursday 2/22. We can then use these to develop
the action plan for the spec element.
As a reminder, we are tasked with the following:
Sub-Groups reports should include:
1. Goals of spec element. What is it trying to achieve in terms of
2. Overall Critique of section. High-level summary of findings. How good/bad
is it in meeting goals. What's the damage? Looks good, just needs some
wordsmithing, has some serious conceptual issues, etc.
3. "Votable" issues. Crisp decisions the TC should vote on required to
harden/complete spec element.
4. Open-Issues. Issues that need extended discussion to resolve
5. List of compliance points. What aspects of this spec element "can",
"must" be adhered to in order to be "compliant".
6. Action Plan. Very important -- List of tasks required to bring spec
element to completion.
Goals of spec element
Section 5.3 of the specification aims to define the set of predefined types
that are assumed to be available to any UIMA-compliant analytic or system.
The spec adopts the primitive types defined by Ecore, covering String,
Boolean, Byte, Short, Int, Long, Float and Double. The main primitive Java
type missing from this list is Char, but this is not defined by Ecore and
can be handled using Int.
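As a small illustration of handling Char through Int, the sketch below round-trips a Java char through an int. The helper names are made up for this example, and the range check reflects an assumption that a component would want to reject int values that are not valid char values.

```java
// Hypothetical helper; the draft does not define any such class.
class CharAsIntDemo {
    // Widening a char to int is lossless: the value is the UTF-16 code unit.
    static int toInt(char c) {
        return c;
    }

    // Narrowing is only safe inside the char range, so check before casting.
    static char toChar(int value) {
        if (value < Character.MIN_VALUE || value > Character.MAX_VALUE) {
            throw new IllegalArgumentException("not a char value: " + value);
        }
        return (char) value;
    }

    public static void main(String[] args) {
        int stored = toInt('A');
        System.out.println(stored + " -> " + toChar(stored)); // prints "65 -> A"
    }
}
```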
Critique of section
I believe that this is the first place in the spec where Annotations are
really defined. However, some details are left unspecified and I think the
section as a whole could be reorganized to clarify the data model for
Annotations -- to formalize what is discussed in 5.1.3. Note that really,
5.1.3 and the whole of 5.3 need to be worked on together.
C_1. For instance, in the intro to section 5, it is stated that "we defined
a type Annotation to represent objects that have regional references (e.g.,
offsets) into the value of an attribute of another object." However, the
section does not state up front that Annotations should have begin/end
features to indicate those offsets; they are used in the example, and the
requirement only appears in 18.104.22.168. It also doesn't clearly specify
what those offsets should be relative to (see C_2 and OI_1).
C_1a. I don't think it is appropriate for all Annotations to have
begin/end features, and I prefer the approach of adding these via
subtyping. This is because there will be Annotations that don't correspond
to explicit individual text spans, e.g. document keyterms that are extracted
through statistical analysis of the document as a whole (where one may not
wish to annotate each individual instance of that keyterm in the document)
or other meta-data that is associated with a document through holistic
analysis of the document.
C_2. In the case where an Annotation points into another Annotation (e.g.
the Clause within a Quotation example in 22.214.171.124) via a LocalSofaReference,
are those offsets relative to the Local Sofa (Quotation), or the document as
a whole? These are two different choices, and which is selected for the
standard should be spelled out. If the offsets are relative to the span of
the other Annotation, there will be some "offset management" that needs to
be done to map offsets back to document spans. And what if the "source"
Annotation doesn't have document spans specified? [See also OI_1 which is
C_3. Examples should use features introduced elsewhere, e.g. "beginChar"
and "endChar" rather than "begin" and "end".
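For the question in C_2, the "offset management" needed under the local-sofa choice can be sketched as a simple translation. This assumes half-open character offsets and an enclosing annotation whose own document offset is known; the method name and parameters are hypothetical, with beginChar/endChar naming used per C_3.

```java
// Hypothetical helper for illustration; not defined by the draft.
class LocalOffsetDemo {
    // Maps offsets given relative to an enclosing annotation onto document offsets.
    // If the enclosing annotation had no document span, no such mapping would exist.
    static int[] toDocumentOffsets(int enclosingBeginChar, int localBeginChar, int localEndChar) {
        if (enclosingBeginChar < 0 || localBeginChar < 0 || localEndChar < localBeginChar) {
            throw new IllegalArgumentException("invalid offsets");
        }
        return new int[] {
            enclosingBeginChar + localBeginChar,
            enclosingBeginChar + localEndChar
        };
    }

    public static void main(String[] args) {
        String document = "He said: \"We left when it rained.\" Then silence.";
        int quotationBeginChar = 10; // start of the quoted text in the document
        // Clause offsets relative to the quotation "We left when it rained."
        int[] doc = toDocumentOffsets(quotationBeginChar, 8, 22);
        System.out.println(document.substring(doc[0], doc[1])); // prints "when it rained"
    }
}
```

If offsets were instead defined relative to the document as a whole, no translation step would be needed, but the Clause would no longer be interpretable from the Quotation alone.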
(I think we should identify these after we discuss the other sections)
OI_1. I believe that there is some confusion in the document about the use
of the concept of a "sofa" (Subject of Analysis). In Section 5.1.3, sofas
are defined as the content an annotation refers to. This is also how it is
used in 5.3.2, where Annotations have sofa features. In other text, e.g.
126.96.36.199, and in the current UIMA SDK Reference manual, a sofa refers
generically to a view of a document (e.g. a textual transcription of an audio
stream, an English translation of a French document, or an ASCII text
version of a PDF document). Then 188.8.131.52 explicitly introduces the concept
of a "View". Perhaps what is intended is that an Annotation refers to a span
within a specific view. Does this then imply that a sofaObject in a
SofaReference isn't just identified through a single identifier (id="1" in
the Quotation example), but through a compound of View name and sofaObject
ID? But this would require individual SofaReferences to know which Views
they are a part of, rather than the View pointing to the relevant collection
of CAS objects. As you can tell, I'm not sure I actually have a problem
with what is currently proposed in the spec -- it may be the cleanest way to
handle these different facets (in particular because it allows an object in
one View to reference objects in other Views). But we should work through
it, and clarify the descriptions throughout the spec to consistently use the
terminology "sofa" and "view", and to make sure that offsets are handled
appropriately per discussion above. The confusion is perhaps highlighted by
the discussion in section 184.108.40.206 in which an AnchoredView is tied to a
specific SofaReference -- again, is it the sofa or the view that captures
the metadata of a particular perspective of a document?
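To illustrate the compound-identifier reading of OI_1, here is a sketch in which a sofa object is looked up by View name plus sofaObject ID rather than by ID alone. SofaKey and the view names are invented for this example and are not part of the spec or the UIMA SDK.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical compound identifier: View name plus sofaObject ID.
final class SofaKey {
    final String viewName;
    final String sofaObjectId;

    SofaKey(String viewName, String sofaObjectId) {
        this.viewName = Objects.requireNonNull(viewName);
        this.sofaObjectId = Objects.requireNonNull(sofaObjectId);
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) {
            return true;
        }
        if (!(other instanceof SofaKey)) {
            return false;
        }
        SofaKey k = (SofaKey) other;
        return viewName.equals(k.viewName) && sofaObjectId.equals(k.sofaObjectId);
    }

    @Override
    public int hashCode() {
        return Objects.hash(viewName, sofaObjectId);
    }
}

class SofaKeyDemo {
    public static void main(String[] args) {
        Map<SofaKey, String> sofas = new HashMap<>();
        // The same ID "1" can name different objects in different Views.
        sofas.put(new SofaKey("transcript", "1"), "text transcribed from audio");
        sofas.put(new SofaKey("translation", "1"), "English translation");
        System.out.println(sofas.get(new SofaKey("translation", "1")));
    }
}
```

The cost noted above is visible here: every reference has to carry the View name, instead of the View simply pointing at its collection of CAS objects.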
OI_2. There is an open issue explicitly called out in the text of 220.127.116.11:
whether to define a "RegionalReference" type or to subtype Annotation with
different regional reference mechanisms. After discussion, the authors of
the section suggest including an abstract RegionalReference in the base
type system, but not mandating it. This is probably a sensible approach. If
we all agree with that, then we'll have to rewrite the section somewhat to
clarify this decision. Of course, if the type system mirrored the Java
class/interface model and thereby supported a kind of multiple inheritance,
some of the disadvantages of subtyping would go away... But I guess that's
too big of a change.
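For comparison, this is roughly what the Java class/interface model would allow: a type that sits in one class hierarchy and is also usable as a regional reference. All names below (OffsetReference, Entity, EntityMention) are hypothetical, and the sketch only illustrates the Java mechanism, not a proposal for the base type system.

```java
// Hypothetical names; illustrates Java's interface model only.
interface RegionalReference {
    // Returns the text of the referenced region within the given sofa text.
    String resolve(String sofaText);
}

// One possible regional reference mechanism: character offsets.
final class OffsetReference implements RegionalReference {
    final int begin; // inclusive
    final int end;   // exclusive

    OffsetReference(int begin, int end) {
        this.begin = begin;
        this.end = end;
    }

    @Override
    public String resolve(String sofaText) {
        return sofaText.substring(begin, end);
    }
}

class Entity {
    final String name;

    Entity(String name) {
        this.name = name;
    }
}

// Belongs to the Entity hierarchy and can also be used as a RegionalReference.
class EntityMention extends Entity implements RegionalReference {
    private final RegionalReference region;

    EntityMention(String name, RegionalReference region) {
        super(name);
        this.region = region;
    }

    @Override
    public String resolve(String sofaText) {
        return region.resolve(sofaText);
    }
}

class RegionalReferenceDemo {
    public static void main(String[] args) {
        EntityMention mention = new EntityMention("UIMA", new OffsetReference(13, 17));
        System.out.println(mention.resolve("IBM released UIMA.")); // prints "UIMA"
    }
}
```

With single inheritance only, EntityMention would have to choose between the Entity hierarchy and an Annotation subtype, which is the disadvantage of subtyping mentioned above.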
OI_3. We should discuss Footnote 4 on p. 39.
OI_4. We should also discuss Source Document Information, in 18.104.22.168. I
have thoughts here, but not enough time right now to spell them out :).
OI_5. Is there anything missing in this section that we'd like to see added?
Other things that fall into the category of "suggested other types we found
There are currently three candidate compliance points relevant for the
5.1.3: A UIMA component/framework may be “annotation model compliant” if it
uses this definition by the UIMA Type-System base model.
5.3.1: A compliant UIMA component/framework may be required to understand
this set of primitive types, and may be required to treat EObject as the
superclass of all classes.
5.3.3: A UIMA component/framework that is "annotation model compliant" may
be required to adhere to the constraint that all Annotation objects must
have a sofa slot that holds a reference to either a LocalSofaReference or a
Apart from the obvious language change from "may be required" to "is
required" and the removal of the attribution "candidate", there are some
other issues here.
CP_1. I do think that EObject should be assumed to be the superclass of all
classes. It is important to have a single common superclass for ease of
CP_2. The 5.1.3 compliance point should be more explicit about what "this
definition" precisely refers to.
Karin Verspoor, Computational Linguist
Knowledge and Information Systems Science team
Computer, Computation & Statistics division
email: email@example.com Mail: Los Alamos National Laboratory
phone: 505-667-5086 PO Box 1663, MS B256
fax: 505-667-1126 Los Alamos, NM 87545