uima message

Subject: RE: [uima] [Type System Base Model subgroup]

From: David Ferrucci <ferrucci@us.ibm.com>
To: "Scott Songlin Piao" <scott.piao@manchester.ac.uk>
Date: Wed, 21 Feb 2007 13:47:45 -0500

we have traditionally reserved Annotation to be an object in the CAS that does refer to a span (or, more generally, has a regional reference which can refer to an arbitrarily defined region -- not just a span).

our practice has been that if an object in the CAS does not label/"annotate" a span (or, more generally, a region), we typically do not make it an Annotation.

Objects that describe extracted elements, rather than explicit regions of a sofa, may for example refer to those elements through arbitrary features, but we have not thought of these as "annotations".

I am not sure what a more general Annotation type would mean if it does not explicitly refer to some region in the artifact which it 'annotates'. It seems that, at that level of description, it would be indistinguishable from any other object in the CAS.

referring to discontinuous regions is a requirement that has come up from time to time, and I believe the current draft does not discuss it. I think it is a worthy requirement needing further discussion.
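
To make the discontinuous-region requirement concrete, here is a minimal Python sketch. It is not taken from the draft spec: it assumes a hypothetical `Region` made of one or more (begin, end) character spans, and a hypothetical `Annotation` that refers to a sofa string only through such a region. All class and attribute names are illustrative.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Region:
    """Hypothetical regional reference: one or more end-exclusive (begin, end) spans."""
    spans: Tuple[Tuple[int, int], ...]

    def __post_init__(self):
        if not self.spans:
            raise ValueError("a region needs at least one span")
        for begin, end in self.spans:
            if begin < 0 or end < begin:
                raise ValueError(f"invalid span ({begin}, {end})")

    @property
    def is_contiguous(self) -> bool:
        # A single span is the ordinary begin/end case.
        return len(self.spans) == 1


@dataclass(frozen=True)
class Annotation:
    """Hypothetical annotation: a label attached to a region of a sofa."""
    label: str
    region: Region

    def covered_text(self, sofa: str) -> List[str]:
        # Return the text of each span, in the order the spans are listed.
        return [sofa[b:e] for b, e in self.region.spans]


sofa = "He looked the word up in a dictionary."
# "looked ... up" is a discontinuous phrasal verb: two spans, one annotation.
phrasal_verb = Annotation("PhrasalVerb", Region(((3, 9), (19, 21))))
print(phrasal_verb.covered_text(sofa))  # ['looked', 'up']
```

Under this assumption a plain span is just the one-element case of a region, so a contiguous Annotation and a discontinuous one share the same type.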

-dave

------------------------------------------------------------------------
David A. Ferrucci, PhD
Senior Manager, Semantic Analysis & Integration
Chief Architect, UIMA
IBM T.J. Watson Research Center
19 Skyline Drive, Hawthorne, NY 10532
Tel: 914-784-7847, 8/863-7847
ferrucci@us.ibm.com
------------------------------------------------------------------------
http://www.ibm.com/research/uima

"Scott Songlin Piao" <scott.piao@manchester.ac.uk>

02/21/2007 05:40 AM

To	"Karin Verspoor" <verspoor@lanl.gov>, "thomas.hampp@de.ibm.com" <thomas.hampp@de.ibm.com>, "kano@is.s.u-tokyo.ac.jp" <kano@is.s.u-tokyo.ac.jp>, "pascal.coupet@temis.com" <pascal.coupet@temis.com>
cc	"uima@lists.oasis-open.org" <uima@lists.oasis-open.org>
Subject	RE: [uima] [Type System Base Model subgroup]

>.. there will be Annotations that don't correspond to explicit
>individual text spans, e.g. document keyterms that are extracted through
>statistical analysis of the document as a whole (where one may not wish to
>annotate each individual instance of that keyterm in the document)

This is true, and I am experiencing this problem myself now. I am wrapping up a term extractor (named Termine) into UIMA, and found it is problematic to assign offsets to the terms extracted from a document (or documents). Because the terms are identified using co-occurrence statistical information, the terms extracted do not necessarily correspond to every identical text string/text span occurring in the document(s). If the terms include discontinuous multiword terms, things can be worse.

So I think the top Annotation type should not have offset features, and a subtype can be defined including offsets.

Cheers
Scott

-----------------------------------
Dr. Scott Piao
NaCTeM & School of computer Science
University of Manchester
UK

-----Original Message-----
From: Karin Verspoor [mailto:verspoor@lanl.gov]
Sent: 21 February 2007 01:16
To: thomas.hampp@de.ibm.com; kano@is.s.u-tokyo.ac.jp; pascal.coupet@temis.com
Cc: uima@lists.oasis-open.org
Subject: [uima] [Type System Base Model subgroup]

Following in Adam's footsteps, I'd like to kick off the discussion for the Type System Base Model subgroup. According to
http://www.oasis-open.org/apps/org/workgroup/uima/download.php/22325/UIMA%20TC%20Sub-Groups_v2.pdf,
the members are:

Thomas
Kano
Pascal
Karin (me)

We are due to report on March 2, which doesn't give us much time for discussion or preparing our report. So let's get started!

Below you will find some initial thoughts I have had, though I'm sure there are more things to discuss and in fact I haven't quite finished getting my thoughts down. I'll ask that everyone who is listed above respond to these and bring up new issues/discussion points by Thursday 2/22.
We can then use these to develop the action plan for the spec element.

As a reminder, we are tasked with the following:

===================================
Sub-Groups reports should include:
===================================

1. Goals of spec element. What is it trying to achieve in terms of interoperability?
2. Overall Critique of section. High-level summary of findings. How good/bad is it in meeting goals. What's the damage? Looks good, just needs some wordsmithing, has some serious conceptual issues, etc.
3. "Votable" issues. Crisp decisions the TC should vote on required to harden/complete spec element.
4. Open-Issues. Issues that need extended discussion to resolve
5. List of compliance points. (What aspects of this spec element "can", "must" be adhered to in order to be "compliant")
6. Action Plan. Very important -- List of tasks required to bring spec element to completion.

===================================
Goals of spec element
===================================

Section 5.3 of the specification aims to define the set of predefined types that are assumed to be available to any UIMA-compliant analytic or system. The spec adopts the primitive types defined by Ecore, covering String, Boolean, Byte, Short, Int, Long, Float and Double. The main primitive Java type missing from this list is Char, but this is not defined by Ecore and can be handled using Int.

===================================
Critique of section
===================================

I believe that this is the first place in the spec where Annotations are really defined. However, some details are left unspecified and I think the section as a whole could be reorganized to clarify the data model for Annotations -- to formalize what is discussed in 5.1.3. Note that really, 5.1.3 and the whole of 5.3 need to be worked on together.

C_1.
For instance, in the intro to section 5, it is stated that "we defined a type Annotation to represent objects that have regional references (e.g., offsets) into the value of an attribute of another object", but the section itself doesn't explicitly specify up front that Annotations should have begin/end features to indicate those offsets (they are used in the example, and the features do come up in 5.3.4.1), and it doesn't clearly specify what those offsets should be relative to (see C_2 and OI_1).

C_1a. I don't think it is appropriate for all Annotations to have begin/end features, and I do prefer the approach of adding these in via subtyping. This is because there will be Annotations that don't correspond to explicit individual text spans, e.g. document keyterms that are extracted through statistical analysis of the document as a whole (where one may not wish to annotate each individual instance of that keyterm in the document) or other meta-data that is associated with a document through holistic analysis of the document.

C_2. In the case where an Annotation points into another Annotation (e.g. the Clause within a Quotation example in 5.3.3.1) via a LocalSofaReference, are those offsets relative to the Local Sofa (Quotation), or the document as a whole? These are two different choices, and which is selected for the standard should be spelled out. If the offsets are relative to the span of the other Annotation, there will be some "offset management" that needs to be done to map offsets back to document spans. And what if the "source" Annotation doesn't have document spans specified? [See also OI_1, which is germane here.]

C_3. Examples should use features introduced elsewhere, e.g. "beginChar" and "endChar" rather than "begin" and "end".
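
The "offset management" mentioned in C_2 can be illustrated with a small Python sketch. It assumes the interpretation in which offsets are relative to the enclosing Annotation (the local sofa) and the enclosing Annotation itself has document offsets. The names `Span` and `to_document_span` are hypothetical and do not come from the spec.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    """Hypothetical pair of character offsets, end-exclusive."""
    begin: int
    end: int


def to_document_span(local: Span, container: Optional[Span]) -> Span:
    """Map offsets relative to a containing annotation onto document offsets.

    `container` is the document span of the enclosing annotation (for example
    the Quotation). If it is None, the enclosing annotation has no document
    span, which is the unresolved case raised in C_2, so we refuse to guess.
    """
    if container is None:
        raise ValueError("enclosing annotation has no document span")
    length = container.end - container.begin
    if not (0 <= local.begin <= local.end <= length):
        raise ValueError("local span falls outside the enclosing annotation")
    return Span(container.begin + local.begin, container.begin + local.end)


document = 'She said, "I came, I saw, I conquered," and left.'
quotation = Span(11, 38)  # the quoted text, in document offsets
clause = Span(8, 13)      # "I saw", in offsets relative to the quotation

doc_span = to_document_span(clause, quotation)
print(document[doc_span.begin:doc_span.end])  # I saw
```

If the standard instead chose document-relative offsets everywhere, this mapping step would disappear, at the cost of a Clause no longer being interpretable from its Quotation alone.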
===================================
Votable issues
===================================

(I think we should identify these after we discuss the other sections)

===================================
Open issues
===================================

OI_1. I believe that there is some confusion in the document about the use of the concept of a "sofa" (Subject of Analysis). In Section 5.1.3, sofas are defined as the content an annotation refers to. This is also how it is used in 5.3.2, where Annotations have sofa features. In other text, e.g. 5.3.4.1, and in the current UIMA SDK Reference manual, a sofa refers generically to a view of a document (e.g. a textual transcription of an audio stream, an English translation of a French document, or an ASCII text version of a PDF document). Then 5.3.4.2 explicitly introduces the concept of a "View".

Perhaps what is intended is that an Annotation refers to a span within a specific view. Does this then imply that a sofaObject in a SofaReference isn't just identified through a single identifier (id="1" in the Quotation example), but through a compound of View name and sofaObject ID? But this would require individual SofaReferences to know which Views they are a part of, rather than the View pointing to the relevant collection of CAS objects.

As you can tell, I'm not sure I actually have a problem with what is currently proposed in the spec -- it may be the cleanest way to handle these different facets (in particular because it allows an object in one View to reference objects in other Views). But we should work through it, and clarify the descriptions throughout the spec to consistently use the terminology "sofa" and "view", and to make sure that offsets are handled appropriately per discussion above.
The confusion is perhaps highlighted by the discussion in section 5.3.4.3 in which an AnchoredView is tied to a specific SofaReference -- again, is it the sofa or the view that captures the metadata of a particular perspective of a document?

OI_2. There is an open issue explicitly called out in the text of 5.3.4.1: whether to define a "RegionalReference" type or to subtype Annotation with different regional reference mechanisms. After discussion, the authors of the section suggest including an abstract RegionalReference in the base type system, but not mandating it. This is probably a sensible approach. If we all agree with that, then we'll have to rewrite the section somewhat to clarify this decision. Of course, if the type system mirrored the Java class/interface model and thereby supported a kind of multiple inheritance, some of the disadvantages of subtyping would go away... But I guess that's too big of a change.

OI_3. We should discuss Footnote 4 on p. 39.

OI_4. We should also discuss Source Document Information, in 5.3.4.4. I have thoughts here, but not enough time right now to spell them out :).

OI_5. Is there anything missing in this section that we'd like to see added? Other things that fall into the category of "suggested other types we found useful"?

===================================
Compliance points
===================================

There are currently three candidate compliance points relevant for the section:

5.1.3: A UIMA component/framework may be "annotation model compliant" if it uses this definition by the UIMA Type-System base model.
5.3.1: A compliant UIMA component/framework may be required to understand this set of primitive types, and may be required to treat EObject as the superclass of all classes.
5.3.3: A UIMA component/framework that is "annotation model compliant" may be required to adhere to the constraint that all Annotation objects must have a sofa slot that holds a reference to either a LocalSofaReference or a RemoteSofaReference.

Comments: Apart from the obvious language change from "may be required" to "is required" and the removal of the attribution "candidate", there are some other issues here.

CP_1. I do think that EObject should be assumed to be the superclass of all classes. It is important to have a single common superclass for ease of programming.

CP_2. The 5.1.3 compliance point should be more explicit about what "this definition" precisely refers to.

_______________________________________________________________
Karin Verspoor, Computational Linguist
Knowledge and Information Systems Science team
Computer, Computation & Statistics division
http://public.lanl.gov/verspoor
email: verspoor@lanl.gov     Mail: Los Alamos National Laboratory
phone: 505-667-5086                PO Box 1663, MS B256
fax: 505-667-1126                  Los Alamos, NM 87545
_______________________________________________________________
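
As an illustration of the 5.3.3 candidate compliance point quoted above, the following Python sketch checks that every Annotation object carries a sofa slot holding either a LocalSofaReference or a RemoteSofaReference. The classes are simplified stand-ins invented for this example, and their field names are assumptions; the spec's actual model is expressed in Ecore, not Python.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


class SofaReference:
    """Stand-in base class for the two reference kinds named in 5.3.3."""


@dataclass
class LocalSofaReference(SofaReference):
    sofa_object_id: str  # hypothetical field: id of an object in the same CAS
    sofa_feature: str    # hypothetical field: attribute holding the sofa value


@dataclass
class RemoteSofaReference(SofaReference):
    sofa_uri: str        # hypothetical field: location of external sofa data


@dataclass
class Annotation:
    label: str
    sofa: Optional[object] = None  # deliberately loose so bad values can be detected


def noncompliant_annotations(annotations: Iterable[Annotation]) -> List[Annotation]:
    """Return annotations whose sofa slot is missing or of the wrong kind."""
    allowed = (LocalSofaReference, RemoteSofaReference)
    return [a for a in annotations if not isinstance(a.sofa, allowed)]


annotations = [
    Annotation("Quotation", RemoteSofaReference("file:/docs/example.txt")),
    Annotation("Clause", LocalSofaReference("1", "text")),
    Annotation("Keyterm"),                       # no sofa slot filled
    Annotation("Broken", "not-a-reference"),     # wrong kind of value
]
print([a.label for a in noncompliant_annotations(annotations)])  # ['Keyterm', 'Broken']
```

Note that under this constraint a document-level keyterm with no sofa reference would fail the check, which is the tension raised in C_1a.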

Follow-Ups:
- Re: [uima] [Type System Base Model subgroup]
  - From: Karin Verspoor <verspoor@lanl.gov>

References:
- RE: [uima] [Type System Base Model subgroup]
  - From: "Scott Songlin Piao" <scott.piao@manchester.ac.uk>