Subject: RE: [uima] [Type System Base Model subgroup]
>.. there will be Annotations that don't correspond to explicit
>individual text spans, e.g. document keyterms that are extracted through
>statistical analysis of the document as a whole (where one may not wish to
>annotate each individual instance of that keyterm in the document)

This is true, and I am experiencing this problem myself now. I am wrapping up a term extractor (named Termine) into UIMA, and found it is problematic to assign offsets to the terms extracted from a document (or documents). Because the terms are identified using co-occurrence statistical information, the terms extracted do not necessarily correspond to every identical text string/text span occurring in the document(s). If the terms include discontinuous multiword terms, things can be worse.

So I think the top Annotation type should not have offset features, and a subtype can be defined including offsets.

Cheers
Scott

-----------------------------------
Dr. Scott Piao
NaCTeM & School of Computer Science
University of Manchester
UK

-----Original Message-----
From: Karin Verspoor [mailto:verspoor@lanl.gov]
Sent: 21 February 2007 01:16
To: thomas.hampp@de.ibm.com; kano@is.s.u-tokyo.ac.jp; pascal.coupet@temis.com
Cc: uima@lists.oasis-open.org
Subject: [uima] [Type System Base Model subgroup]

Following in Adam's footsteps, I'd like to kick off the discussion for the Type System Base Model subgroup.

According to http://www.oasis-open.org/apps/org/workgroup/uima/download.php/22325/UIMA%20TC%20Sub-Groups_v2.pdf, the members are:

Thomas
Kano
Pascal
Karin (me)

We are due to report on March 2, which doesn't give us much time for discussion or preparing our report. So let's get started!

Below you will find some initial thoughts I have had, though I'm sure there are more things to discuss and in fact I haven't quite finished getting my thoughts down. I'll ask that everyone who is listed above respond to these and bring up new issues/discussion points by Thursday 2/22.
We can then use these to develop the action plan for the spec element.

As a reminder, we are tasked with the following:

===================================
Sub-Groups reports should include:
===================================

1. Goals of spec element. What is it trying to achieve in terms of interoperability?
2. Overall Critique of section. High-level summary of findings. How good/bad is it in meeting goals. What's the damage? Looks good, just needs some wordsmithing, has some serious conceptual issues, etc.
3. "Votable" issues. Crisp decisions the TC should vote on required to harden/complete spec element.
4. Open-Issues. Issues that need extended discussion to resolve.
5. List of compliance points. What aspects of this spec element "can" or "must" be adhered to in order to be "compliant".
6. Action Plan. Very important -- List of tasks required to bring spec element to completion.

===================================
Goals of spec element
===================================

Section 5.3 of the specification aims to define the set of predefined types that are assumed to be available to any UIMA-compliant analytic or system.

The spec adopts the primitive types defined by Ecore, covering String, Boolean, Byte, Short, Int, Long, Float and Double. The main primitive Java type missing from this list is Char, but this is not defined by Ecore and can be handled using Int.

===================================
Critique of section
===================================

I believe that this is the first place in the spec where Annotations are really defined. However, some details are left unspecified and I think the section as a whole could be reorganized to clarify the data model for Annotations -- to formalize what is discussed in 5.1.3. Note that really, 5.1.3 and the whole of 5.3 need to be worked on together.

C_1.
For instance, in the intro to section 5, it is stated that "we defined a type Annotation to represent objects that have regional references (e.g., offsets) into the value of an attribute of another object." But the section itself doesn't state explicitly, right at the front, that Annotations should have begin/end features to indicate those offsets (they are used in the example, and this does come in at 5.3.4.1), and it doesn't clearly specify what those offsets should be relative to (see C_2 and OI_1).

C_1a. I don't think it is appropriate for all Annotations to have begin/end features, and I do prefer the approach of adding these in via subtyping. This is because there will be Annotations that don't correspond to explicit individual text spans, e.g. document keyterms that are extracted through statistical analysis of the document as a whole (where one may not wish to annotate each individual instance of that keyterm in the document) or other meta-data that is associated with a document through holistic analysis of the document.

C_2. In the case where an Annotation points into another Annotation (e.g. the Clause within a Quotation example in 5.3.3.1) via a LocalSofaReference, are those offsets relative to the Local Sofa (Quotation), or the document as a whole? These are two different choices, and which is selected for the standard should be spelled out. If the offsets are relative to the span of the other Annotation, there will be some "offset management" that needs to be done to map offsets back to document spans. And what if the "source" Annotation doesn't have document spans specified? [See also OI_1, which is germane here.]

C_3. Examples should use features introduced elsewhere, e.g. "beginChar" and "endChar" rather than "begin" and "end".
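To make C_1a and C_2 concrete, here is a minimal sketch in Python (not UIMA code). It assumes the subtyping approach from C_1a and the "offsets relative to the enclosing Annotation" reading from C_2. The names SpanAnnotation, parent and to_document_span are hypothetical and appear nowhere in the spec; the example sentence is made up.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Annotation:
    """Base type with no offsets, e.g. a document-level keyterm."""
    label: str


@dataclass
class SpanAnnotation(Annotation):
    """Hypothetical subtype that adds offsets (the C_1a approach)."""
    begin: int = 0
    end: int = 0
    # None means the offsets are relative to the document itself;
    # otherwise they are relative to the start of the parent span.
    parent: Optional["SpanAnnotation"] = None


def to_document_span(ann: SpanAnnotation) -> Tuple[int, int]:
    """The "offset management" of C_2: map local offsets to document offsets."""
    shift = 0
    node = ann.parent
    while node is not None:
        shift += node.begin
        node = node.parent
    return ann.begin + shift, ann.end + shift


doc = 'He said "we left when it rained" yesterday.'
keyterm = Annotation("rain")                        # no offsets at all
quotation = SpanAnnotation("Quotation", 9, 31)      # document-relative
clause = SpanAnnotation("Clause", 8, 22, quotation)  # relative to the Quotation

begin, end = to_document_span(clause)
print(doc[begin:end])
```

The sketch does not answer the last question in C_2: if the enclosing Annotation is a plain Annotation with no offsets, there is nothing to add to the local offsets, so the mapping back to the document is undefined.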
===================================
Votable issues
===================================

(I think we should identify these after we discuss the other sections)

===================================
Open issues
===================================

OI_1. I believe that there is some confusion in the document about the use of the concept of a "sofa" (Subject of Analysis). In Section 5.1.3, sofas are defined as the content an annotation refers to. This is also how it is used in 5.3.2, where Annotations have sofa features. In other text, e.g. 5.3.4.1, and in the current UIMA SDK Reference manual, a sofa refers generically to a view of a document (e.g. a textual transcription of an audio stream, an English translation of a French document, or an ASCII text version of a PDF document). Then 5.3.4.2 explicitly introduces the concept of a "View".

Perhaps what is intended is that an Annotation refers to a span within a specific view. Does this then imply that a sofaObject in a SofaReference isn't just identified through a single identifier (id="1" in the Quotation example), but through a compound of View name and sofaObject ID? But this would require individual SofaReferences to know which Views they are a part of, rather than the View pointing to the relevant collection of CAS objects.

As you can tell, I'm not sure I actually have a problem with what is currently proposed in the spec -- it may be the cleanest way to handle these different facets (in particular because it allows an object in one View to reference objects in other Views). But we should work through it, and clarify the descriptions throughout the spec to consistently use the terminology "sofa" and "view", and to make sure that offsets are handled appropriately per the discussion above.
The confusion is perhaps highlighted by the discussion in section 5.3.4.3 in which an AnchoredView is tied to a specific SofaReference -- again, is it the sofa or the view that captures the metadata of a particular perspective of a document?

OI_2. There is an open issue explicitly called out in the text of 5.3.4.1: whether to define a "RegionalReference" type or to subtype Annotation with different regional reference mechanisms. After discussion, the authors of the section suggest including an abstract RegionalReference in the base type system, but not mandating it. This is probably a sensible approach. If we all agree with that, then we'll have to rewrite the section somewhat to clarify this decision. Of course, if the type system mirrored the Java class/interface model and thereby supported a kind of multiple inheritance, some of the disadvantages of subtyping would go away... But I guess that's too big of a change.

OI_3. We should discuss Footnote 4 on p. 39.

OI_4. We should also discuss Source Document Information, in 5.3.4.4. I have thoughts here, but not enough time right now to spell them out :).

OI_5. Is there anything missing in this section that we'd like to see added? Other things that fall into the category of "suggested other types we found useful"?

===================================
Compliance points
===================================

There are currently three candidate compliance points relevant for the section:

5.1.3: A UIMA component/framework may be "annotation model compliant" if it uses this definition by the UIMA Type-System base model.

5.3.1: A compliant UIMA component/framework may be required to understand this set of primitive types, and may be required to treat EObject as the superclass of all classes.
5.3.3: A UIMA component/framework that is "annotation model compliant" may be required to adhere to the constraint that all Annotation objects must have a sofa slot that holds a reference to either a LocalSofaReference or a RemoteSofaReference.

Comments:

Apart from the obvious language change from "may be required" to "is required" and the removal of the attribution "candidate", there are some other issues here.

CP_1. I do think that EObject should be assumed to be the superclass of all classes. It is important to have a single common superclass for ease of programming.

CP_2. The 5.1.3 compliance point should be more explicit about what "this definition" precisely refers to.

_______________________________________________________________
Karin Verspoor, Computational Linguist
Knowledge and Information Systems Science team
Computer, Computation & Statistics division
http://public.lanl.gov/verspoor
email: verspoor@lanl.gov
Mail: Los Alamos National Laboratory    phone: 505-667-5086
PO Box 1663, MS B256                    fax: 505-667-1126
Los Alamos, NM 87545
_______________________________________________________________