
Subject: Re: [uima] [Type System Base Model subgroup]

From: Thilo W Goetz <TGOETZ@de.ibm.com>
To: Karin Verspoor <verspoor@lanl.gov>
Date: Wed, 7 Mar 2007 18:26:27 +0100

Karin Verspoor <verspoor@lanl.gov> wrote on 03/07/2007 18:08:06:
> I agree that if we are going to provide support for such structures, we
> should provide support for the most general of such structures, e.g. Graphs.
> Since Trees are a special case of Graphs, this covers trees and other
> relational structures. Constraining this to DAGs probably makes sense,
>
> I do understand your concern about object proliferation in the proposal I
> circulated. I also understand that "sometimes we want to express the region
> which a node governs". However, I think while the latter point will be very
> common for parse trees (where e.g. a single parse tree corresponds to a
> contiguous span like a sentence) it is not generally true of these sorts of
> hierarchical Annotation collections and by subtyping Annotation to add the
> graph support, we would be requiring AnnotationGraphs to be tied to specific
> sofa spans. I'm imagining use cases for the AnnotationGraphs to include
> things like the collection of co-references corresponding to a particular
> entity, which will clearly not correspond to contiguous spans. Of course,
> if individual Annotations can have non-contiguous spans this is less of a
> problem and subtyping could work. Anyone else have a preference for a
> particular design?

It's these kinds of discussions that make me very suspicious of adding more special-purpose types to the UIMA framework. In particular, I don't think that the standards group should be designing these top-down. I think it would be better to come up with a practical proposal for the actual implementation, and see if other people adopt it. If people seem to converge on a common design for these things, then there's still time to standardize them and put them in the OASIS work.

> Actually, I should modify my proposal not to explicitly encode "parent" and
> "child" but rather to simply represent inlinks and outlinks; that way we can
> leave the semantics of the DAG up to the specific use case.
>
> I do see your point about unidirectional representation. However, this
> would require traversal of these structures to always proceed bottom-up in
> the case of trees, which might not be natural.
>
> Good point about validation of the structures -- to what extent is the
> framework responsible for checking/enforcing constraints on structures in
> other areas?

It isn't. The framework doesn't even check annotation ranges at this time (in the sense that begin and end positions can have values that are, for example, negative). Particular APIs like getCoveredText() will check that, but the framework does not prevent you from entering illegal values.

There are two possible approaches to this kind of thing (in my opinion). Either you allow users to set values freely, in which case it's their business to do the checking; that's the UIMA approach to annotation ranges, for example. Or the implementation is hidden and users must use a special API that ensures consistency, sort of like Sofas in UIMA.

So if we define trees in the type system, and the user can manipulate them with regular CAS APIs, then it's their job to ensure consistency of the representation.
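
To make the contrast between the two approaches concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: FreeSpan, GuardedSpan, covered_text and set_range are hypothetical names, not UIMA APIs, and the range check is one plausible rule (0 <= begin <= end <= text length).

```python
from dataclasses import dataclass


@dataclass
class FreeSpan:
    """Approach 1 (hypothetical): fields are set freely, nothing is checked."""
    begin: int
    end: int


def covered_text(text: str, span: FreeSpan) -> str:
    """A particular API that checks the range only at the point of use."""
    if not (0 <= span.begin <= span.end <= len(text)):
        raise ValueError(f"illegal span [{span.begin}, {span.end})")
    return text[span.begin:span.end]


class GuardedSpan:
    """Approach 2 (hypothetical): representation hidden behind a checking API."""

    def __init__(self, text: str, begin: int, end: int) -> None:
        self._text = text
        self._begin = 0
        self._end = 0
        self.set_range(begin, end)

    def set_range(self, begin: int, end: int) -> None:
        # Every modification goes through this check, so an instance
        # can never hold an illegal range.
        if not (0 <= begin <= end <= len(self._text)):
            raise ValueError(f"illegal span [{begin}, {end})")
        self._begin, self._end = begin, end

    def covered_text(self) -> str:
        return self._text[self._begin:self._end]
```

With the first class, FreeSpan(-5, 3) is accepted and the error only surfaces when covered_text is called; with the second, the same values are rejected at construction time. The trade-off is the one described above: freedom plus user-side checking versus a narrower API that guarantees consistency.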
>
> Karin
>
> On 3/4/07 2:23 PM, "KANO, Yoshinobu" <kano@is.s.u-tokyo.ac.jp> wrote:
>
> > Here is a refined version of my proposal about the tree or graph structures,
> > "Open issue 9." of the latest mail from Karin.
> > This issue is not fully discussed in the previous telecon,
> > because we didn't have much time.
> >
> >
> > * Purpose of this proposal
> >
> > The purpose of this proposal is to provide references
> > which have a limited use for the widely used specific data structures,
> > trees or/and DAGs, with least cost of efficiency.
> >
> > These structures can be used to explicitly express syntactic trees,
> > shared-node multiple trees, etc.
> >
> >
> > * Problems in the current specification/implementation
> >
> > Suppose an UIMA component which generates CFG parse trees.
> > For example, a CFG tree may contain Annotations of "phrases" and "words",
> > and these Annotations have a specific type of syntactic relations
> > between them.
> >
> > We cannot express unary relations with just begin/end pairs,
> > so we must use "references" between annotations to express relations.
> >
> > Because the "reference" defined in the current specification is generic,
> > we don't have a mean to tell the class of the information structures
> > to other UIMA components.
> >
> >
> > * Classes of structures
> >
> > A basic restriction to the references is not to contain any cycles.
> > It is really useful to assure that a set of references are acyclic.
> >
> > [Tree] Tree structures, acyclic and only a single parent is permitted.
> > [DAG] Directed Acyclic Graphs, acyclic and multiple parents are permitted.
> >
> > My proposal is mainly to provide tree structures,
> > because structures are essentially skeleton trees
> > with additional links between nodes, in most cases in the NLP field.
> >
> > But I think it is also important to provide DAGs.
> > For example, we can naturally express multiple tree structures
> > (like many parse candidates) by a single DAG.
> > In this case trees can share nodes and we can avoid the combinatorial
> > explosion.
> >
> > * Type hierarchy and View-like new type
> >
> > The main advantage of the <AnnotationGraphNode> (View-like new type)
> > may be that it can separate the structure and leaf Annotations explicitly.
> > But the separation of the structure and Annotations generates
> > more objects, i.e. decrease the efficiency.
> > On the other hand, sometimes we want to express the region which a node
> > governs.
> > For these reasons I would prefer to define a class under <Annotation>.
> > In this case, I think it is better to provide the references to
> > roots, leaves, or entire collection of nodes as an option.
> >
> > If we provide both [Tree] and [DAG], the type hierarchy is:
> > <Annotation> -subtype-> <Tree Node>,
> > <Annotation> -subtype-> <DAG Node>.
> >
> > It seems to be natural to define them as:
> > <Annotation> -subtype-> <DAG Node> -subtype-> <Tree Node>,
> > because trees are also DAGs.
> > But <DAG Node> will have more features(references) than <Tree Node>,
> > so it is better not to inherit <DAG Node>.
> >
> > If the case is [Tree] only:
> > <Annotation> -subtype-> <Tree Node>,
> >
> >
> > * Expression of structures
> >
> > Candidates of "references" are "parent(s)" and "children".
> > If we do not provide "children" references,
> > then the order of children will be resolved by begin/end positions.
> > It is easy because Annotations are currently sorted by begin/end.
> >
> > a. unidirectional
> >
> > [Tree] It is better to use "parent", not "children" as references
> > because we can explicitly limit the number of parents to one.
> >
> > [DAG] "parents" or "children".
> >
> > b. bidirectional
> >
> > I don't support bidirectional references
> > because there can be miss-linked parent-child pairs,
> > though it is convenient to provide bidirectional references.
> > There is another problem in efficiency that the memory requirement
> > increases.
> >
> >
> > * Validation of structures
> >
> > I think it is not good for the efficiency if the UIMA system always check
> > the structures whether they are trees/DAGs or not.
> > But it is also annoying to check that in the component side.
> >
> > My proposal is to provide a check system, and users can switch on/off
> > to perform the validations by the component descriptor or something.
> > I'm not sure how much should these things be included in the specification.
> >
> > In the case of bidirectional references, one may have a need to validate
> > whether the parent-child relations correctly match or not.
> > This is another reason why I don't support the bidirectional references.
> >
> >
> >
> > * Summary
> >
> > I propose to define <TreeNode> and <DAGNode> types under <Annotation>.
> > In my opinion, references are unidirectional:
> > "parent" for [Tree], "parents" or "children" for [DAG].
> > A structure validation system must be provided as an optional one.
> >
> > Options:
> > 1. Provide a [Tree] structure specific type or not
> > 2. Provide a [DAG] structure specific type or not
> >
> >
> > best,
> >
> > Yoshinobu KANO
> > kano@is.s.u-tokyo.ac.jp
> > Tsujii Lab., the University of Tokyo
> >
>
> _______________________________________________________________
> Karin Verspoor, Computational Linguist
> Knowledge and Information Systems Science team
> Computer, Computation & Statistics division
> http://public.lanl.gov/verspoor
> email: verspoor@lanl.gov        Mail: Los Alamos National Laboratory
> phone: 505-667-5086                   PO Box 1663, MS B256
> fax:   505-667-1126                   Los Alamos, NM 87545
> _______________________________________________________________
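
For readers following the quoted proposal, the unidirectional "parent"-only tree encoding, with child order recovered from begin/end positions and an optional validation switch, could look roughly like the Python sketch below. Everything here is an assumption for illustration: TreeNode, is_acyclic and children_of are hypothetical names rather than UIMA types or APIs, and the sort key (begin ascending, then end descending) is just one reasonable reading of "sorted by begin/end".

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(eq=False)  # identity-based equality, so nodes can be dict keys
class TreeNode:
    """Hypothetical annotation-like node holding only a parent reference."""
    begin: int
    end: int
    label: str
    parent: Optional["TreeNode"] = None


def is_acyclic(nodes: List[TreeNode]) -> bool:
    """Walk parent links from every node; revisiting a node means a cycle."""
    for start in nodes:
        seen = set()
        node: Optional[TreeNode] = start
        while node is not None:
            if node in seen:
                return False
            seen.add(node)
            node = node.parent
    return True


def children_of(nodes: List[TreeNode],
                validate: bool = True) -> Dict[TreeNode, List[TreeNode]]:
    """Derive child lists from parent links, ordered by span position.

    The validate flag stands in for the on/off validation switch from the
    proposal; callers who trust their data can skip the check.
    """
    if validate and not is_acyclic(nodes):
        raise ValueError("parent references contain a cycle")
    children: Dict[TreeNode, List[TreeNode]] = {}
    for n in nodes:
        if n.parent is not None:
            children.setdefault(n.parent, []).append(n)
    for kids in children.values():
        # Assumed ordering: earlier begin first, longer span first on ties.
        kids.sort(key=lambda k: (k.begin, -k.end))
    return children
```

Because each node stores a single parent, the one-parent constraint holds by construction, and top-down traversal is still possible by deriving the child lists on demand. A DAG variant would need a list of parents and a more general cycle check.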

Mit freundlichen Gruessen / Best regards

Thilo Goetz
OmniFind & UIMA development
Information Management Division
IBM Germany
+49-7031-16-1758

IBM Deutschland Entwicklung GmbH
Chairman of the Supervisory Board: Johann Weihen
Management: Herbert Kircher
Registered office: Böblingen
Registration court: Amtsgericht Stuttgart, HRB 243294

Follow-Ups:
- Re: [uima] [Type System Base Model subgroup]
  - From: "KANO, Yoshinobu" <kano@is.s.u-tokyo.ac.jp>

References:
- Re: [uima] [Type System Base Model subgroup]
  - From: Karin Verspoor <verspoor@lanl.gov>