
Subject: Re: [uima] [Type System Base Model subgroup]
From: "KANO, Yoshinobu" <kano@is.s.u-tokyo.ac.jp>
To: "uima@lists.oasis-open.org" <uima@lists.oasis-open.org>
Date: Mon, 05 Mar 2007 06:23:43 +0900

Here is a refined version of my proposal about tree and graph structures,
"Open issue 9" in the latest mail from Karin.
This issue was not fully discussed in the previous telecon
because we didn't have much time.


* Purpose of this proposal

The purpose of this proposal is to provide restricted references
dedicated to two specific, widely used data structures,
trees and/or DAGs, at the least cost in efficiency.

These structures can be used to explicitly express syntactic trees,
shared-node multiple trees, etc.


* Problems in the current specification/implementation

Suppose a UIMA component that generates CFG parse trees.
For example, a CFG tree may contain Annotations of "phrases" and "words",
and these Annotations have a specific type of syntactic relation
between them.

Begin/end pairs alone cannot express unary relations,
where a parent and its only child cover the same span,
so we must use "references" between annotations to express relations.

Because the "reference" defined in the current specification is generic,
we have no means of telling other UIMA components
which class of structure the references form.


* Classes of structures

A basic restriction on the references is that they must not contain cycles.
It is very useful to be able to guarantee that a set of references is acyclic.

[Tree] Tree structures, acyclic and only a single parent is permitted.
[DAG] Directed Acyclic Graphs, acyclic and multiple parents are permitted.
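
As an illustration only, the two classes above could be told apart as in
the following Python sketch. The dictionary encoding (node id mapped to a
list of parent ids) and the function name classify are assumptions made for
this example; they are not part of the UIMA specification or any UIMA API.
Following the [Tree] definition above, the sketch checks only acyclicity and
the single-parent limit, not that there is exactly one root.

```python
from collections import defaultdict


def classify(parents):
    """Classify a set of parent references as "tree", "dag" or "cyclic".

    parents: dict mapping each node id to a list of its parent ids
             (hypothetical encoding, chosen for this sketch).
    """
    unvisited, on_path, done = 0, 1, 2
    state = defaultdict(int)  # every node starts as unvisited

    def has_cycle(node):
        # Depth-first walk along parent links; meeting a node that is
        # still on the current path means the references form a cycle.
        state[node] = on_path
        for p in parents.get(node, []):
            if state[p] == on_path:
                return True
            if state[p] == unvisited and has_cycle(p):
                return True
        state[node] = done
        return False

    for node in list(parents):
        if state[node] == unvisited and has_cycle(node):
            return "cyclic"
    # Acyclic: a tree if no node has more than one parent, else a DAG.
    if all(len(ps) <= 1 for ps in parents.values()):
        return "tree"
    return "dag"
```

For instance, a "PP" node attached to both "NP" and "VP" has two parents,
so the structure is classified as a DAG rather than a tree.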

My proposal is mainly to provide tree structures,
because in most cases in the NLP field, structures are essentially
skeleton trees with additional links between nodes.

But I think it is also important to provide DAGs.
For example, we can naturally express multiple tree structures
(such as many parse candidates) as a single DAG.
In this case the trees can share nodes, so we avoid a combinatorial
explosion.

* Type hierarchy and View-like new type

The main advantage of the <AnnotationGraphNode> (a new View-like type)
may be that it separates the structure from the leaf Annotations explicitly.
But separating the structure from the Annotations generates
more objects, i.e. it decreases efficiency.
Moreover, we sometimes want to express the region that a node governs.
For these reasons I would prefer to define a class under <Annotation>.
In that case, I think it is better to provide references to the
roots, the leaves, or the entire collection of nodes as an option.

If we provide both [Tree] and [DAG], the type hierarchy is:
   <Annotation> -subtype-> <Tree Node>,
   <Annotation> -subtype-> <DAG Node>.

It may seem natural to define them as:
   <Annotation> -subtype-> <DAG Node> -subtype-> <Tree Node>,
because trees are also DAGs.
But <DAG Node> will have more features (references) than <Tree Node>,
so it is better for <Tree Node> not to inherit from <DAG Node>.
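
A minimal Python sketch of the sibling layout, purely for illustration.
The class and field names (Annotation, TreeNode, DAGNode, parent, parents)
are assumptions made for this example and do not correspond to an existing
UIMA API; in UIMA itself these would be declared as types in a type system.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Annotation:
    # Span covered by the annotation (character offsets).
    begin: int
    end: int


@dataclass
class TreeNode(Annotation):
    # Exactly one optional reference, so at most one parent is possible.
    parent: Optional["TreeNode"] = None


@dataclass
class DAGNode(Annotation):
    # A list-valued reference, so several parents are allowed.
    parents: List["DAGNode"] = field(default_factory=list)


# Both node types sit directly under Annotation; neither inherits the other.
s = TreeNode(0, 17)
np_node = TreeNode(0, 3, parent=s)
```

Keeping the two types as siblings means a TreeNode carries only the single
"parent" feature and never the list-valued feature of a DAGNode.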

If only [Tree] is provided:
   <Annotation> -subtype-> <Tree Node>.


* Expression of structures

Candidates for the "references" are "parent(s)" and "children".
If we do not provide "children" references,
the order of the children is resolved by their begin/end positions.
This is easy because Annotations are currently sorted by begin/end.
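
To illustrate how the child order can be recovered from "parent"
references alone, here is a hedged Python sketch. The dictionary encoding
and the name children_in_order are hypothetical, and sorting by the
(begin, end) pair is an assumption about the exact ordering rule.

```python
def children_in_order(spans, parent_of):
    """Derive ordered child lists from parent references only.

    spans:     dict of node id -> (begin, end)      (hypothetical encoding)
    parent_of: dict of node id -> parent id, or None for a root
    """
    children = {node_id: [] for node_id in spans}
    # Visit nodes in span order, so each child list ends up sorted
    # by begin/end without storing any "children" reference.
    for node_id in sorted(spans, key=lambda n: spans[n]):
        parent = parent_of.get(node_id)
        if parent is not None:
            children[parent].append(node_id)
    return children


spans = {"VP": (4, 17), "S": (0, 17), "NP2": (9, 17), "V": (4, 8), "NP": (0, 3)}
parent_of = {"S": None, "NP": "S", "VP": "S", "V": "VP", "NP2": "VP"}
ordered = children_in_order(spans, parent_of)
print(ordered["S"])   # ['NP', 'VP']
print(ordered["VP"])  # ['V', 'NP2']
```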

a. unidirectional

[Tree] It is better to use "parent", not "children", as the reference,
because we can then explicitly limit the number of parents to one.

[DAG] "parents" or "children".

b. bidirectional

I don't support bidirectional references, convenient as they are,
because parent-child pairs can become mislinked.
They also hurt efficiency, because the memory requirement increases.


* Validation of structures

I think it is bad for efficiency if the UIMA system always checks
whether the structures are really trees/DAGs.
But it is also annoying to have to check this on the component side.

My proposal is to provide a checking mechanism that users can switch on or off,
for example through the component descriptor.
I'm not sure how much of this should be included in the specification.

With bidirectional references, one may also need to validate
that the parent and child relations match each other.
This is another reason why I don't support bidirectional references.
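
As a rough sketch of such a switchable check, the Python below validates
that bidirectional links agree, and only when asked to. Everything here is
hypothetical: the names mismatched_links and process, the dictionary
encoding, and the validate flag, which merely stands in for a setting that
might come from the component descriptor.

```python
def mismatched_links(parent_of, children_of):
    """Return (parent, child) pairs that are recorded in one direction only."""
    upward = {(p, c) for c, p in parent_of.items() if p is not None}
    downward = {(p, c) for p, cs in children_of.items() for c in cs}
    return sorted(upward ^ downward)  # symmetric difference


def process(parent_of, children_of, validate=False):
    """Hypothetical component entry point; validation is opt-in."""
    if validate:
        bad = mismatched_links(parent_of, children_of)
        if bad:
            raise ValueError(f"inconsistent parent/child links: {bad}")
    return len(parent_of)  # placeholder for the component's real work


parent_of = {"S": None, "NP": "S", "VP": "S"}
children_of = {"S": ["NP"]}  # "VP" is missing from the child list
print(mismatched_links(parent_of, children_of))  # [('S', 'VP')]
```

With validate left at False the mislinked pair goes unnoticed, which is the
cost of skipping the check; with unidirectional references this particular
check is not needed at all.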



* Summary

I propose defining <TreeNode> and <DAGNode> types under <Annotation>.
In my opinion, references should be unidirectional:
"parent" for [Tree], "parents" or "children" for [DAG].
A structure validation system must be provided as an optional feature.

Options:
1. Whether to provide a [Tree]-specific type
2. Whether to provide a [DAG]-specific type


best,

Yoshinobu KANO
kano@is.s.u-tokyo.ac.jp
Tsujii Lab., the University of Tokyo