Subject: Re: [uima] [Type System Base Model subgroup]
From: "KANO, Yoshinobu" <kano@is.s.u-tokyo.ac.jp>
To: uima@lists.oasis-open.org
Date: Wed, 21 Feb 2007 20:45:15 +0900

Hi all,

** To all members **
This mail includes addresses that are not yet on the roster.
It would be better to add these addresses to Cc.
Or should we use the To field to make clear who the subgroup members are?

# Two of us, "Yoshinobu Kano" and "Ngan Nguyen", registered
and joined the mailing list after Dave sent us the full list.

Also, my name may be listed incorrectly,
though it is not a big issue:
my given name is "Yoshinobu", and my family name is "Kano".


** [Type System Base Model] Subgroup Discussions **

* A new issue from my point of view

Related to OI_5. "Suggested Other Types" in the mail from Karin.

Syntactic structures are basically tree structures,
and syntactic information is really important
and appears frequently in many NLP applications.
Currently, tree structures can be expressed using general references to
other annotations.
Expressed this way, it is difficult to extract tree structures from,
or distinguish them within, an arbitrary graph of object references,
especially when the graph was created by other people.

# We cannot express tree structures by stand-off spans (begin/end) only,
because it is impossible to order annotations that have the same span
using the spans (begin/end) alone.
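
# A minimal sketch of the same-span problem, in Python. The Span class,
# the labels, and the offsets are hypothetical and only for illustration;
# they are not taken from the spec.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    label: str
    begin: int
    end: int


def contains(outer: Span, inner: Span) -> bool:
    """True if inner lies within outer, judged by offsets only."""
    return outer.begin <= inner.begin and inner.end <= outer.end


# A unary chain NP -> N over one word: both nodes cover offsets 0..4.
np = Span("NP", 0, 4)
n = Span("N", 0, 4)

# Containment holds in both directions, so the offsets alone cannot
# tell us which of the two nodes is the parent.
ambiguous = contains(np, n) and contains(n, np)
```

Here `ambiguous` is true, so some extra information beyond begin/end is needed to recover the tree.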


My proposal is to include a special "tree node" type in the specification;
that special type has a reference to a parent node.
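
A rough sketch of what such a type could look like, written in Python rather than in the spec's own notation. The names Annotation, TreeNode, label, and children_of are my assumptions for illustration, not definitions from the spec.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(eq=False)
class Annotation:
    begin: int
    end: int


@dataclass(eq=False)
class TreeNode(Annotation):
    label: str = ""
    # Reference to the parent node; None marks the root of a tree.
    parent: Optional["TreeNode"] = None


def children_of(node: TreeNode, nodes: List[TreeNode]) -> List[TreeNode]:
    """Collect the nodes whose parent reference points at `node`."""
    return [n for n in nodes if n.parent is node]


s = TreeNode(0, 10, "S")
np = TreeNode(0, 4, "NP", parent=s)
vp = TreeNode(5, 10, "VP", parent=s)
n = TreeNode(0, 4, "N", parent=np)  # same span as NP, but ordered by parent
nodes = [s, np, vp, n]
```

Because every node names its parent explicitly, a consumer can find the tree by looking only at TreeNode instances, without inspecting arbitrary references between other annotations.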

You may argue that this is too domain-specific an issue.
But I don't think so, because arbitrary object graphs have too much
expressive power,
and a more restricted class of structures is useful in many cases.

There are two problems with this proposal:
- we cannot guarantee that a graph consisting of these "nodes" is always
a tree
- nodes cannot be shared by two or more trees
We should consider these issues if you accept this proposal.
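
For the first problem, one possible approach is to validate the parent references after the fact. The following is only a sketch under my own assumptions: nodes are identified by hypothetical string IDs, and is_tree is not a function from any UIMA API.

```python
from typing import Dict, Optional


def is_tree(parent_of: Dict[str, Optional[str]]) -> bool:
    """Check that parent references form exactly one tree.

    parent_of maps each node ID to its parent's ID, or to None for a root.
    """
    roots = [node for node, parent in parent_of.items() if parent is None]
    if len(roots) != 1:
        return False  # no root at all, or a forest of several trees
    for start in parent_of:
        seen = set()
        node: Optional[str] = start
        while node is not None:
            if node in seen or node not in parent_of:
                return False  # a cycle, or a parent that is not a known node
            seen.add(node)
            node = parent_of[node]
    return True
```

For example, is_tree({"S": None, "NP": "S", "VP": "S"}) holds, while a pair of nodes that name each other as parent is rejected. The second problem is not addressed by this check: with a single parent reference, a node can belong to only one tree.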


* Comments on what Karin wrote:

C_1. I agree with Karin's opinion, i.e. make two subtypes under the root type,
one without span features and another with span features (Annotation).
# I'm not sure, but can we just extend the current "TOP" type,
which does not have span features?

OI_1. Isn't it enough to state in the specification:
"Every "CAS" and "view" should have a unique sofa ID,
and annotations refer to it by that sofa ID"?
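
As a rough illustration of what I mean, here is a Python sketch. The sofa IDs, the dictionary of sofas, and covered_text are assumptions of mine, not anything defined in the spec or the SDK.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Annotation:
    sofa_id: str  # unique ID of the sofa this annotation points into
    begin: int
    end: int


# Hypothetical sofas, each registered under a unique ID.
sofas: Dict[str, str] = {
    "sofa-en": "Time flies.",
    "sofa-fr": "Le temps passe.",
}


def covered_text(ann: Annotation, sofas: Dict[str, str]) -> str:
    """Resolve the sofa by ID, then apply the offsets to that sofa only."""
    return sofas[ann.sofa_id][ann.begin:ann.end]


ann = Annotation("sofa-fr", 3, 8)
```

With this arrangement the annotation does not need to know which view it belongs to; the unique ID is enough to find the text its offsets refer to.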

I will consider the other issues and send another mail later.

thanks,

Yoshinobu KANO
kano@is.s.u-tokyo.ac.jp
Tsujii Lab., the University of Tokyo

Karin Verspoor wrote:
> Following in Adam's footsteps, I'd like to kick off the discussion for the
> Type System Base Model subgroup.
> 
> According to
> http://www.oasis-open.org/apps/org/workgroup/uima/download.php/22325/UIMA%20TC%20Sub-Groups_v2.pdf,
> the members are:
> 
> Thomas
> Kano
> Pascal
> Karin (me)
> 
> We are due to report on March 2, which doesn't give us much time for
> discussion or preparing our report.  So let's get started! Below you will
> find some initial thoughts I have had, though I'm sure there are more things
> to discuss and in fact I haven't quite finished getting my thoughts down.
> I'll ask that everyone who is listed above respond to these and bring up new
> issues/discussion points by Thursday 2/22.  We can then use these to develop
> the action plan for the spec element.
> 
> As a reminder, we are tasked with the following:
> 
> ===================================
> Sub-Groups reports should include:
> ===================================
> 
> 1. Goals of spec element. What is it trying to achieve in terms of
> interoperability? 
> 2. Overall Critique of section. High-level summary of findings. How good/bad
> is it in meeting goals. What's the damage? Looks good, just needs some
> wordsmithing, has some serious conceptual issues, etc.
> 
> 3. "Votable" issues. Crisp decisions the TC should vote on required to
> harden/complete spec element.
> 4. Open-Issues. Issues that need extended discussion to resolve
> 5. List of compliance points. What aspects of this spec element "can" or
> "must" be adhered to in order to be "compliant".
> 6. Action Plan. Very important -- List of tasks required to bring spec
> element to completion.
> 
> 
> 
> ===================================
> Goals of spec element
> ===================================
> 
> Section 5.3 of the specification aims to define the set of predefined types
> that are assumed to be available to any UIMA-compliant analytic or system.
> 
> The spec adopts the primitive types defined by Ecore, covering String,
> Boolean, Byte, Short, Int, Long, Float and Double.  The main primitive Java
> type missing from this list is Char, but this is not defined by Ecore and
> can be handled using Int.
> 
> ===================================
> Critique of section
> ===================================
> 
> I believe that this is the first place in the spec where Annotations are
> really defined.  However, some details are left unspecified and I think the
> section as a whole could be reorganized to clarify the data model for
> Annotations -- to formalize what is discussed in 5.1.3.  Note that really,
> 5.1.3 and the whole of 5.3 need to be worked on together.
> 
> C_1. For instance, in the intro to section 5, it is stated that "we defined
> a type Annotation to represent objects that have regional references (e.g.,
> offsets) into the value of an attribute of another object." but the section
> itself doesn't specify explicitly that Annotations should have begin/end
> features to indicate those offsets (though they are used in the example)
> right at the front (though this comes in in 5.3.4.1), and doesn't clearly
> specify what those offsets should be relative to (see C_2 and OI-1).
>   C_1a. I don't think it is appropriate for all Annotations to have
> begin/end features, and I do prefer the approach to adding this in via
> subtyping.  This is because there will be Annotations that don't correspond
> to explicit individual text spans, e.g. document keyterms that are extracted
> through statistical analysis of the document as a whole (where one may not
> wish to annotate each individual instance of that keyterm in the document)
> or other meta-data that is associated with a document through holistic
> analysis of the document.
> 
> C_2. In the case where an Annotation points into another Annotation (e.g.
> the Clause within a Quotation example in 5.3.3.1) via a LocalSofaReference,
> are those offsets relative to the Local Sofa (Quotation), or the document as
> a whole?  These are two different choices, and which is selected for the
> standard should be spelled out.  If the offsets are relative to the span of
> the other Annotation, there will be some "offset management" that needs to
> be done to map offsets back to document spans.  And what if the "source"
> Annotation doesn't have document spans specified?  [See also OI_1 which is
> germane here.]
> 
> C_3.  Examples should use features introduced elsewhere, e.g. "beginChar"
> and "endChar" rather than "begin" and "end".
> 
> ===================================
> Votable issues 
> ===================================
> 
> (I think we should identify these after we discuss the other sections)
> 
> ===================================
> Open issues 
> ===================================
> 
> OI_1. I believe that there is some confusion in the document about the use
> of the concept of a "sofa" (Subject of Analysis).  In Section 5.1.3, sofas
> are defined as the content an annotation refers to.  This is also how it is
> used in 5.3.2, where Annotations have sofa features.  In other text, e.g.
> 5.3.4.1, and in the current UIMA SDK Reference manual, a sofa refers
> generically to a view of a document (e.g. a textual transcription of an audio
> stream, an English translation of a French document, or an ASCII text
> version of a PDF document).  Then 5.3.4.2 explicitly introduces the concept
> of a "View". Perhaps what is intended is that an Annotation refers to a span
> within a specific view.  Does this then imply that a sofaObject in a
> SofaReference isn't just identified through a single identifier (id="1" in
> the Quotation example), but through a compound of View name and sofaObject
> ID?  But this would require individual SofaReferences to know which Views
> they are a part of, rather than the View pointing to the relevant collection
> of CAS objects.  As you can tell, I'm not sure I actually have a problem
> with what is currently proposed in the spec -- it may be the cleanest way to
> handle these different facets (in particular because it allows an object in
> one View to reference objects in other Views).  But we should work through
> it, and clarify the descriptions throughout the spec to consistently use the
> terminology "sofa" and "view", and to make sure that offsets are handled
> appropriately per discussion above.  The confusion is perhaps highlighted by
> the discussion in section 5.3.4.3 in which an AnchoredView is tied to a
> specific SofaReference -- again, is it the sofa or the view that captures
> the metadata of a particular perspective of a document?
> 
> OI_2.  There is an open issue explicitly called out in the text of 5.3.4.1:
> whether to define a "RegionalReference" type or to subtype Annotation with
> different regional reference mechanisms.  After discussion, the authors of
> the section suggest to include an abstract RegionalReference in the base
> type system, but not mandate it.  This is probably a sensible approach.  If
> we all agree with that, then we'll have to rewrite the section somewhat to
> clarify this decision. Of course, if the type system mirrored the Java
> class/interface model and thereby supported a kind of multiple inheritance,
> some of the disadvantages of subtyping would go away...  But I guess that's
> too big of a change.
> 
> OI_3. We should discuss Footnote 4 on p. 39.
> 
> OI_4. We should also discuss Source Document Information, in 5.3.4.4.  I
> have thoughts here, but not enough time right now to spell them out :).
> 
> OI_5. Is there anything missing in this section that we'd like to see added?
> Other things that fall into the category of "suggested other types we found
> useful"?
> 
> ===================================
> Compliance points 
> ===================================
> 
> There are currently three candidate compliance points relevant for the
> section:
> 
> 5.1.3: A UIMA component/framework may be "annotation model compliant" if it
> uses this definition by the UIMA Type-System base model.
> 
> 5.3.1: A compliant UIMA component/framework may be required to understand
> this set of primitive types, and may be required to treat EObject as the
> superclass of all classes.
> 
> 5.3.3: A UIMA component/framework that is "annotation model compliant" may
> be required to adhere to the constraint that all Annotation objects must
> have a sofa slot that holds a reference to either a LocalSofaReference or a
> RemoteSofaReference.
> 
> 
> Comments:
> 
> Apart from the obvious language change from "may be required" to "is
> required" and the removal of the attribution "candidate", there are some
> other issues here.
> 
> CP_1. I do think that EObject should be assumed to be the superclass of all
> classes.  It is important to have a single common superclass for ease of
> programming.
> 
> CP_2. The 5.1.3 compliance point should be more explicit about what "this
> definition" precisely refers to.
> 
> 
> 
> 
> 
> 
> _______________________________________________________________
> Karin Verspoor, Computational Linguist
> Knowledge and Information Systems Science team
> Computer, Computation & Statistics division
> http://public.lanl.gov/verspoor
> email: verspoor@lanl.gov   Mail: Los Alamos National Laboratory
> phone: 505-667-5086              PO Box 1663, MS B256
> fax:   505-667-1126              Los Alamos, NM 87545
> _______________________________________________________________
> 
> 
>
References:
- [Type System Base Model subgroup]
  - From: Karin Verspoor <verspoor@lanl.gov>