Good point from Adam about scaling. Is the restriction on multiple CASes
vs. a single CAS based on this underlying concern? If so, I think it
would be helpful to amplify that point in the documentation to help the
reader/designer/programmer. In my experience, scalability often gets
put off until late in the process. If I am not thinking about it (and
clearly my example shows I was not), then the restriction may seem like
an artificial encumbrance.
For David's point about the analytic considering multiple CASes to make
a decision: I think this, too, is a good reason to avoid multiple CASes
on the input. Again, I think this should be amplified in the document.
--
Christopher W. Milner, Ph.D.
Science Applications International Corporation
675 Peter Jefferson Pkwy.
Suite 300
Charlottesville, VA 22911
434-872-8517 (Office)
David Ferrucci wrote:
A few high-level points:

Behavioral metadata specifications are "hard": they impose a significant
burden on developers, but at the same time they help enable reusability.

I like the idea that we can "batch up" CASes. I can see the need for that.
This may, however, make it harder to specify behavioral metadata, but I am
not sure. For example, the metadata can still be stated on a "CAS per CAS"
basis -- batching CASes up to improve transport efficiency, for example,
need not affect the semantics of the behavior?
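
To make the "CAS per CAS" point concrete, here is a minimal Python sketch. It assumes a CAS can be stood in for by a plain dict, and the names `annotate_token_count` and `process_batch` are hypothetical; none of this is taken from the whitepaper or from Apache UIMA. It only illustrates that a batch call can be a transport convenience while the behavior is still specified for one CAS at a time.

```python
from typing import Callable, Dict, List

# Assumption: a plain dict stands in for a CAS in this sketch.
Cas = Dict[str, object]


def annotate_token_count(cas: Cas) -> Cas:
    """Toy analytic whose behavior is stated for a single conforming CAS."""
    result = dict(cas)  # copy, so the input CAS is left untouched
    result["token_count"] = len(str(cas["text"]).split())
    return result


def process_batch(analytic: Callable[[Cas], Cas], batch: List[Cas]) -> List[Cas]:
    """Hypothetical batch entry point: one call, each CAS handled independently."""
    return [analytic(cas) for cas in batch]


docs = [{"text": "one two three"}, {"text": "four five"}]
batched = process_batch(annotate_token_count, docs)
one_at_a_time = [annotate_token_count(d) for d in docs]
```

Under these assumptions the batched result is the same as calling the analytic once per CAS, so the per-CAS metadata would still describe the behavior.
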

The other reason to batch, however, is for the analytic to consider
multiple CASes to make a decision. This is different. The standard way to
do this is to build up data in an external resource that the analytic can
consult to make decisions. If you don't do that, then you do run into the
issue of having to describe a more complex input specification (which is
then also harder to enforce). Is there any sense in defining an interface
for a CAS Collection? So, abstractly, rather than leaving this entirely to
the application, if this is a common pattern, we might support analytics
dumping interesting CASes into a default "CAS Collection". Analytics can
consult this (temporary?) store (however it is implemented) to make
decisions about subsequent CASes.

I realize this may seem to circumvent the spirit of the behavioral
metadata. But two things mitigate, if not eliminate, that concern:

1. This is essentially the notion of consulting an external resource for
   making decisions, which we also intended to allow.
2. It was the intent that behavioral metadata need only describe what the
   analytic does (at some level of abstraction) to a conforming input CAS,
   NOT how it does it. The intent is to know what is valid input and what
   sorts of statements the analytic intends to assert.
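
A rough Python sketch of the "CAS Collection" idea follows. The class names `CasCollection` and `DuplicateFlagger`, their methods, and the use of a dict as a CAS are all assumptions made for illustration; no such interface is defined in the whitepaper. The analytic still receives one CAS per call and consults the store to decide what to assert about it.

```python
from typing import Dict, Iterator, List

# Assumption: a plain dict stands in for a CAS in this sketch.
Cas = Dict[str, object]


class CasCollection:
    """Hypothetical default store that analytics can add interesting CASes to."""

    def __init__(self) -> None:
        self._items: List[Cas] = []

    def add(self, cas: Cas) -> None:
        self._items.append(cas)

    def __iter__(self) -> Iterator[Cas]:
        return iter(self._items)

    def __len__(self) -> int:
        return len(self._items)


class DuplicateFlagger:
    """Toy analytic: sees one CAS per call, consults the store to decide."""

    def __init__(self, store: CasCollection) -> None:
        self.store = store

    def process(self, cas: Cas) -> Cas:
        seen = any(prior["text"] == cas["text"] for prior in self.store)
        result = dict(cas)
        result["duplicate"] = seen
        if not seen:
            self.store.add(cas)  # remember it for decisions about later CASes
        return result


store = CasCollection()
flagger = DuplicateFlagger(store)
flags = [flagger.process({"text": t})["duplicate"] for t in ["a", "b", "a"]]
```

In this sketch the input specification stays "one conforming CAS"; the store plays the role of the external resource.
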

The flow controller modifying the CAS?
Not sure of all the implications here either. I am leaning toward the
basic principle of separation of concerns -- have the flow controller
control the flow, "period." If it can itself modify the CAS, how far do
we go -- is it an "Annotator" or a "CAS Multiplier" as well?
------------------------------------------------------------------------
David A. Ferrucci, PhD
Senior Manager, Semantic Analysis & Integration
Chief Architect, UIMA
IBM T.J. Watson Research Center
19 Skyline Drive, Hawthorne, NY 10532
Tel: 914-784-7847, 8/863-7847
ferrucci@us.ibm.com
------------------------------------------------------------------------
http://www.ibm.com/research/uima

"Christopher W. Milner" <christopher.w.milner@saic.com> wrote on 03/29/2007 01:19:19 AM:
> Hello,
>
> I am not sure of the current convention on commenting on email
> (interleaved, or simply written at the top) so I'll just write at the
> top and will accept constructive comments if there is a "better" way.
>
> Question 1) I think Adam captured the conversation. I am still inclined
> to think the ability to process multiple CASes in one call is valuable.
> I am a bit new to the paradigm and found myself doing the following (in
> GATE): I had one PE that created an external (and expensive) object,
> processed the entire collection of Documents (CASes) using the object
> and then closed the object. I am not even sure the API allowed
> referencing the object across calls to the PE (outside of statics,
> factories and kludgery). It was all quite easy to do since I could
> operate on a collection of CASes.
> Not willing to fall on my sword about it. Perhaps the UIMA/GATE
> interoperability effort provides some insight?
>
> Question 2, a, b and c: w.r.t. part 1, I am inclined to urge for it. I've
> seen the overhead of passing small pieces of info (small CASes) up and
> down various network-related stacks, along with various attending
> negotiations for licenses, Session passing, etc. and would sorely miss
> the ability to bundle up smaller CASes, where needed.
>
> There is also the question of consistency: if I am for multiple CASes on
> the input side, then I should be so on the output side (especially for a
> pipeline).
>
> But this may be my exposure to GATE speaking here.
>
> Parts 2b and c seem essential for flavors of asynchronous processing:
> I'm inclined to vote for it.
>
> Question 3a) I am not sure of the value. I see on page 83 a discussion
> of mapping between type systems and (perhaps) using the flow controller
> to carry out this mapping. But there seems to be an equally reasonable
> mechanism using Analytics to carry out this mapping.
>
> I think the bigger issue relates to some future ability to apply
> transformations to the entire pipeline. Does permitting the
> flow controller to modify the CAS bollix up some analysis that might
> "automatically" optimize or compose PEs based on their pre-conditions,
> capabilities and post-conditions? Does it render impossible some form
> of dataflow or consistency analysis that might have led to
> parallelization of this work? I am not sure.
>
> Question 3b: not sure.
>
> -chris
> Christopher W. Milner, Ph.D.
> Science Applications International Corporation
> 675 Peter Jefferson Pkwy.
> Suite 300
> Charlottesville, VA 22911
> 434-872-8517 (Office)
>
> Adam Lally wrote:
>
> Hi,
>
> In our last telecon we agreed the Abstract Interfaces open issues
> should undergo further discussion. Let's see if we can get some
> discussion going before the next call. Here's my summary of what we
> discussed last time:
>
> 1) Analyzer Interface: should it be able to process multiple CASes
> in one call?
>
> We discussed that there are two reasons why we might want to allow
> this. First, there is a performance argument: in particular for
> remote services, it may be inefficient to send each document as a
> separate request. Second, there is the argument that there might
> be an Analytic that needs to see a set of related CASes in order to
> make a decision about how to annotate them.
>
> I think we were in agreement that we at least need to support
> sending multiple CASes for the performance reasons. Possibly this
> can be pushed down to the concrete (SOAP, Java) bindings.
>
> The idea of an Analytic operating on a set of related CASes raises
> more questions. Do we then need a way to declare this in the
> Analytic's Behavioral Metadata? This puts a burden on the caller
> of figuring out what a valid set of CASes is for this Analytic,
> otherwise it will not function properly. Also this approach does
> not scale well - if the number of CASes in this logical set is
> large, we may not be able to actually send them all in one call.
>
> We noted that "CAS Consumer" Analytics, which consider a set of
> CASes in order to update some aggregate data structure, do not need
> to have all of the CASes passed to them in one call. They can see
> them one at a time and keep state across process calls. So a
> logical set of CASes needs to be passed only when the results of the
> analysis are written back to those same CASes. Even this case could
> be addressed with a two-pass flow: The FlowController could send
> each CAS through the Analytic once allowing it to compile aggregate
> statistics, and then send each CAS through again to allow the
> Analytic to add annotations.
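
The two-pass flow described above can be sketched roughly as follows, in Python. The names `AboveAverageLength`, `collect`, `annotate` and `two_pass_flow` are hypothetical, a dict stands in for a CAS, and this is not the Apache UIMA FlowController API. It is only an illustration of the Analytic keeping state across calls, so that no call ever needs the whole set of CASes.

```python
from typing import Dict, List

# Assumption: a plain dict stands in for a CAS in this sketch.
Cas = Dict[str, object]


class AboveAverageLength:
    """Toy Analytic that keeps aggregate state across process calls."""

    def __init__(self) -> None:
        self.total_tokens = 0
        self.cas_count = 0

    def collect(self, cas: Cas) -> None:
        """Pass 1: compile aggregate statistics, one CAS per call."""
        self.total_tokens += len(str(cas["text"]).split())
        self.cas_count += 1

    def annotate(self, cas: Cas) -> None:
        """Pass 2: write a result back to the same CAS, using the statistics."""
        mean = self.total_tokens / self.cas_count
        cas["above_average"] = len(str(cas["text"]).split()) > mean


def two_pass_flow(analytic: AboveAverageLength, cases: List[Cas]) -> List[Cas]:
    """Hypothetical flow controller: route every CAS through the Analytic twice."""
    for cas in cases:
        analytic.collect(cas)
    for cas in cases:
        analytic.annotate(cas)
    return cases


cases = [{"text": "a b c d"}, {"text": "a b"}, {"text": "a b c"}]
annotated = two_pass_flow(AboveAverageLength(), cases)
```

Each call still takes a single CAS; only the routing changes.
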
>
>
> Below are the other issues in my summary that we did not get a
> chance to discuss on the call. Comments appreciated.
>
> 2) [Box on pg. 62] Does the CAS Multiplier interface need any/all of
> the following capabilities:
>    a) Return more than one CAS at a time
>    b) Return an indication that no more CASes are available
>       now, but that the caller should try back later. (The caller may
>       specify the amount of time to wait before returning.)
>    c) Return an estimate of how many CASes have not yet been
>       retrieved by the caller.
> 3) [Box on pg. 64] Flow Controller Interface:
>    a) Should it be allowed to modify the CAS? (Currently the
>       whitepaper doesn't allow it, but the Apache UIMA implementation does.)
>    b) Should the FlowController interface be kept simple (as in
>       the UML diagram in figure 12) or be more like the Apache UIMA
>       interface, or somewhere in between?
>
> At the meta-level, to what degree do these need to be specified in
> the Abstract Interfaces section, and what amount of flexibility do
> we leave to specific bindings (concrete interfaces)? This gets to
> the core question of what exactly it means for an implementation
> to comply with the Abstract Interfaces section.
>
> Regards,
> -Adam
> _____________________________
> Adam Lally
> Advisory Software Engineer
> UIMA Framework Lead Developer
> IBM T.J. Watson Research Center
> Hawthorne, NY, 10532
> Tel: 914-784-7706, T/L: 863-7706
> alally@us.ibm.com