Subject: Fw: [uima] Abstract Interfaces Open Issues

From: Adam Lally <alally@us.ibm.com>
To: uima@lists.oasis-open.org
Date: Fri, 30 Mar 2007 11:09:48 -0400

I just realized this was not sent to the list. Sorry.
_____________________________
Adam Lally
Advisory Software Engineer
UIMA Framework Lead Developer
IBM T.J. Watson Research Center
Hawthorne, NY, 10532
Tel: 914-784-7706, T/L: 863-7706
alally@us.ibm.com
----- Forwarded by Adam Lally/Watson/IBM on 03/30/2007 11:09 AM -----

Adam Lally/Watson/IBM

03/29/2007 07:03 PM

To	"Christopher W. Milner" <milnerc@pf-cvl.net>
cc
Subject	Re: [uima] Abstract Interfaces Open Issues

Hi Chris, thanks for responding.

> Question 1) I think Adam captured the conversation. I am still
> inclined to think the ability to process multiple CASes in one call
> is valuable. I am a bit new to the paradigm and found myself doing
> the following (in GATE): I had one PE that created an external (and
> expensive) object, processed the entire collection of Documents
> (CASes) using the object and then closed the object. I am not even
> sure the API allowed referencing the object across calls to the PE
> (outside of statics, factories and kludgery). It was all quite easy
> to do since I could operate on a collection of CASes.
> Not willing to fall on my sword about it. Perhaps the UIMA/GATE
> interoperability effort provides some insight?

In this example, are you actually holding the entire collection of Documents in memory at once? My concern is that this wouldn't scale. Can the component receive the Documents (CASes) one at a time and feed them to the external object? In UIMA, an Analytic can definitely create an external object (resource) and hold a reference to it across process calls. "CAS Consumer" components such as indexers typically do something like this. They receive the CASes one at a time, and perform whatever operation is necessary on a CAS (e.g., extracting the contents and adding them to the index) before receiving the next CAS. Since the index can be flushed to disk, the component doesn't need to have all the CASes available at once, which wouldn't scale to large collections.
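
To illustrate the one-at-a-time pattern, here is a rough sketch in Python. It is not UIMA code: Cas, InMemoryIndex, IndexingConsumer, collection_complete and the flush_every parameter are all made-up names for illustration, and the "index" is just a dictionary standing in for something that would really be flushed to disk.

```python
from dataclasses import dataclass, field


@dataclass
class Cas:
    """Hypothetical stand-in for a CAS: a document id plus its text."""
    doc_id: str
    text: str


@dataclass
class InMemoryIndex:
    """Hypothetical external resource; a real index would write to disk."""
    pending: dict = field(default_factory=dict)   # postings not yet flushed
    flushed: dict = field(default_factory=dict)   # stands in for on-disk storage

    def add(self, doc_id, text):
        for token in text.lower().split():
            self.pending.setdefault(token, set()).add(doc_id)

    def flush(self):
        for token, doc_ids in self.pending.items():
            self.flushed.setdefault(token, set()).update(doc_ids)
        self.pending.clear()


class IndexingConsumer:
    """Sees one CAS per process() call and keeps state between calls."""

    def __init__(self, flush_every=2):
        self.index = InMemoryIndex()   # created once, reused across calls
        self.flush_every = flush_every
        self.seen = 0

    def process(self, cas):
        self.index.add(cas.doc_id, cas.text)
        self.seen += 1
        if self.seen % self.flush_every == 0:
            self.index.flush()         # earlier CASes no longer need to be held

    def collection_complete(self):
        self.index.flush()
        return self.index.flushed


consumer = IndexingConsumer(flush_every=2)
for cas in [Cas("d1", "UIMA CAS"), Cas("d2", "CAS consumer"), Cas("d3", "index")]:
    consumer.process(cas)
result = consumer.collection_complete()
```

The point of the sketch is only that the expensive object lives in the component, not in the call: the caller never has to hand over more than one CAS at a time.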

I'm OK with bundling CASes where it's purely a performance optimization and a deployment-dependent decision (based on network bandwidth, available memory, etc., of a particular configuration). The key is that the Analytic will produce the same results regardless of how many CASes it gets in one bundle. I'm more troubled if bundling becomes a precondition on the Analytic, so that a certain number of CASes (perhaps related in some way) must be delivered to the Analytic in one bundle or else it will not work (or will work poorly). That seems like it opens a can of worms.
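
A small sketch of what "same results regardless of bundle size" could look like as a check. This is Python for illustration only; annotate, process_bundle and run are hypothetical names, and the "analysis" (picking out capitalized tokens) is a placeholder for whatever the Analytic really does.

```python
def annotate(text):
    """Hypothetical per-CAS analysis: returns the capitalized tokens."""
    return [tok for tok in text.split() if tok[:1].isupper()]


def process_bundle(bundle):
    """Process a bundle of CASes; each CAS is analyzed independently."""
    return [annotate(text) for text in bundle]


def run(collection, bundle_size):
    """Deliver the collection in bundles of the given size, in order."""
    results = []
    for start in range(0, len(collection), bundle_size):
        results.extend(process_bundle(collection[start:start + bundle_size]))
    return results


docs = ["Adam wrote to Chris", "the index is flushed", "GATE and UIMA"]

# Bundle size is a deployment choice; the results must not depend on it.
assert run(docs, 1) == run(docs, 2) == run(docs, len(docs))
```

An Analytic that needed a particular set of CASes in one bundle would fail a check like this, which is exactly the case that worries me.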

> Question 2, a, b and c: w.r.t. part 1, I am inclined to urge for it.
> I've seen the overhead of passing small pieces of info (small CASes)
> up and down various network-related stacks, along with various
> attending negotiations for licenses, Session passing, etc. and would
> sorely miss the ability to bundle up smaller CASes, where needed.
>
> There is also the question of consistency: if I be for multiple CAS
> on the input side, then I should be so on the output side
> (especially for a pipeline).

Agreed, for performance reasons I think we should allow bundling CASes together into one call, and it makes sense to do that both for request and response. The next question is whether this belongs in the Abstract Interfaces section of the spec or gets pushed down into the Concrete (e.g., SOAP) interface.

> Parts 2b and c seem essential for flavors of asynchronous
> processing: I'm inclined to vote for it.

Seems reasonable. Again I wonder how much belongs in the Abstract Interfaces and how much in the Concrete Interfaces. That higher level question is something I'm struggling with.
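
For discussion, here is one possible shape for a response that covers items 2a to 2c from the summary quoted at the bottom of this message. It is a Python sketch under my own assumptions, not a proposal for the spec text: NextResponse, SegmentingMultiplier, offer, finish_input and max_cases are invented names, and the "amount of time to wait" part of 2b is left out.

```python
from collections import deque
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class NextResponse:
    cases: List[str]                   # (a) zero or more CASes in one response
    retry_later: bool                  # (b) nothing now, but more may come
    remaining_estimate: Optional[int]  # (c) estimate of CASes not yet retrieved


class SegmentingMultiplier:
    """Hypothetical multiplier that hands out queued output CASes in bundles."""

    def __init__(self):
        self.ready = deque()
        self.input_done = False

    def offer(self, cas):
        """Queue an output CAS (strings stand in for real CASes here)."""
        self.ready.append(cas)

    def finish_input(self):
        """Signal that no further output CASes will be produced."""
        self.input_done = True

    def next(self, max_cases=1):
        count = min(max_cases, len(self.ready))
        batch = [self.ready.popleft() for _ in range(count)]
        return NextResponse(
            cases=batch,
            retry_later=not batch and not self.input_done,
            remaining_estimate=len(self.ready),
        )


m = SegmentingMultiplier()
first = m.next(max_cases=2)      # nothing queued yet: caller should retry
for segment in ["seg1", "seg2", "seg3"]:
    m.offer(segment)
second = m.next(max_cases=2)     # two CASes returned, one still waiting
m.finish_input()
third = m.next(max_cases=2)      # last CAS
fourth = m.next(max_cases=2)     # empty and retry_later False: done
```

Whether fields like these belong in the Abstract Interfaces or only in a concrete binding is the same open question as above.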

> Question 3a) I am not sure of the value. I see on page 83 a
> discussion of mapping between type systems and (perhaps) using the
> flow controller to carry out this mapping. But there seems to be an
> equally reasonable mechanism using Analytics to carry out this mapping.
>
> I think the bigger issue relates to some future ability to apply
> transformations to the entire pipeline. Does permitting the
> flowcontroller to modify the CAS bollux up some analysis that might
> "automatically" optimize or compose PEs based on their pre-
> conditions, capabilities and post-conditions? Does it render some
> form of dataflow or consistency analysis impossible that might have
> led to parallelization of this work? I am not sure.

Very good point. If FlowControllers modify the CAS, they would have to be considered as part of any such analysis. The FlowController would have to declare its preconditions, capabilities, and postconditions, and these would have to be taken into account. I'm not sure it renders anything impossible, but it certainly does make it more complicated.
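
As a rough illustration of the extra bookkeeping, here is a toy consistency check in Python. Everything in it is an assumption made for the example: the Declared record, the unmet_preconditions function, and the idea of reducing pre- and postconditions to sets of type names are not taken from the whitepaper or from Apache UIMA.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Declared:
    name: str
    requires: FrozenSet[str]   # preconditions: types that must already be in the CAS
    produces: FrozenSet[str]   # postconditions: types this step adds to the CAS


def unmet_preconditions(initial_types, steps):
    """Return (step name, missing types) for every step whose preconditions fail."""
    available = set(initial_types)
    problems = []
    for step in steps:
        missing = step.requires - available
        if missing:
            problems.append((step.name, sorted(missing)))
        available |= step.produces
    return problems


tokenizer = Declared("tokenizer", frozenset({"Document"}), frozenset({"Token"}))
# A flow controller that maps between type systems, i.e. one that modifies the CAS.
mapper = Declared("flow_controller", frozenset({"Token"}), frozenset({"MappedToken"}))
tagger = Declared("tagger", frozenset({"MappedToken"}), frozenset({"Tag"}))

with_controller = unmet_preconditions({"Document"}, [tokenizer, mapper, tagger])
without_controller = unmet_preconditions({"Document"}, [tokenizer, tagger])
```

If the analysis ignores what the flow controller writes, the tagger looks unsatisfiable; once the controller declares its conditions like any other step, the check passes. So the analysis is still possible, it just has one more kind of participant.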

Regards,
-Adam

"Christopher W. Milner" <milnerc@pf-cvl.net> wrote on 03/29/2007 01:15:21 AM:

> I am not sure of the current convention on commenting on email
> (interleaved, or simply written at the top) so I'll just write at
> the top and will accept constructive comments if there is a "better" way.
>
> Question 1) I think Adam captured the conversation. I am still
> inclined to think the ability to process multiple CASes in one call
> is valuable. I am a bit new to the paradigm and found myself doing
> the following (in GATE): I had one PE that created an external (and
> expensive) object, processed the entire collection of Documents
> (CASes) using the object and then closed the object. I am not even
> sure the API allowed referencing the object across calls to the PE
> (outside of statics, factories and kludgery). It was all quite easy
> to do since I could operate on a collection of CASes.
> Not willing to fall on my sword about it. Perhaps the UIMA/GATE
> interoperability effort provides some insight?
>
> Question 2, a, b and c: w.r.t. part 1, I am inclined to urge for it.
> I've seen the overhead of passing small pieces of info (small CASes)
> up and down various network-related stacks, along with various
> attending negotiations for licenses, Session passing, etc. and would
> sorely miss the ability to bundle up smaller CASes, where needed.
>
> There is also the question of consistency: if I be for multiple CAS
> on the input side, then I should be so on the output side
> (especially for a pipeline).
>
> But this may be my exposure to GATE speaking here.
>
> Parts 2b and c seem essential for flavors of asynchronous
> processing: I'm inclined to vote for it.
>
> Question 3a) I am not sure of the value. I see on page 83 a
> discussion of mapping between type systems and (perhaps) using the
> flow controller to carry out this mapping. But there seems to be an
> equally reasonable mechanism using Analytics to carry out this mapping.
>
> I think the bigger issue relates to some future ability to apply
> transformations to the entire pipeline. Does permitting the
> flowcontroller to modify the CAS bollux up some analysis that might
> "automatically" optimize or compose PEs based on their pre-
> conditions, capabilities and post-conditions? Does it render some
> form of dataflow or consistency analysis impossible that might have
> led to parallelization of this work? I am not sure.
>
> Question 3b: not sure.
>
> -chris
> Christopher W. Milner, Ph.D.
> Science Applications International Corporation
> 675 Peter Jefferson Pkwy.
> Suite 300
> Charlottesville, VA 22911
> 434-872-8517 (Office)
>
> Adam Lally wrote:
>
> Hi,
>
> In our last telecon we agreed the Abstract Interfaces open issues
> should undergo further discussion. Let's see if we can get some
> discussion going before the next call. Here's my summary of what we
> discussed last time:
>
> 1) Analyzer Interface: should it be able to process multiple CASes
> in one call?
>
> We discussed that there are two reasons why we might want to allow
> this. First there is a performance argument: in particular for
> remote services, it may be inefficient to send each document as a
> separate request. Secondly there is the argument that there might
> be an Analytic that needs to see a set of related CASes in order to
> make a decision about how to annotate them.
>
> I think we were in agreement that we at least need to support
> sending multiple CASes for the performance reasons. Possibly this
> can be pushed down to the concrete (SOAP, Java) bindings.
>
> The idea of an Analytic operating on a set of related CASes raises
> more questions. Do we then need a way to declare this in the
> Analytic's Behavioral Metadata? This puts a burden on the caller
> of figuring out what a valid set of CASes is for this Analytic,
> otherwise it will not function properly. Also this approach does
> not scale well - if the number of CASes in this logical set is
> large, we may not be able to actually send them all in one call.
>
> We noted that "CAS Consumer" Analytics, which consider a set of
> CASes in order to update some aggregate data structure, do not need
> to have all of the CASes passed to them in one call. They can see
> them one at a time and keep state across process calls. So a
> logical set of CASes needs to be passed only when the results of the
> analysis are written back to those same CASes. Even this case could
> be addressed with a two-pass flow: The FlowController could send
> each CAS through the Analytic once allowing it to compile aggregate
> statistics, and then send each CAS through again to allow the
> Analytic to add annotations.
>
>
> Below are the other issues in my summary that we did not get a
> chance to discuss on the call. Comments appreciated.
>
> 2) [Box on pg. 62] Does the CAS Multiplier interface need any/all of
> the following capabilities:
>         a) Return more than one CAS at a time
>         b) Return an indication that no more CASes are available
> now, but that the caller should try back later. (The caller may
> specify the amount of time to wait before returning.)
>         c) Return an estimate of how many CASes have not yet been
> retrieved by the caller.
> 3) [Box on pg. 64] Flow Controller Interface:
>         a) Should it be allowed to modify the CAS? (Currently
> whitepaper doesn't allow it, but Apache UIMA implementation does.)
>         b) Should the FlowController interface be kept simple (as in
> the UML diagram in figure 12) or be more like the Apache UIMA
> interface, or somewhere in between?
>
> At the meta-level, to what degree do these need to be specified in
> the Abstract Interfaces section, and what amount of flexibility do
> we leave to specific bindings (concrete interfaces)? This gets to
> the core question of what exactly it means for an implementation
> to comply with the Abstract Interfaces section.
>
> Regards,
> -Adam
> _____________________________
> Adam Lally
> Advisory Software Engineer
> UIMA Framework Lead Developer
> IBM T.J. Watson Research Center
> Hawthorne, NY, 10532
> Tel: 914-784-7706, T/L: 863-7706
> alally@us.ibm.com