Subject: OASIS Call for Participation: OASIS Unstructured Information Management Architecture (UIMA) TC
From: "Mary McRae" <mary.mcrae@oasis-open.org>
To: <members@lists.oasis-open.org>,<tc-announce@lists.oasis-open.org>
Date: Sat, 28 Oct 2006 21:09:35 -0400
To:  OASIS members & interested parties

   A new OASIS technical committee is being formed. The OASIS Unstructured
Information Management Architecture Technical Committee has been proposed by the
members of OASIS listed below.  The proposal, below, meets the requirements of
the OASIS TC Process [a]. The TC name, statement of purpose, scope, list of
deliverables, audience, and language specified in the proposal will constitute
the TC's official charter. Submissions of technology for consideration by the
TC, and the beginning of technical discussions, may occur no sooner than the
TC's first meeting.

   This TC will operate under our 2005 IPR Policy [b]. The eligibility
requirements for becoming a participant in the TC at the first meeting (see
details below) are that:

   (a) you must be an employee of an OASIS member organization or an individual
member of OASIS;
   (b) the OASIS member must sign the OASIS membership agreement [c];
   (c) you must notify the TC chair of your intent to participate at least 15
days prior to the first meeting, which members may do by using the "Join this
TC" button on the TC's public page at [d]; and
   (d) you must attend the first meeting of the TC, at the time and date fixed
below.

Of course, participants also may join the TC at a later time. OASIS and the TC
welcome all interested parties.

   Non-OASIS members who wish to participate may contact us about joining OASIS
[c]. In addition, the public may access the information resources maintained for
each TC: a mail list archive, document repository and public comments facility,
which will be linked from the TC's public home page at [d].

   Please feel free to forward this announcement to any other appropriate lists.
OASIS is an open standards organization; we encourage your feedback.

Regards,

Mary

---------------------------------------------------
Mary P McRae
Manager of TC Administration, OASIS
email: mary.mcrae@oasis-open.org  
web: www.oasis-open.org 

[a] http://www.oasis-open.org/committees/process.php
[b] http://www.oasis-open.org/who/intellectualproperty.php  
[c] See http://www.oasis-open.org/join/ 
[d] http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=uima

CALL FOR PARTICIPATION
OASIS UNSTRUCTURED INFORMATION MANAGEMENT ARCHITECTURE (UIMA) TC


===========================================================
1. The name of the TC, such name not to have been previously used for an OASIS
TC and not to include any trademarks or service marks not owned by OASIS. The
proposed TC name is subject to TC Administrator approval and may not include any
misleading or inappropriate names. The proposed name must specify any acronyms
or abbreviations of the name that shall be used to refer to the TC.
===========================================================

OASIS Unstructured Information Management Architecture (UIMA) Technical
Committee

===========================================================
2. A statement of purpose, including a definition of the problem to be solved.
===========================================================

Unstructured Information and UIM Applications
---------------------------------------------
Unstructured information represents the largest, most current and fastest
growing source of knowledge available to businesses and governments worldwide.
The web is just the tip of the iceberg. Consider the droves of corporate and
technical documentation ranging from best practices, research reports, medical
abstracts, problem reports, customer communications and contracts to emails and
voice mails. Beyond these, consider the growing number of broadcasts containing
audio, video and speech. In these mounds of natural language, speech and video
artifacts often lie nuggets of knowledge critical for analyzing and solving
problems, detecting threats, realizing important trends and relationships,
creating new opportunities or preventing disasters.

- Shaving off just seconds per call to find the right technical documentation in
call-centers can save millions of dollars.

- Rapidly detecting emerging trends in problem-reports coming in from all over
the globe can avoid recalls and save companies and their customers millions if
not billions.

- Analyzing SEC reports can help evaluate corporate financial positions.

- Automating the analysis, segmentation and restructuring of educational content
to better serve changing skill sets or new learning objectives can save many
hours and can better enable just-in-time learning for critical tasks.

- Detecting otherwise unrealized drug interactions through analyzing the
linkages buried in millions of medical abstracts can help prevent disaster as
well as help discover new drugs or cures.

- Analyzing communications linked to terrorist networks in the form of
multi-lingual text, speech or video can help uncover plots threatening national
security before they happen.

These are just a few of the applications that can benefit from the exploitation
of unstructured information.

Applications like these, which rely on the rapid discovery of vital knowledge,
require the analysis of unstructured information. This is all the information
that has NOT been carefully encoded in enterprise databases but rather exists as
natural language text, speech or video.  These applications rely on the rapid
assignment of semantics to huge volumes of unstructured content exactly so that
this content may be structured and exploited by traditional application
infrastructure (e.g., database management systems, knowledgebase systems,
information retrieval systems, etc.).

Unstructured information may be defined as the direct product of human
communication. Examples include natural language documents, email, speech,
images and video. It is information that was not specifically encoded for
machines to process but rather authored by humans for humans to understand. We
say it is "unstructured" because it lacks the explicit semantics ("structure")
required for applications to interpret the information as intended by the human
author or required by the end-user application.

Unstructured information may be contrasted with the information in classic
relational databases, where the intended interpretation of every data field is
explicitly encoded in the database by column headings. Consider information
encoded in XML as another example.  In an XML document some of the data is
wrapped by tags which provide explicit semantic information about how that data
should be interpreted. An XML document or a relational database may be
considered semi-structured in practice, because the content of some chunk of
data, a blob of text in a text field labeled "description" for example, may be
of interest to an application but remain without any explicit tagging, that is,
without any explicit semantics or structure.

For unstructured information to be processed by traditional applications, it
must first be analyzed to assign application-specific semantics to the
unstructured content. Another way to say this is that the unstructured
information must become "structured" where the added structure explicitly
provides the semantics required by target applications to interpret the data. 

An example of assigning semantics includes wrapping regions of text in a text
document with appropriate XML tags that might identify the names of
organizations or products. Another example may extract elements of a document
and insert them in the appropriate fields of a relational database or use them
to create instances of concepts in a knowledgebase. Another example may analyze
a voice stream and tag it with the information explicitly identifying the
speaker.

A simple analysis on documents may, for example, scan each token in each
document of a collection to identify names of organizations. It may insert a tag
wrapping and identifying every found occurrence of an organization name and
output the XML that explicitly annotates each with the appropriate tag. An
application that manages a database of organizations may now use the structured
information produced by the document analysis to populate a relational database.
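
The simple analysis described above can be sketched in a few lines of Python.
This is only an illustration and not part of the proposal: the dictionary
KNOWN_ORGS, the tag name "org" and the function tag_organizations are all
hypothetical, and a real analytic would use far more robust recognition than a
single-token dictionary lookup.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical dictionary of organization names (an assumption for this sketch).
KNOWN_ORGS = {"IBM", "OASIS", "SAIC"}

def tag_organizations(text):
    """Wrap each token found in KNOWN_ORGS in an <org> tag; escape all other text."""
    pieces = []
    # Split into word tokens and the non-word text between them, keeping both.
    for token in re.findall(r"\w+|\W+", text):
        escaped = escape(token)
        if token in KNOWN_ORGS:
            pieces.append("<org>" + escaped + "</org>")
        else:
            pieces.append(escaped)
    return "".join(pieces)

print(tag_organizations("IBM proposed the TC to OASIS & others."))
# prints: <org>IBM</org> proposed the TC to <org>OASIS</org> &amp; others.
```

An application managing a database of organizations could then read the tagged
output and insert each tagged name into the appropriate table.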

In general, we refer to the act of assigning semantics to a region of some
unstructured content (e.g., a document) as "analysis". A software component or
service that performs the analysis is an "analytic".

The semantics are captured by an analytic as structured metadata elements. So
*analytics* implement operations that produce structured metadata elements
describing regions of the unstructured content which they analyze. The generated
metadata may be represented in many different ways, including as XML tags.

We refer to systems that perform analysis on unstructured information as
"Unstructured Information Management (UIM) applications." 

UIM applications tend to be highly decomposable; that is, they may be broken
down into many fine-grained *analytics*.  Each of these performs some
constituent function in an overall analysis flow. 

Analytics and Analysis Frameworks
----------------------------------
Analytics may be reused in different flows to perform different aggregate
analyses. Even in our simple example above, a first, very common function in
the overall process is to tokenize the document (identify each individual word).
This tokenization function may be reused as a first step in many different
analysis tasks for many different applications.
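
The reuse of a tokenizer across flows might look as follows. This is a minimal
sketch under our own assumptions: the function names (tokenize, count_tokens,
find_capitalized, run_flow) are hypothetical, and a flow is modeled simply as a
list of functions applied in order.

```python
import re

def tokenize(text):
    """Shared first step: split text into word tokens."""
    return re.findall(r"\w+", text)

def count_tokens(tokens):
    """One downstream analytic: count the tokens."""
    return len(tokens)

def find_capitalized(tokens):
    """Another downstream analytic: keep tokens starting with an uppercase letter."""
    return [t for t in tokens if t[0].isupper()]

def run_flow(text, steps):
    """Apply each step to the result of the previous one."""
    result = text
    for step in steps:
        result = step(result)
    return result

sample = "The UIMA TC meets in December"
print(run_flow(sample, [tokenize, count_tokens]))      # prints: 6
print(run_flow(sample, [tokenize, find_capitalized]))  # prints: ['The', 'UIMA', 'TC', 'December']
```

The same tokenize step serves both flows unchanged; only the downstream
analytics differ.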

Many software frameworks have been developed in support of building and
integrating component analytics (e.g., Gate, Catalyst, Tipster, Mallet, Talent,
Open-NLP, etc.).  However, no clear standard has emerged for enabling the
interoperability of analytics across modalities (text, audio, video, etc.),
frameworks and programming platforms in support of developing robust and
pluggable UIM applications. 

The UIMA Java Framework is an implementation that arguably comes closest to
addressing the breadth of these requirements. It was originally developed as
part of the UIMA project at IBM Research (http://www.ibm.com/research/uima). It
provides a common, object-oriented and extensible means for representing
unstructured information and its metadata, a set of basic interface definitions
for implementing interoperable analytics and a Java run-time for supporting
analytic composition and deployment (of Java and C++ analytics).

The UIMA Java Framework was released in late 2004 as part of the UIMA Software
Developers Kit (SDK) on IBM AlphaWorks
(http://www.ibm.com/alphaworks/tech/uima). The SDK is freely available and
provides the tools and run-time necessary for creating, composing and deploying
component analytics. These may be implemented by the developer to analyze and
assign semantics to multi-modal data including, for example, combinations of
text, audio and video. 

In early 2006 IBM contributed the UIMA Java Framework to the open-source
community through SourceForge (http://uima-framework.sourceforge.net/). The
open-source project will soon be managed in a venue where IBM and non-IBM
committers can participate in its collaborative development.  Since the
framework's posting, there have been over 8000 downloads of the framework by
industry, government and academia. It has been included in IBM Information
Management products and used in many solutions in areas ranging from life
sciences to national security to customer relationship management.

The Need for a Standard Specification
--------------------------------------
The UIMA Java Framework is an implementation tied to a particular programming
model and platform. It makes many system level commitments based on a variety of
design points. This implementation, however, suggests a more general
specification for interoperability that may allow for different framework
implementations and different levels of compliance supporting interoperability
for a broader range of application and programming requirements. 

We propose to develop the UIMA Specification to explicitly define standard data
specifications, operation types and communication protocols to facilitate
interoperability of analytics at the data and services level. 

This level of specification will serve a critical role in helping to facilitate
lighter-weight interoperability across a broader spectrum of platforms,
programming models, applications and tools for text and multi-modal analytics. 

The intent is that the standard will allow different frameworks to emerge, while
also allowing applications built on different implementations to have a standard
means to share analysis data and services. It will lower the barrier for
component and application developers to interoperate at different levels
allowing a broader community to discover, reuse and compose a growing body of
text and multi-modal analytics.

===========================================================
3. The scope of the work of the TC, which must be germane to the mission of
OASIS, and which includes a definition of what is and what is not the work of
the TC, and how it can be determined when the work of the TC has been completed.
The scope may reference a specific contribution of existing work as a starting
point, but other contributions may be made by Members on or after the first
meeting of the TC. Such other contributions shall be considered by the members
of the TC on an equal basis to improve the original starting point contribution.
===========================================================

The scope of the work of the TC is to generalize from the published UIMA Java
Framework implementation and produce a platform-independent specification in
support of the interoperability, discovery and composition of analytics across
modalities, domain models, frameworks and platforms. 

Specifically, the TC is to consider an initial draft contributed by IBM in the
Research Report based on the UIMA project entitled "Towards an Interoperability
Standard for Text and Multi-Modal Analytics". This report should be used as a
straw man to scope, develop and rationalize a formal UIMA specification.

The TC will address three primary tasks:

	1. Elements of the Specification
	2. Related Issues and Standards
	3. Higher-Level Documentation

Elements of the Specification
------------------------------
The committee will be charged with evaluating, extending, modifying and refining
the proposed eight (8) elements of the UIMA specification. These elements are
dependent on other standards including UML, EMOF, Ecore, XML Schema, XMI, OCL,
WSDL and SOAP.

1. Common Analysis Structure (CAS) Specification. Provides a simple and
extensible typed model for representing analysis data as a standard object model
that may be easily instantiated and manipulated in object-oriented programming
systems.  This element of the specification is provided as a UML model. We
propose adopting the XML Metadata Interchange (XMI) specification
(http://www.omg.org/docs/formal/03-05-02.pdf ) to provide a standard means for
representing analysis data as an XML document.

2. Type System Language Specification. Provides a standard means for associating
object model semantics with artifact metadata that complies with object modeling
standards. We propose to use Ecore as the Type System language. Ecore is the
modeling language used in the Eclipse Modeling Framework and is tightly aligned
with the OMG's EMOF standard
(http://dev.eclipse.org/viewcvs/indextools.cgi/*checkout*/org.eclipse.emf/doc/org.eclipse.emf.doc/references/overview/EMF.html).
An Ecore Type System is represented as an XMI document to support the XML-based
representation and interchange of Type Systems.

3. Type System Base Model. Provides a standard and extensible set of
domain-independent types generally useful for analyzing unstructured
information.

4. The Behavioral Metadata Specification. Provides a standard declarative means
for describing the capabilities of analysis operations in terms of what types of
CASs they can process, what elements in a CAS they can analyze, and what sorts
of effects they would have on CAS contents as a result.  Behavioral metadata
would be used to assist in the discovery and composition of analytics based on
their described function. We propose appealing to the OCL standard
(http://www.omg.org/technology/documents/formal/ocl.htm) to represent behavioral
metadata.

5. Analytic Metadata Specification. Provides a standard declarative means for
describing identification, configuration and behavioral information about
analytics. This specification may be represented as a UML Model from which an
XML Schema may be generated. It refers to the Behavioral Metadata Specification
to represent an analytic's behavioral information.

6. Aggregate Analytic Metadata Specification. Provides a standard declarative
means for an aggregate analytic to:  a. refer to its constituent analytics, b.
identify a flow controller, which determines the order in which the constituent
analytics of the aggregate are invoked on a CAS and  c. define mappings to
facilitate the composition of independently-developed analytics.

7. Abstract Interfaces.  Abstractly describes the interfaces to the two
different types of components or services that developers may implement, namely,
Analytics and Flow Controllers. These abstract interfaces may be specified with
a UML model.

8. Service Descriptions and SOAP Bindings.  Provide a standard means for
implementing Analytics and Flow Controllers as web services using SOAP. This
specification may be represented using WSDL (http://www.w3.org/TR/wsdl20/ ).
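
Elements 4 through 7 above (behavioral metadata, analytic metadata, aggregate
analytics and the abstract interfaces) can be illustrated together with a small
Python sketch. Everything in it is an assumption made for illustration and is
not the proposed specification: the CAS is reduced to a dictionary from type
names to lists of values, behavioral metadata is reduced to the sets "requires"
and "produces", and the names Analytic, fixed_flow and run_aggregate are
hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List

# In this sketch a "CAS" is just a dict mapping a type name to a list of values.
CAS = Dict[str, list]

@dataclass(frozen=True)
class Analytic:
    """A hypothetical analytic: identification, behavioral metadata, operation."""
    name: str
    requires: FrozenSet[str]   # types that must already be present in the CAS
    produces: FrozenSet[str]   # types the analytic adds to the CAS
    process: Callable[[CAS], None]

def fixed_flow(analytics: List[Analytic], cas: CAS) -> List[Analytic]:
    """Hypothetical flow controller: order analytics so each one's required
    types are available before it runs."""
    available = set(cas)
    pending = list(analytics)
    ordered = []
    while pending:
        ready = [a for a in pending if a.requires <= available]
        if not ready:
            raise ValueError("no analytic can run; unmet requirements")
        chosen = ready[0]
        ordered.append(chosen)
        available |= chosen.produces
        pending.remove(chosen)
    return ordered

def run_aggregate(analytics, cas, flow_controller=fixed_flow):
    """An aggregate analytic: invoke constituents in the controller's order."""
    for analytic in flow_controller(analytics, cas):
        analytic.process(cas)
    return cas

def add_tokens(cas):
    cas["Token"] = cas["Text"][0].split()

def add_acronyms(cas):
    # Crude stand-in for real analysis: treat all-uppercase tokens as acronyms.
    cas["Acronym"] = [t for t in cas["Token"] if t.isupper()]

tokenizer = Analytic("tokenizer", frozenset({"Text"}), frozenset({"Token"}), add_tokens)
acronym_finder = Analytic("acronym_finder", frozenset({"Token"}), frozenset({"Acronym"}), add_acronyms)

# The constituents are listed out of order; the flow controller sorts them out.
cas = run_aggregate([acronym_finder, tokenizer], {"Text": ["IBM contributed UIMA to OASIS"]})
print(cas["Acronym"])  # prints: ['IBM', 'UIMA', 'OASIS']
```

Because each analytic declares what it requires and produces, a flow controller
or a discovery service can reason about composition without inspecting the
analytic's implementation.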

Related Issues, Requirements and Standards
------------------------------------------
In addition, the UIMA TC will be charged with providing recommendations
regarding how other requirements should or should NOT be addressed or related
to by the UIMA specification, including:

1. CAS representations for efficient stream operations
2. Representing and Recording Provenance Information
3. Privacy and Security Issues
4. General alignment with ontologies and related representational
	standards including OWL and RDF
5. Facilities for mapping between metadata models (e.g., XSLT)
6. Support for existing metadata models and their representations
	(VoiceXML, LegalXML, MPEG-7, etc.)
7. Componentization, life-cycle management and related standards
	(e.g., OSGi)
8. Discovery-services in support of finding analytics based on
	identification and behavioral metadata
9. Analytic configuration management

High-Level Documentation
-------------------------
The UIMA TC should produce higher-level documentation to help motivate and
promote the UIMA specification as a standard that may include use-cases,
case-studies and high-level architectural descriptions but excludes detailed
formalizations.

Out of Scope
-------------
Finally, the UIMA TC will NOT address platform-dependent specifications,
including the definition of programming models or object-oriented APIs, the
binding of interfaces to any particular programming language, workflow engines
or languages, or the implementation or integration of system middleware services
to address the scalability, componentization or life-cycle management of
framework implementations. The UIMA TC will NOT define any specific domain model
(e.g., set of XML tags or types) for marking up unstructured information.

===========================================================
4. A list of deliverables, with projected completion dates.
===========================================================

1. Initial Use Cases...............................2Q 2007 
2. The CAS Model...................................3Q 2007 
3. The CAS XMI Specification.......................3Q 2007 
4. The Type System Language........................3Q 2007 
5. The Type System Base Model......................3Q 2007 
6. Behavioral Metadata ............................4Q 2007 
7. Analytic Metadata ..............................4Q 2007 
8. Aggregate Analytic Metadata.....................4Q 2007 
9. Abstract Interfaces.............................4Q 2007 
10. Service WSDL Descriptions......................4Q 2007 
11. Recommendations regarding related 
	requirements..................................4Q 2007 
12. Appendix: SOAP Bindings........................4Q 2007 
13. Appendix: Java Framework Compliance Notes......4Q 2007 
14. Appendix: Design Patterns......................4Q 2007

===========================================================
5. Specification of the IPR Mode under which the TC will operate.
===========================================================

RF on Limited Terms

===========================================================
6. The anticipated audience or users of the work.
===========================================================

1. UIMA Java Framework developers
2. Text Analysis Vendors
3. Search and Knowledge Discovery Vendors 
4. Document Management Vendors 
5. Video and Speech Analysis Vendors 
6. Machine Translation Vendors 
7. Government Contractors 
8. US and other Government agencies 
9. R&D in Life-Sciences and BioInformatics 
10. Universities performing research in text & multi-modal 
	analytics
11. Publishing 

===========================================================
7. The language in which the TC shall conduct business.
===========================================================

English

===========================================================
Non-normative information regarding the startup of the TC

===========================================================
1. Identification of similar or applicable work that is being done in other
OASIS TCs or by other organizations, why there is a need for another effort in
this area and how this proposed TC will be different, and what level of liaison
will be pursued with these other organizations.
===========================================================
Domain-Model Independence and Stand-off Annotations
----------------------------------------------------
We refer to *analytics* as operations that analyze unstructured content to
produce structured metadata elements intended to describe regions of the
unstructured content.  

The UIMA Specification is focused on supporting interoperability across analytic
implementations, that is, on helping analytic developers discover, reuse and
compose each other's analytics in their applications.

Essential to the UIMA Specification is its independence of any particular
domain-level data model that may describe some set of annotation types.  These
types vary widely and cover a potentially infinite space of concepts and
relationships. Domain-level models may, for example, include "persons",
"places", and "things" or "noun phrases" and "verb phrases" or "events",
"opinions", "sentiments" and "temporal relations" or "chemical names" and
"chemical reactions," etc.

The UIMA Specification therefore proposes a general and expressive underlying
representation scheme based on object modeling standards and represents
annotations as "stand-off" labels over regions of the unstructured content.
Regions may, of course, include entire documents, segments or even collections
thereof.

"Stand-off" means that metadata elements that label regions of content are
represented as objects in an object model that "point into" or reference the
unstructured content (e.g., document or video stream) rather than "insert" some
type of tag or marker directly into the content, changing its original form. In
UIMA the original content is not affected in the analysis process. Rather, an
object graph is produced that stands off from and annotates the content.
Stand-off annotations allow for multiple content interpretations of graph
complexity to be produced, coexist, overlap and be retracted without affecting
the original content representation.
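
A stand-off annotation can be pictured with a minimal Python sketch. The class
name Annotation, its fields and the use of character offsets are assumptions
made for illustration; the proposal does not prescribe this representation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Annotation:
    """A stand-off label: a type name plus character offsets into the content."""
    type_name: str
    begin: int   # inclusive start offset
    end: int     # exclusive end offset

    def covered_text(self, content):
        """Return the region of the content that this annotation points into."""
        return content[self.begin:self.end]

content = "IBM Research contributed UIMA."

# Overlapping interpretations of the same region coexist in the annotation list.
annotations = [
    Annotation("Organization", 0, 3),    # "IBM"
    Annotation("Organization", 0, 12),   # "IBM Research"
    Annotation("Framework", 25, 29),     # "UIMA"
]

for a in annotations:
    print(a.type_name, repr(a.covered_text(content)))

# Retracting an interpretation changes only the annotation list;
# the content string itself is never modified.
annotations.remove(Annotation("Organization", 0, 3))
```

After the retraction, the content is byte-for-byte what it was before any
analysis took place.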

The object model representing the stand-off annotations may be used to produce
different representations of the analysis results. A common form for capturing
document metadata, for example, is in-line XML.  An analytic in a UIM
application can generate from the UIMA representation an in-line XML document
that conforms to some particular domain model or markup language.
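
Generating in-line XML from stand-off annotations might be sketched as follows.
This is an illustration under simplifying assumptions: annotations are plain
(begin, end, tag) tuples, the tag names are hypothetical, and overlapping
annotations are rejected rather than nested.

```python
from xml.sax.saxutils import escape

def to_inline_xml(content, annotations):
    """Render non-overlapping (begin, end, tag) annotations as in-line XML."""
    pieces = []
    cursor = 0
    for begin, end, tag in sorted(annotations):
        if begin < cursor:
            raise ValueError("overlapping annotations are not supported in this sketch")
        # Copy the untagged text before this annotation, then the tagged region.
        pieces.append(escape(content[cursor:begin]))
        pieces.append("<%s>%s</%s>" % (tag, escape(content[begin:end]), tag))
        cursor = end
    # Copy whatever follows the last annotation.
    pieces.append(escape(content[cursor:]))
    return "".join(pieces)

content = "IBM Research contributed UIMA."
annotations = [(25, 29, "framework"), (0, 12, "organization")]
print(to_inline_xml(content, annotations))
# prints: <organization>IBM Research</organization> contributed <framework>UIMA</framework>.
```

The in-line document is a derived view; the stand-off annotations and the
original content remain the source from which other representations can be
produced.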

XML tag sets and other data models
-----------------------------------
The UIMA specification is NOT focused on proposing any particular data model
that a set of analytics may use to implement their metadata. Domain-level tag
sets, markup languages or data models are all orthogonal and complementary to
the UIMA specification. 
For example, there are multiple efforts to produce XML standards around text,
voice and video analysis (e.g., VoiceXML, MPEG-7, LegalXML). These efforts
address the definition of specific models for annotating unstructured
information that may be represented as UIMA Type Systems using UML for example. 

UIMA is intended to address a standard focused on supporting the development,
discovery and composition of analysis operations that may process combinations
of all sorts of unstructured information independently of the type of content or
the metadata model that may describe it.

UIMA focuses on how to characterize data and operations that describe and act on
the unstructured content so that they may be discovered and composed to
efficiently perform aggregate analysis tasks. The output of any analysis process
on unstructured information may be ultimately mapped to any one of these
modality and/or domain specific XML-based standards.

Document Structure and Operations
-----------------------------------
The UIMA specification is independent of any particular model for representing
or manipulating some chunk of unstructured content. This differs from UOML
(http://www.oasis-open.org/events/symposium_2006/slides/Wang.pdf) for example,
which proposes an abstract structure for document text and for a basic set of
operations for manipulating that defined structure (e.g., get, set, insert,
delete). UIMA on the other hand proposes a common way for analytic developers to
describe and exchange models of unstructured content and to describe the
behavior of the analytics that inspect and produce metadata about that
unstructured content. The content itself may be physically represented in any
number of ways.
 
===========================================================
2. Optionally, a list of contributions of existing technical work that the
proposers anticipate will be made to this TC.
===========================================================
1. IBM Research Report based on the UIMA project entitled "Towards an
Interoperability Standard for Text and Multi-Modal Analytics". This report
represents a straw man specification that embodies consideration of several
high-level use cases and numerous projects inside and outside of IBM focused on
developing frameworks and applications for text and multi-modal analytics. 
It highlights how the proposed ideas relate to current implementations and
raises a number of discussion points and open issues around developing such a
standard.

2. The UIMA Java Software Development Kit (SDK) User Manual
http://dl.alphaworks.ibm.com/technologies/uima/UIMA_SDK_Users_Guide_Reference_2.0.pdf

===========================================================
3. Optionally, a proposed working title and acronym for the
specification(s) to be developed by the TC.
===========================================================

UIMA Specification

===========================================================
4. The date, time, and location of the first meeting, whether it will be held in
person or by phone, and who will sponsor this first meeting. The first meeting
of a TC shall occur no less than 30 days after the announcement of its formation
in the case of a telephone or other electronic meeting, and no less than 45 days
after the announcement of its formation in the case of a face-to-face meeting.
===========================================================

Date: December 6, 2006
Time: 10:00 AM Eastern
Duration: 2 Hours
Mode: Teleconference
Number: TBD
Sponsor: IBM

===========================================================
5. The projected on-going meeting schedule for the year following the formation
of the TC, or until the projected date of the final deliverable, whichever comes
first, and who will be expected to sponsor these meetings.
===========================================================

Bi-weekly 90 Minute Teleconferences sponsored by IBM.

===========================================================
6. The names, electronic mail addresses, and membership affiliations of at least
Minimum Membership who support this proposal and are committed to the Charter
and projected meeting schedule
===========================================================

In alphabetical order by last name:

1. Sophia Ananiadou, Sophia.ananiadou@manchester.ac.uk, University of
Manchester, UK
2. Christopher G. Chute, chute@mayo.edu, Mayo Clinic College of Medicine 
3. Pascal Coupet, pascal.coupet@temis.com 
4. Hamish Cunningham, hamish@dcs.shef.ac.uk, University of Sheffield 
5. David Ferrucci, ferrucci@us.ibm.com, IBM 
6. Thomas Hampp, thomas.hampp@de.ibm.com, IBM 
7. Jonathan D. Michel, Jonathan.D.Michel@saic.com, SAIC 
8. Carl Madson, carl.madson@sri.com, SRI 
9. Tim Miller, Tim.Miller@thomson.com, Thomson 
10. Eric Nyberg, ehn@cs.cmu.edu, Carnegie Mellon University 
11. Laurent Proulx, laurent.proulx@nstein.com, Nstein Technologies 
12. Alex Rankov, Rankov_Alex@emc.com, EMC 
13. Antonio Sanfilippo, antonio.sanfilippo@pnl.gov 
14. Junichi Tsujii, J.Tsujii@manchester.ac.uk, School of Computer Science,
University of Manchester, UK 
15. Suzi Yoakum-Stover, suzanne.yoakum-stover@saic.com, Army Intelligence and
Information Warfare Directorate

===========================================================
7. The name of the Convener who must be an Eligible Person.
===========================================================

Convener: David A. Ferrucci