Subject: OASIS Call for Participation: OASIS Unstructured Information Management Architecture (UIMA) TC
From: "Mary McRae" <mary.mcrae@oasis-open.org>
To: <members@lists.oasis-open.org>,<tc-announce@lists.oasis-open.org>
Date: Sat, 28 Oct 2006 21:26:35 -0400
To:  OASIS members & interested parties

   A new OASIS technical committee is being formed. The OASIS Unstructured
Information Management Architecture Technical Committee has been proposed by the
members of OASIS listed below.  The proposal, below, meets the requirements of
the OASIS TC Process [a]. The TC name, statement of purpose, scope, list of
deliverables, audience, and language specified in the proposal will constitute
the TC's official charter. Submissions of technology for consideration by the
TC, and the beginning of technical discussions, may occur no sooner than the
TC's first meeting.

   This TC will operate under our 2005 IPR Policy [b]. The eligibility
requirements for becoming a participant in the TC at the first meeting (see
details below) are that:

   (a) you must be an employee of an OASIS member organization or an individual
member of OASIS;
   (b) the OASIS member must sign the OASIS membership agreement [c];
   (c) you must notify the TC chair of your intent to participate at least 15
days prior to the first meeting, which members may do by using the "Join this
TC" button on the TC's public page at [d]; and
   (d) you must attend the first meeting of the TC, at the time and date fixed
below.

Of course, participants also may join the TC at a later time. OASIS and the TC
welcome all interested parties.

   Non-OASIS members who wish to participate may contact us about joining OASIS
[c]. In addition, the public may access the information resources maintained for
each TC: a mail list archive, document repository and public comments facility,
which will be linked from the TC's public home page at [d].

   Please feel free to forward this announcement to any other appropriate lists.
OASIS is an open standards organization; we encourage your feedback.

Regards,

Mary

---------------------------------------------------
Mary P McRae
Manager of TC Administration, OASIS
email: mary.mcrae@oasis-open.org
web: www.oasis-open.org 

[a] http://www.oasis-open.org/committees/process.php
[b] http://www.oasis-open.org/who/intellectualproperty.php
[c] See http://www.oasis-open.org/join/ 
[d] http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=uima

CALL FOR PARTICIPATION
OASIS UNSTRUCTURED INFORMATION MANAGEMENT ARCHITECTURE (UIMA) TC


===========================================================
1. The name of the TC, such name not to have been previously used 
for an OASIS TC and not to include any trademarks or service 
marks not owned by OASIS. The proposed TC name is subject to TC 
Administrator approval and may not include any misleading or 
inappropriate names. The proposed name must specify any acronyms 
or abbreviations of the name that shall be used to refer to the 
TC.
===========================================================

OASIS Unstructured Information Management Architecture (UIMA) 
Technical Committee

===========================================================
2. A statement of purpose, including a definition of the problem 
to be solved.
===========================================================

Unstructured Information and UIM Applications
---------------------------------------------
Unstructured information represents the largest, most current and 
fastest growing source of knowledge available to businesses and 
governments worldwide.  The web is just the tip of the iceberg. 
Consider the droves of corporate and technical documentation 
ranging from best practices, research reports, medical abstracts, 
problem reports, customer communications and contracts to emails 
and voice mails. Beyond these consider the growing number of 
broadcasts containing audio, video and speech. In these mounds of 
natural language, speech and video artifacts often lie nuggets of 
knowledge critical for analyzing and solving problems, detecting 
threats, realizing important trends and relationships, creating 
new opportunities or preventing disasters. 

- Shaving off just seconds per call to find the right technical 
documentation in call-centers can save millions of dollars.

- Rapidly detecting emerging trends in problem-reports coming in 
from all over the globe can avoid recalls and save companies and 
their customers millions if not billions.

- Analyzing SEC reports can help evaluate corporate financial 
positions.

- Automating the analysis, segmentation and restructuring of 
educational content to better serve changing skill sets or new 
learning objectives can save many hours and can better enable 
just-in-time learning for critical tasks.

- Detecting otherwise unrealized drug interactions through 
analyzing the linkages buried in millions of medical abstracts 
can help prevent disaster as well as help discover new drugs or 
cures.

- Analyzing communications linked to terrorist networks in the 
form of multi-lingual text, speech or video can help uncover 
plots threatening national security before they happen.

These are just a few of the applications that can benefit from 
the exploitation of unstructured information.

Applications like these, which rely on the rapid discovery of 
vital knowledge, require the analysis of unstructured 
information. This is all the information that has NOT been 
carefully encoded in enterprise databases but rather exists as 
natural language text, speech or video.  These applications rely 
on the rapid assignment of semantics to huge volumes of 
unstructured content exactly so that this content may be 
structured and exploited by traditional application 
infrastructure (e.g., database management systems, knowledgebase 
systems, information retrieval systems, etc.).

Unstructured information may be defined as the direct product of 
human communication. Examples include natural language documents, 
email, speech, images and video. It is information that was not 
specifically encoded for machines to process but rather authored 
by humans for humans to understand. We say it is "unstructured" 
because it lacks explicit semantics ("structure") required for 
applications to interpret the information as intended by the 
human author or required by the end-user application.  

Unstructured information may be contrasted with the information 
in classic relational databases where the intended interpretation 
for every data field is explicitly encoded in the database by 
column headings. Consider information encoded in XML as another 
example.  In an XML document some of the data is wrapped by tags 
which provide explicit semantic information about how that data 
should be interpreted. An XML document or a relational database 
may be considered semi-structured in practice, because the 
content of some chunk of data, a blob of text in a text field 
labeled "description" for example, may be of interest to an 
application but remain without any explicit tagging, that is, 
without any explicit semantics or structure.

For unstructured information to be processed by traditional 
applications, it must be first analyzed to assign application-
specific semantics to the unstructured content. Another way to 
say this is that the unstructured information must become 
"structured" where the added structure explicitly provides the 
semantics required by target applications to interpret the data. 

An example of assigning semantics includes wrapping regions of 
text in a text document with appropriate XML tags that might 
identify the names of organizations or products. Another example 
may extract elements of a document and insert them in the 
appropriate fields of a relational database or use them to create 
instances of concepts in a knowledgebase. Another example may 
analyze a voice stream and tag it with the information explicitly 
identifying the speaker.

A simple analysis on documents may, for example, scan each token 
in each document of a collection to identify names of 
organizations. It may insert a tag wrapping and identifying every 
found occurrence of an organization name and output the XML that 
explicitly annotates each with the appropriate tag. An 
application that manages a database of organizations may now use 
the structured information produced by the document analysis to 
populate a relational database.
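
As a minimal sketch of the simple analysis described above, the 
Python fragment below scans the tokens of a document and wraps 
each known organization name in an XML tag. The gazetteer 
KNOWN_ORGS, the function name tag_organizations and the 
<organization> and <document> tags are assumptions made purely 
for illustration; they are not taken from the UIMA framework or 
from any proposed specification.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical gazetteer of single-token organization names (illustrative only).
KNOWN_ORGS = {"IBM", "OASIS", "SAIC"}

def tag_organizations(text):
    """Return text as XML with each known organization name wrapped in a tag."""
    # Split into word tokens and the non-word runs between them, keeping both
    # so that the original spacing and punctuation are preserved.
    pieces = re.findall(r"\w+|\W+", text)
    out = []
    for piece in pieces:
        if piece in KNOWN_ORGS:
            out.append("<organization>" + escape(piece) + "</organization>")
        else:
            out.append(escape(piece))
    return "<document>" + "".join(out) + "</document>"
```

A downstream application could parse the returned XML and insert 
the content of each <organization> element into a database table. 
A real analytic would need more than a lookup list, for example 
to handle multi-word names.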

In general, we refer to the act of assigning semantics to a 
region of some unstructured content (e.g., a document) as 
"analysis".   A software component or service that performs the 
analysis is an "analytic". 

The semantics are captured by an analytic as structured metadata 
elements. So *analytics* implement operations that produce 
structured metadata elements describing regions of the 
unstructured content which they analyze. The generated metadata 
may be represented in many different ways including as XML tags. 

We refer to systems that perform analysis on unstructured 
information as "Unstructured Information Management (UIM) 
applications." 

UIM applications tend to be highly decomposable; that is, they 
may be broken down into many fine-grained *analytics*.  Each of 
these performs some constituent function in an overall analysis 
flow. 

Analytics and Analysis Frameworks 
----------------------------------
Analytics may be reused in different flows to perform different 
aggregate analyses. Even in our simple example above, a very 
common first function in the overall process is to tokenize the 
document (identify each individual word). This tokenization 
function may be reused as a first step in many different analysis 
tasks for many different applications.
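
The reuse of a tokenization function across flows might be 
sketched as follows. This is an illustration under stated 
assumptions, not the UIMA programming model: the helper 
make_flow and the two downstream analytics (count_tokens and 
find_capitalized) are hypothetical names introduced here.

```python
import re

def tokenize(text):
    """Shared first step: split text into word tokens."""
    return re.findall(r"\w+", text)

def count_tokens(tokens):
    """One downstream analytic: report the number of tokens."""
    return len(tokens)

def find_capitalized(tokens):
    """Another downstream analytic: keep tokens that start in uppercase."""
    return [t for t in tokens if t[0].isupper()]

def make_flow(*steps):
    """Compose analytics into a flow that applies each step in order."""
    def flow(data):
        for step in steps:
            data = step(data)
        return data
    return flow

# The same tokenizer is reused as the first step of two different flows.
length_flow = make_flow(tokenize, count_tokens)
name_candidate_flow = make_flow(tokenize, find_capitalized)
```

Each flow is an aggregate analysis built from fine-grained steps, 
and the tokenizer is written once and shared by both.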

Many software frameworks have been developed in support of 
building and integrating component analytics (e.g., Gate, 
Catalyst, Tipster, Mallet, Talent, Open-NLP, etc.).  However, no 
clear standard has emerged for enabling the interoperability of 
analytics across modalities (text, audio, video, etc.), 
frameworks and programming platforms in support of developing 
robust and pluggable UIM applications. 

The UIMA Java Framework is an implementation that arguably comes 
closest to addressing the breadth of these requirements. It was 
originally developed as part of the UIMA project at IBM Research 
(http://www.ibm.com/research/uima). It provides a common, object-
oriented and extensible means for representing unstructured 
information and its metadata, a set of basic interface 
definitions for implementing interoperable analytics and a Java 
run-time for supporting analytic composition and deployment (of 
Java and C++ analytics).

The UIMA Java Framework was released in late 2004 as part of the 
UIMA Software Developers Kit (SDK) on IBM AlphaWorks 
(http://www.ibm.com/alphaworks/tech/uima). The SDK is freely 
available and provides the tools and run-time necessary for 
creating, composing and deploying component analytics. These may 
be implemented by the developer to analyze and assign semantics 
to multi-modal data including, for example, combinations of text, 
audio and video. 

In early 2006 IBM contributed the UIMA Java Framework to the 
open-source community through SourceForge
(http://uima-framework.sourceforge.net/). The open-source project 
will soon be managed in a venue where IBM and non-IBM committers 
can participate in its collaborative development.  Since the 
framework's posting, there have been over 8000 downloads of the 
framework by industry, government and academia. It has been 
included in IBM Information Management products and used in many 
solutions in areas ranging from life sciences to national 
security to customer relationship management.

The Need for a Standard Specification
--------------------------------------
The UIMA Java Framework is an implementation tied to a 
particular programming model and platform. It makes many 
system-level commitments based on a variety of design 
points. This implementation, however, suggests a more 
general specification for interoperability that may allow 
for different framework implementations and different 
levels of compliance supporting interoperability for a 
broader range of application and programming requirements. 

We propose to develop the UIMA Specification to explicitly 
define standard data specifications, operation types and 
communication protocols to facilitate interoperability of 
analytics at the data and services level. 

This level of specification will serve a critical role in helping 
to facilitate lighter-weight interoperability across a broader 
spectrum of platforms, programming models, applications and tools 
for text and multi-modal analytics. 

The intent is that the standard will allow different 
frameworks to emerge, while also allowing applications 
built on different implementations to have a standard means 
to share analysis data and services. It will lower the 
barrier for component and application developers to 
interoperate at different levels allowing a broader 
community to discover, reuse and compose a growing body of 
text and multi-modal analytics.

===========================================================
3. The scope of the work of the TC, which must be germane to the 
mission of OASIS, and which includes a definition of what is and 
what is not the work of the TC, and how it can be determined when 
the work of the TC has been completed. The scope may reference a 
specific contribution of existing work as a starting point, but 
other contributions may be made by Members on or after the first 
meeting of the TC. Such other contributions shall be considered 
by the members of the TC on an equal basis to improve the 
original starting point contribution.
===========================================================

The scope of the work of the TC is to generalize from the 
published UIMA Java Framework implementation and produce a 
platform-independent specification in support of the 
interoperability, discovery and composition of analytics across 
modalities, domain models, frameworks and platforms. 

Specifically, the TC is to consider an initial draft contributed 
by IBM in the Research Report based on the UIMA project entitled 
"Towards an Interoperability Standard for Text and Multi-Modal 
Analytics". This report should be used as a straw man to scope, 
develop and rationalize a formal UIMA specification.

The TC will address three primary tasks:

	1. Elements of the Specification
	2. Related Issues and Standards
	3. Higher-Level Documentation

Elements of the Specification
------------------------------
The committee will be charged with evaluating, extending, 
modifying and refining the proposed eight (8) elements of the 
UIMA specification. These elements are dependent on other 
standards including UML, eMOF, eCore, XML Schema, XMI, OCL, WSDL 
and SOAP.

1. Common Analysis Structure (CAS) Specification. Provides a 
simple and extensible typed model for representing analysis data 
as a standard object model that may be easily instantiated and 
manipulated in object-oriented programming systems.  This element 
of the specification is provided as a UML model. We propose 
adopting the XML Metadata Interchange (XMI) specification 
(http://www.omg.org/docs/formal/03-05-02.pdf ) to provide a 
standard means for representing analysis data as an XML document.

2. Type System Language Specification. Provides a standard means 
for associating object model semantics with artifact metadata 
that complies with object modeling standards. We propose to use 
Ecore as the Type System language. Ecore is the modeling language 
used in the Eclipse Modeling Framework and is tightly aligned 
with the OMG's EMOF standard 
(http://dev.eclipse.org/viewcvs/indextools.cgi/*checkout*/org.eclipse.emf/doc/org.eclipse.emf.doc/references/overview/EMF.html ). 
An Ecore Type System is represented as an XMI document to support 
the XML-based representation and interchange of Type Systems.

3. Type System Base Model. Provides a standard and extensible set 
of domain-independent types generally useful for analyzing 
unstructured information.

4. The Behavioral Metadata Specification. Provides a standard 
declarative means for describing the capabilities of analysis 
operations in terms of what types of CASs they can process, what 
elements in a CAS they can analyze, and what sorts of effects 
they would have on CAS contents as a result.  Behavioral metadata 
would be used to assist in the discovery and composition of 
analytics based on their described function. We propose appealing 
to the OCL standard 
(http://www.omg.org/technology/documents/formal/ocl.htm ) to 
represent behavioral metadata.

5. Analytic Metadata Specification. Provides a standard 
declarative means for describing identification, configuration 
and behavioral information about analytics. This specification 
may be represented as a UML Model from which an XML Schema may be 
generated. It refers to the Behavioral Metadata Specification to 
represent an analytic's behavioral information.

6. Aggregate Analytic Metadata Specification. Provides a standard 
declarative means for an aggregate analytic to:  a. refer to its 
constituent analytics, b. identify a flow controller, which 
determines the order in which the constituent analytics of the 
aggregate are invoked on a CAS and  c. define mappings to 
facilitate the composition of independently-developed analytics.

7. Abstract Interfaces.  Abstractly describes the interfaces to 
the two different types of components or services that developers 
may implement, namely, Analytics and Flow Controllers. These 
abstract interfaces may be specified with a UML model.

8. Service Descriptions and SOAP Bindings.  Provide a standard 
means for implementing Analytics and Flow Controllers as web 
services using SOAP. This specification may be represented using 
WSDL (http://www.w3.org/TR/wsdl20/ ).
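
To make the relationship between elements 6 and 7 more concrete, 
the following Python sketch shows an aggregate analytic that 
refers to its constituent analytics and delegates their ordering 
to a flow controller. It is an illustration only and not a 
proposed API: the charter leaves programming models and 
object-oriented APIs out of scope. Every class and method name 
here (Cas, Analytic, FlowController, FixedFlow, AggregateAnalytic 
and the two example analytics) is hypothetical, and the mappings 
mentioned in element 6c are not modeled.

```python
from dataclasses import dataclass, field

@dataclass
class Cas:
    """Hypothetical stand-in for a CAS: the content plus analysis results."""
    content: str
    metadata: list = field(default_factory=list)

class Analytic:
    """Abstract interface: an analytic processes a CAS in place."""
    def process(self, cas):
        raise NotImplementedError

class FlowController:
    """Abstract interface: decides the order in which analytics are invoked."""
    def order(self, names, cas):
        raise NotImplementedError

class FixedFlow(FlowController):
    """Simplest possible controller: invoke analytics in the declared order."""
    def order(self, names, cas):
        return list(names)

class AggregateAnalytic(Analytic):
    """An analytic built from named constituents and a flow controller."""
    def __init__(self, constituents, flow_controller):
        self.constituents = constituents        # dict: name -> Analytic
        self.flow_controller = flow_controller

    def process(self, cas):
        for name in self.flow_controller.order(self.constituents.keys(), cas):
            self.constituents[name].process(cas)

class LengthAnalytic(Analytic):
    """Example constituent: records the length of the content."""
    def process(self, cas):
        cas.metadata.append(("length", len(cas.content)))

class UppercaseCountAnalytic(Analytic):
    """Example constituent: records the number of uppercase characters."""
    def process(self, cas):
        cas.metadata.append(("uppercase", sum(c.isupper() for c in cas.content)))
```

Because the aggregate itself implements the Analytic interface, 
it could in turn serve as a constituent of a larger aggregate.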

Related Issues, Requirements and Standards
------------------------------------------
In addition, the UIMA TC will be charged with providing 
recommendations regarding how other requirements should or should 
NOT be addressed by, or related to, the UIMA specification, 
including:

1. CAS representations for efficient stream operations
2. Representing and Recording Provenance Information
3. Privacy and Security Issues
4. General alignment with ontologies and related representational 
	standards including OWL and RDF
5. Facilities for mapping between metadata models (e.g., XSLT)
6. Support for existing metadata models and their representations 
	(VoiceXML, LegalXML, MPEG-7, etc.)
7. Componentization, life-cycle management and related standards 
	(e.g., OSGi)
8. Discovery-services in support of finding analytics based on 
	identification and behavioral metadata
9. Analytic configuration management

High-Level Documentation
-------------------------
The UIMA TC should produce higher-level documentation to help 
motivate and promote the UIMA specification as a standard that 
may include use-cases, case-studies and high-level architectural 
descriptions but excludes detailed formalizations.

Out of Scope
-------------
Finally, the UIMA TC will NOT address platform-dependent 
specifications, including the definition of programming models or 
object-oriented APIs, the binding of interfaces to any particular 
programming language, workflow engines or languages, or the 
implementation or integration of system middleware services to 
address the scalability, componentization or life-cycle 
management of framework implementations. The UIMA TC will also NOT 
define any specific domain model (e.g., set of XML tags or types) 
for marking up unstructured information.

===========================================================
4. A list of deliverables, with projected completion dates.
===========================================================

1. Initial Use Cases...............................2Q 2007
2. The CAS Model...................................3Q 2007
3. The CAS XMI Specification.......................3Q 2007
4. The Type System Language........................3Q 2007
5. The Type System Base Model......................3Q 2007
6. Behavioral Metadata ............................4Q 2007
7. Analytic Metadata ..............................4Q 2007
8. Aggregate Analytic Metadata.....................4Q 2007
9. Abstract Interfaces.............................4Q 2007
10. Service WSDL Descriptions......................4Q 2007
11. Recommendations regarding related 
	requirements..................................4Q 2007
12. Appendix: SOAP Bindings........................4Q 2007
13. Appendix: Java Framework Compliance Notes......4Q 2007
14. Appendix: Design Patterns......................4Q 2007

===========================================================
5. Specification of the IPR Mode under which the TC will operate.
===========================================================

RF on Limited Terms

===========================================================
6. The anticipated audience or users of the work.
===========================================================

1. UIMA Java Framework developers 
2. Text Analysis Vendors 
3. Search and Knowledge Discovery Vendors 
4. Document Management Vendors
5. Video and Speech Analysis Vendors 
6. Machine Translation Vendors 
7. Government Contractors 
8. US and other Government agencies
9. R&D in Life-Sciences and BioInformatics
10. Universities performing research in text & multi-modal 
	analytics
11. Publishing 

===========================================================
7. The language in which the TC shall conduct business.
===========================================================

English

===========================================================
Non-normative information regarding the startup of the TC

===========================================================
1. Identification of similar or applicable work that is being 
done in other OASIS TCs or by other organizations, why there is a 
need for another effort in this area and how this proposed TC 
will be different, and what level of liaison will be pursued with 
these other organizations.
===========================================================
Domain-Model Independence and Stand-off Annotations
----------------------------------------------------
We refer to *analytics* as operations that analyze unstructured 
content to produce structured metadata elements intended to 
describe regions of the unstructured content.  

The UIMA Specification is focused on supporting interoperability 
across analytic implementations, that is, on helping analytic 
developers discover, reuse and compose each other's analytics in 
their applications.

Essential to the UIMA Specification is its independence of any 
particular domain-level data model that may describe some set of 
annotation types.  These types vary widely and cover a 
potentially infinite space of concepts and relationships. Domain-
level models may for example include  "persons", "places", and 
"things" or "noun phrases" and "verb phrases" or "events", 
"opinions", "sentiments" and "temporal relations" or "chemical 
names" and "chemical reactions," etc. 

The UIMA Specification therefore proposes a general and 
expressive underlying representation scheme based on object 
modeling standards and represents annotations as "stand-off" 
labels over regions of the unstructured content. Regions may, of 
course, include entire documents, segments or even collections 
thereof.

"Stand-off" means that metadata elements that label regions of 
content are represented as objects in an object model that "point 
into" or reference the unstructured content (e.g., document or 
video stream) rather than "insert" some type of tag or marker 
directly into the content changing its original form. In UIMA the 
original content is not affected in the analysis process.  
Rather, an object graph is produced that stands off from and 
annotates the content. Stand-off annotations allow for multiple 
content interpretations of graph complexity to be produced, co-
exist, overlap and be retracted without affecting the original 
content representation. 
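
A minimal sketch of the stand-off idea, assuming character 
offsets are used to reference regions of a text document: each 
annotation is a separate object that points into the content, and 
the content string itself is never modified. The Annotation class 
and its field names are hypothetical and are not drawn from the 
proposed CAS model.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Annotation:
    """A stand-off label: a type plus offsets pointing into the content."""
    type: str
    begin: int   # index of the first character of the region
    end: int     # index one past the last character of the region

    def covered_text(self, content):
        """Return the region of the content that this annotation labels."""
        return content[self.begin:self.end]

content = "IBM Research developed UIMA."

# Two overlapping interpretations of the same region, plus a label over the
# whole sentence. None of them changes `content`.
annotations = [
    Annotation("Organization", 0, 3),       # "IBM"
    Annotation("Organization", 0, 12),      # "IBM Research"
    Annotation("Sentence", 0, len(content)),
]

# Retracting an interpretation only changes the annotation list.
retracted = [a for a in annotations
             if not (a.type == "Organization" and a.end == 3)]
```

Overlapping labels such as the two Organization annotations 
would be awkward to express as tags inserted into the text, but 
they coexist naturally as separate objects.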

The object model representing the stand-off annotations may be 
used to produce different representations of the analysis 
results. A common form for capturing document metadata for 
example is as in-line XML.  An analytic in a UIM application, for 
example, can generate from the UIMA representation an in-line XML 
document that conforms to some particular domain model or markup 
language.
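
Generating in-line XML from stand-off annotations might look like 
the sketch below. It assumes annotations are given as 
(tag, begin, end) tuples of character offsets that do not 
overlap; the function name to_inline_xml and the tuple layout are 
assumptions for illustration, and overlapping annotations are 
rejected rather than resolved.

```python
from xml.sax.saxutils import escape

def to_inline_xml(content, annotations):
    """Render non-overlapping (tag, begin, end) annotations as in-line XML."""
    out = []
    position = 0
    # Process annotations from left to right by their start offset.
    for tag, begin, end in sorted(annotations, key=lambda a: a[1]):
        if begin < position:
            raise ValueError("overlapping annotations cannot be rendered here")
        out.append(escape(content[position:begin]))   # text before the region
        out.append("<%s>%s</%s>" % (tag, escape(content[begin:end]), tag))
        position = end
    out.append(escape(content[position:]))            # text after the last region
    return "".join(out)
```

The original content and the annotation tuples are left 
untouched; the in-line document is just one derived view of the 
analysis results.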

XML tag sets and other data models
-----------------------------------
The UIMA specification is NOT focused on proposing any particular 
data model that a set of analytics may use to implement their 
metadata. Domain-level tag sets, markup languages or data models 
are all orthogonal and complementary to the UIMA specification. 
For example, there are multiple efforts to produce XML standards 
around text, voice and video analysis (e.g., VoiceXML, MPEG-7, 
LegalXML). These efforts address the definition of specific 
models for annotating unstructured information that may be 
represented as UIMA Type Systems using UML for example. 

UIMA is intended to provide a standard focused on supporting the 
development, discovery and composition of analysis operations 
that may process combinations of all sorts of unstructured 
information independently of the type of content or the metadata 
model that may describe it.

UIMA focuses on how to characterize data and operations that 
describe and act on the unstructured content so that they may be 
discovered and composed to efficiently perform aggregate analysis 
tasks. The output of any analysis process on unstructured 
information may be ultimately mapped to any one of these modality 
and/or domain specific XML-based standards.

Document Structure and Operations
-----------------------------------
The UIMA specification is independent of any particular model for 
representing or manipulating some chunk of unstructured content. 
This differs from UOML
(http://www.oasis-open.org/events/symposium_2006/slides/Wang.pdf) for example, 
which proposes an abstract structure for document text and for a 
basic set of operations for manipulating that defined structure 
(e.g., get, set, insert, delete).   UIMA on the other hand 
proposes a common way for analytic developers to describe and 
exchange models of unstructured content and to describe the 
behavior of the analytics that inspect and produce metadata about 
that unstructured content. The content itself may be physically 
represented in any number of ways.
 
===========================================================
2. Optionally, a list of contributions of existing technical work 
that the proposers anticipate will be made to this TC.
===========================================================
1. IBM Research Report based on the UIMA project entitled 
"Towards an Interoperability Standard for Text and Multi-Modal 
Analytics". This report represents a straw man specification that 
embodies consideration of several high-level use cases and 
numerous projects inside and outside of IBM focused on developing 
frameworks and applications for text and multi-modal analytics. 
It highlights how the proposed ideas relate to current 
implementations and raises a number of discussion points and open 
issues around developing such a standard.

2. The UIMA Java Software Development Kit (SDK) User Manual 
http://dl.alphaworks.ibm.com/technologies/uima/UIMA_SDK_Users_Guide_Reference_2.0.pdf 

===========================================================
3. Optionally, a proposed working title and acronym for the 
specification(s) to be developed by the TC.
===========================================================

UIMA Specification

===========================================================
4. The date, time, and location of the first meeting, whether it 
will be held in person or by phone, and who will sponsor this 
first meeting. The first meeting of a TC shall occur no less than 
30 days after the announcement of its formation in the case of a 
telephone or other electronic meeting, and no less than 45 days 
after the announcement of its formation in the case of a face-to-
face meeting.
===========================================================

Date: December 6, 2006
Time: 10:00 AM Eastern
Duration: 2 Hours
Mode: Teleconference
Number: TBD
Sponsor: IBM

===========================================================
5. The projected on-going meeting schedule for the year following 
the formation of the TC, or until the projected date of the final 
deliverable, whichever comes first, and who will be expected to 
sponsor these meetings.
===========================================================

Bi-weekly 90 Minute Teleconferences sponsored by IBM.

===========================================================
6. The names, electronic mail addresses, and membership 
affiliations of at least Minimum Membership who support this 
proposal and are committed to the Charter and projected meeting 
schedule 
===========================================================

In alphabetical order by last name:

1. Sophia Ananiadou, Sophia.ananiadou@manchester.ac.uk, 
University of Manchester, UK
2. Christopher G. Chute, chute@mayo.edu, Mayo Clinic College of 
Medicine
3. Pascal Coupet, pascal.coupet@temis.com
4. Hamish Cunningham, hamish@dcs.shef.ac.uk, University of 
Sheffield
5. David Ferrucci, ferrucci@us.ibm.com, IBM
6. Thomas Hampp, thomas.hampp@de.ibm.com, IBM
7. Jonathan D. Michel, Jonathan.D.Michel@saic.com, SAIC
8. Carl Madson, carl.madson@sri.com, SRI
9. Tim Miller, Tim.Miller@thomson.com, Thomson
10. Eric Nyberg, ehn@cs.cmu.edu, Carnegie Mellon University
11. Laurent Proulx, laurent.proulx@nstein.com, Nstein Technologies
12. Alex Rankov, Rankov_Alex@emc.com, EMC
13. Antonio Sanfilippo, antonio.sanfilippo@pnl.gov
14. Junichi Tsujii, J.Tsujii@manchester.ac.uk, School of Computer 
Science, University of Manchester, UK
15. Suzi Yoakum-Stover, suzanne.yoakum-stover@saic.com, Army 
Intelligence and Information Warfare Directorate

===========================================================
7. The name of the Convener who must be an Eligible Person.
===========================================================

Convener: David A. Ferrucci