Slides giving an overview of Apache UIMA and how it can be used for metadata generation, presented in the "Information Management on the Web" course at the DIA (Computer Science Department) of Roma Tre University
1. Apache UIMA and
Metadata Generation
Gestione delle Informazioni su Web (Information Management on the Web) - 2009/2010
Tommaso Teofili
tommaso [at] apache [dot] org
Wednesday, 14 April 2010
2. Agenda
Unstructured information management
The ASF
Apache UIMA
Goals
Overview
Components
Usage
3. UIM?
Unstructured Information Management
A wide topic: text, audio, video
Different (possibly mixed) approaches
(NLP, Machine Learning, IR, Ontologies,
Automated reasoning, Knowledge Sources)
Apache UIMA
4. Apache Software Foundation
Non-profit corporation
“...provides organizational, legal, and financial
support for a broad range of open source
software projects...”
“...collaborative and meritocratic development
process...”
“...pragmatic Apache License...”
5. Apache UIMA
Architectural framework to manage
unstructured data (Java, C++)
Just graduated as an Apache Top Level Project
Former IBM research project donated to the ASF
OASIS Standard
6. Apache UIMA - Goals
“Our goal is to support a thriving community
of users and developers of UIMA
frameworks, tools, and annotators, facilitating
the analysis of unstructured content such as
text, audio and video”
7. Apache UIMA - bridging worlds
8. Apache UIMA - Overview
UIMA supports the development, discovery,
composition and deployment of multi-modal
analytics for the analysis of unstructured
information and its integration with search
technologies
9. Apache UIMA -
Multimodal Analysis
Multimodal analysis means the ability to
process a resource from various
“points of view”
Example: a video stream from which we want to
extract subtitles and also automatically
recognize the actors involved
Here, though, we are mainly interested in text
10. Sample scenario
Content Management System containing free
text articles about movies
We want such articles to be automatically
enriched with metadata contained inside the
text (movies, directors, actors/actresses,
distribution) and linked to “similar” articles
(i.e., dealing with the same movies or actors)
So that we can search for “similar” articles
12. Sample scenario
UIMA can help enrich articles with
metadata
Think of filling the instance variables of an
Article.java object with proper values
Then persist it to a database so we can query
for articles dealing with the same actors
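A minimal sketch of what such an Article class might look like. The field names (movies, directors, actors) and the isSimilarTo helper are assumptions derived from the metadata listed in the scenario; they are not taken from the course material.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical domain object; field names are illustrative assumptions.
class Article {
    private final String title;
    private final String text;
    // Metadata that the analysis pipeline is expected to fill in.
    private final Set<String> movies = new LinkedHashSet<>();
    private final Set<String> directors = new LinkedHashSet<>();
    private final Set<String> actors = new LinkedHashSet<>();

    Article(String title, String text) {
        this.title = title;
        this.text = text;
    }

    String getTitle() { return title; }
    String getText() { return text; }
    Set<String> getMovies() { return movies; }
    Set<String> getDirectors() { return directors; }
    Set<String> getActors() { return actors; }

    // Two articles count as "similar" here if they share an actor or a movie.
    boolean isSimilarTo(Article other) {
        return shares(actors, other.actors) || shares(movies, other.movies);
    }

    private static boolean shares(Set<String> a, Set<String> b) {
        for (String s : a) {
            if (b.contains(s)) {
                return true;
            }
        }
        return false;
    }
}
```

In a real application the similarity check would more likely be a database query over the persisted metadata; the in-memory comparison only shows what the extracted values make possible.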
16. Apache UIMA -
Annotation
The association of metadata, such as a label,
with a region of text (or other type of artifact).
For example, the label “Person” associated with a
region of text “Fred Center” constitutes an
annotation. We say “Person” annotates the span
of text from X to Y containing exactly “Fred
Center”
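The definition above can be sketched in a few lines of Java. SpanAnnotation is a simplified stand-in written for this example, not the UIMA Annotation class (which similarly exposes begin and end offsets and the covered text); the annotate helper is a hypothetical convenience.

```java
// Simplified stand-in for an annotation: a label attached to a [begin, end) span.
// Illustration only; this is not the UIMA Annotation class.
class SpanAnnotation {
    final String label;
    final int begin; // offset of the first covered character
    final int end;   // offset just past the last covered character

    SpanAnnotation(String label, int begin, int end) {
        if (begin < 0 || end < begin) {
            throw new IllegalArgumentException("invalid span");
        }
        this.label = label;
        this.begin = begin;
        this.end = end;
    }

    // Returns the region of the document this annotation covers.
    String coveredText(String document) {
        return document.substring(begin, end);
    }

    // Hypothetical helper: annotates the first occurrence of a phrase, or returns null.
    static SpanAnnotation annotate(String document, String label, String phrase) {
        int begin = document.indexOf(phrase);
        if (begin < 0) {
            return null;
        }
        return new SpanAnnotation(label, begin, begin + phrase.length());
    }
}
```

For the document "Yesterday Fred Center spoke.", annotating "Fred Center" as "Person" yields the span from X = 10 to Y = 21.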
17. Apache UIMA - Basic Steps
Domain model definition
Analysis pipeline definition
Arrange components:
Define components reading data from sources
Add and customize analysis components: Patterns,
Dictionaries, RegEx, External services, NLP, etc.
Define components writing information to target
storage
Analysis pipeline(s) execution
18. Defining domain model within
UIMA using Type Systems
Type System is the place where we describe which
metadata we would like to extract
Low representational gap
Like almost everything in UIMA: described (and
generated!) using XML
Possible to define multiple Type Systems for different
purposes
19. Defining domain model within
UIMA using Type Systems
Define at least one Type in the Type System for each
object in the domain model
Useful to define more fine-grained Types (for the values of
type properties, called Features)
If we want to extract information about articles we
create an Article type inside the Type System
We will also need to create annotations/entities for movies,
actors, directors, etc.
Types usually extend Annotation or TOP
21. How does UIMA extract
metadata?
22. Apache UIMA - Analysis
Engines
Basic UIMA building blocks
Analyze a document
Infer and record descriptive attributes
(about documents/regions)
Generating analysis results
23. Apache UIMA - AEs
Analysis Engines are described by a descriptor
(XML)
Can be Primitive (a single AE) or Aggregate (a
pipeline of AEs)
Analysis algorithms can be switched by changing
the descriptor instead of the code
Contain Type System definitions
Define Capabilities
24. Apache UIMA -
AnalysisComponent API
initialize : Performs (once) any startup tasks
required by this component
process : Process the resource to analyze
generating analysis results (metadata)
destroy : Frees all resources held; called only once,
when the framework has finished using this component
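The lifecycle can be sketched with a simplified stand-in interface. SimpleComponent, TokenCounter and LifecycleRunner are hypothetical names made up for this example; the real AnalysisComponent methods receive UIMA context and CAS objects, which are replaced here by a plain String and a result list.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-in for the three lifecycle methods.
interface SimpleComponent {
    void initialize();
    void process(String document, List<String> results);
    void destroy();
}

// Toy component: records the number of whitespace-separated tokens per document.
class TokenCounter implements SimpleComponent {
    final List<String> calls = new ArrayList<>();

    public void initialize() {
        calls.add("initialize");
    }

    public void process(String document, List<String> results) {
        calls.add("process");
        String trimmed = document.trim();
        int count = trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
        results.add("tokens=" + count);
    }

    public void destroy() {
        calls.add("destroy");
    }
}

class LifecycleRunner {
    // Calls initialize once, process once per document, and destroy once at the end.
    static List<String> run(SimpleComponent component, List<String> documents) {
        List<String> results = new ArrayList<>();
        component.initialize();
        try {
            for (String document : documents) {
                component.process(document, results);
            }
        } finally {
            component.destroy(); // resources are released even if processing fails
        }
        return results;
    }
}
```

The point of the split is that expensive setup (loading dictionaries or models) happens once in initialize, while process stays cheap enough to run on every document.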
25. Apache UIMA -
Annotators
Analysis Engine algorithm
Annotator : A software component
implemented to produce and record
annotations over regions of an artifact
(e.g., text document, audio, and video)
Annotators implement the AnalysisComponent
interface
26. Apache UIMA - Roles
AnalysisEngine : High level block responsible
for analysis - contains at least one
AnalysisComponent
AnalysisComponent : interface for any
component responsible for analyzing artifacts
Annotator : implementation of
AnalysisComponent responsible for creating
Annotations
29. Apache UIMA - Analysis Results
Where do analysis results end up?
How do annotators represent and share their
results?
CAS - Common Analysis Structure
Maintains typed indexes of extracted results
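The idea of a shared structure with typed indexes can be illustrated with a toy class. ToyCas and its methods are hypothetical and far simpler than the real CAS; the sketch only shows one component writing typed results that another component can later read back by type.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy stand-in for a CAS: the document text plus results indexed by type name.
class ToyCas {
    private static final class Span {
        final int begin;
        final int end;

        Span(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }
    }

    private final String documentText;
    private final Map<String, List<Span>> indexByType = new HashMap<>();

    ToyCas(String documentText) {
        this.documentText = documentText;
    }

    String getDocumentText() {
        return documentText;
    }

    // Called by an annotator to record a result of the given type.
    void addAnnotation(String type, int begin, int end) {
        if (begin < 0 || end < begin || end > documentText.length()) {
            throw new IllegalArgumentException("span out of range");
        }
        indexByType.computeIfAbsent(type, k -> new ArrayList<>()).add(new Span(begin, end));
    }

    // Called by a later annotator (or a consumer) to read all results of one type.
    List<String> coveredTexts(String type) {
        List<String> texts = new ArrayList<>();
        for (Span span : indexByType.getOrDefault(type, Collections.<Span>emptyList())) {
            texts.add(documentText.substring(span.begin, span.end));
        }
        return texts;
    }
}
```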
32. Apache UIMA & NLP
NLP (Natural Language Processing) is a
theoretically motivated range of
computational techniques for analyzing and
representing naturally occurring texts at one
or more levels of linguistic analysis for the
purpose of achieving human-like language
processing for a range of tasks or
applications
It’s an AI discipline
33. Apache UIMA & NLP
“accomplish human-like language processing”
Paraphrase an input text
Translate the text into another language
Answer questions about the contents of
the text
Draw inferences from the text <--
34. Apache UIMA & NLP
“an NLP-based IR system has the goal of
providing more precise, complete information
in response to a user’s real information
need”
various levels of processing
that’s where we are!
35. Apache UIMA - First
Approaches
Simplest : Write RegEx and Dictionaries and
mix them together
NLP-like : Tokenize -> Sentence identification
-> PoS Tagging -> Custom (Domain specific)
structures
37. Sample scenario -
extract actors
Tokenize article text
Identify sentences
Tag PoS
Identify Persons using regular expressions and PoS
Use Person annotations, Tokens’ PoS and Sentences
to extract relations between terms to identify
Persons who are also Actors
38. Sample scenario -
PersonAnnotator
I have a dictionary of names (simple to find and/or build)
I use a DictionaryAnnotator to extract NameAnnotations
I don’t have a dictionary of surnames
Every time a matching name (a NameAnnotation) is found we
look for one or more (considering persons with a double name
or surname) subsequent tokens whose PoS is “undefined” or a
noun (but not a verb) and which start with an uppercase letter
If found, the name + token(s) sequence annotates a
Person (e.g. “Michael J. Fox”)
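A rough, self-contained sketch of this heuristic follows. It is not the PersonAnnotator from the course: the three-entry name dictionary is made up, whitespace splitting replaces a real tokenizer, and the PoS check is omitted entirely, so only the "starts with an uppercase letter" condition is applied.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class PersonFinder {
    // Hypothetical tiny dictionary of first names (assumption for this example).
    private static final Set<String> NAMES =
            new HashSet<>(Arrays.asList("Michael", "Christopher", "Lea"));

    // Returns name + following capitalized token(s) sequences found in the text.
    static List<String> findPersons(String text) {
        String[] tokens = text.split("\\s+");
        List<String> persons = new ArrayList<>();
        int i = 0;
        while (i < tokens.length) {
            if (!NAMES.contains(strip(tokens[i]))) {
                i++;
                continue;
            }
            StringBuilder person = new StringBuilder(strip(tokens[i]));
            // Punctuation such as "," closes the candidate name.
            boolean clauseEnded = endsClause(tokens[i]);
            int j = i + 1;
            while (!clauseEnded && j < tokens.length && startsUppercase(tokens[j])) {
                person.append(' ').append(strip(tokens[j]));
                clauseEnded = endsClause(tokens[j]);
                j++;
            }
            if (j > i + 1) { // at least one token followed the dictionary name
                persons.add(person.toString());
            }
            i = j;
        }
        return persons;
    }

    private static boolean startsUppercase(String token) {
        return !token.isEmpty() && Character.isUpperCase(token.charAt(0));
    }

    // True if the token ends with clause punctuation; initials like "J." do not count.
    private static boolean endsClause(String token) {
        if (token.isEmpty()) {
            return false;
        }
        char last = token.charAt(token.length() - 1);
        if (last == '.') {
            return token.length() > 2;
        }
        return ",;:!?".indexOf(last) >= 0;
    }

    // Removes one trailing punctuation character, if the token ends a clause.
    private static String strip(String token) {
        return endsClause(token) ? token.substring(0, token.length() - 1) : token;
    }
}
```

Without the PoS filter this version would also accept a capitalized verb or a sentence-initial word right after a name, which is exactly the kind of false positive the PoS condition is meant to reduce.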
41. Sample scenario
Getting actors can be simple if we know that
Persons who are also actors perform some well-known
actions
e.g.: a Person “stars as” CharacterInTheMovie (which
will eventually be tagged as a Person too) when that
Person is also an Actor
e.g.: if the snippet “CharacterInTheMovie (Person)”
exists, then Person is usually an Actor
then we can build an ActorAnnotator
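The two rules can be sketched with regular expressions over the article text, given the Persons found in the previous step. ActorFinder is a hypothetical helper, not the actual ActorAnnotator: it works on plain strings rather than on Person annotations, and it does not verify that the text before a parenthesis is really a character name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class ActorFinder {
    // Returns the persons that match one of the two "actor" rules in the text.
    static List<String> findActors(String text, List<String> persons) {
        List<String> actors = new ArrayList<>();
        for (String person : persons) {
            String quoted = Pattern.quote(person);
            // Rule 1: Person "stars as" followed by a capitalized word.
            Pattern starsAs = Pattern.compile(quoted + "\\s+stars\\s+as\\s+\\p{Lu}");
            // Rule 2: the snippet "Something (Person)".
            Pattern inParentheses = Pattern.compile("\\(\\s*" + quoted + "\\s*\\)");
            if (starsAs.matcher(text).find() || inParentheses.matcher(text).find()) {
                actors.add(person);
            }
        }
        return actors;
    }
}
```

The parenthesis rule is deliberately loose: any Person mentioned in parentheses is taken as an Actor, so it trades precision for simplicity.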
43. Apache UIMA
experience
There are some examples under SVN at
http://svn.apache.org/repos/asf/uima/uimaj/trunk/uimaj-examples/
The getting started guides are also a very useful
way to begin working with UIMA:
http://uima.apache.org/documentation.html#getting_started
Subscribe to the users@ and dev@uima.apache.org mailing lists
44. Apache UIMA - Components
Type Systems
Analysis Engines
CAS
Collection Processing Manager/Engine
Flow Controllers
CAS Consumers
Asynchronous Scaleout
Sandbox Components
Eclipse Plugins
Tools
45. Apache UIMA - Flow Controllers
A component which implements the
interfaces needed to specify a custom flow
within an Aggregate Analysis Engine
Enabling conditional pipelines
46. Apache UIMA - CAS Consumers
Components responsible for taking the
results from the CAS and storing them into a
database, or other storage device
47. Apache UIMA - Collection Processing
and a bigger picture
48. Apache UIMA -
Asynchronous Scaleout
Add-on to the base Java framework,
supporting a very flexible scaleout capability
based on JMS (Java Message Service) and
Apache ActiveMQ (a messaging and integration
patterns provider)
A powerful clustering solution, very useful
when the size of the source documents is huge
51. Apache UIMA - Tika
Apache Tika is a toolkit for detecting and
extracting metadata and structured text
content from various documents using
existing parser libraries. The TikaAnnotator
uses Tika to generate annotations
representing the original markup of a
document, extract its text and metadata
52. Apache UIMA - Lucas
Very useful to build search engines!
stores CAS data in Lucene indexes
transforms annotation objects of a CAS
into Lucene token streams which are
stored in a Lucene document
54. Apache UIMA
We can aggregate existing components or
write and deploy our own new ones
There are lots of repositories for UIMA
containing open source analysis engines, type
systems, etc.
We do, though, have to know our domain
well enough
Please mind the “false positives” issue