Apache UIMA and Metadata Generation

Gestione delle Informazioni su Web - 2009/2010
Tommaso Teofili
tommaso [at] apache [dot] org

Wednesday, 14 April 2010

Agenda
Unstructured information management

The ASF

Apache UIMA

Goals

Overview

Components

Usage


UIM?

Unstructured Information Management

A wide topic: text, audio, video

Different (possibly mixed) approaches
(NLP, Machine Learning, IR, Ontologies,
Automated reasoning, Knowledge Sources)

Apache UIMA


Apache Software Foundation

Non-profit corporation

“...provides organizational, legal, and financial
support for a broad range of open source
software projects...”

“...collaborative and meritocratic development
process...”

“...pragmatic Apache License...”


Apache UIMA

Architectural framework to manage
unstructured data (Java, C++)

Recently graduated to an Apache Top-Level Project

Former IBM research project donated to ASF

OASIS Standard


Apache UIMA - Goals

“Our goal is to support a thriving community
of users and developers of UIMA
frameworks, tools, and annotators, facilitating
the analysis of unstructured content such as
text, audio and video”


Apache UIMA - bridging worlds

Apache UIMA - Overview

UIMA supports the development, discovery,
composition and deployment of multi-modal
analytics for the analysis of unstructured
information and its integration with search
technologies


Apache UIMA -
Multimodal Analysis
Multimodal Analysis means the ability to
process a resource from several
“points of view”

Example: a video stream from which we want to
extract subtitles and also automatically
recognize the actors involved

Here, though, we are mainly interested in text...


Sample scenario
Content Management System containing free
text articles about movies

We want such articles to be automatically
enriched with metadata found inside the
text (movies, directors, actors/actresses,
distribution) and linked to “similar” articles
(e.g. dealing with the same movies or actors)

So that we can search for “similar” articles


Sample scenario - articles
about movies

Sample scenario

UIMA can help enrich articles with
metadata

Think of filling the instance variables of an
Article.java object with proper values

Then persisting it to a database to query
articles dealing with the same actors
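
A minimal sketch of the idea above, in plain Java with no UIMA dependency. The field names, the ArticleStore class and the "shares at least one actor" notion of similarity are assumptions made for illustration; the in-memory list only stands in for the database.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical domain object; field names are illustrative, not taken from the slides. */
class Article {
    final String title;
    final String text;
    // Metadata that the analysis pipeline is expected to fill in.
    final Set<String> movies = new LinkedHashSet<>();
    final Set<String> directors = new LinkedHashSet<>();
    final Set<String> actors = new LinkedHashSet<>();

    Article(String title, String text) {
        this.title = title;
        this.text = text;
    }

    /** Assumption: two articles are "similar" if they mention a common actor. */
    boolean sharesActorWith(Article other) {
        Set<String> common = new HashSet<>(actors);
        common.retainAll(other.actors);
        return !common.isEmpty();
    }
}

/** In-memory stand-in for the database mentioned above. */
class ArticleStore {
    private final List<Article> articles = new ArrayList<>();

    void save(Article article) {
        articles.add(article);
    }

    /** Returns every stored article, other than the given one, sharing an actor with it. */
    List<Article> findSimilar(Article article) {
        List<Article> result = new ArrayList<>();
        for (Article candidate : articles) {
            if (candidate != article && candidate.sharesActorWith(article)) {
                result.add(candidate);
            }
        }
        return result;
    }
}
```

Once the actors set has been filled by the analysis, findSimilar answers the "articles dealing with the same actors" query.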


Filling Article with metadata

Sample scenario - metadata

UIMA - Annotations and Entities


Apache UIMA -
Annotation

The association of metadata, such as a label,
with a region of text (or another type of artifact).

For example, the label “Person” associated with a
region of text “Fred Center” constitutes an
annotation. We say “Person” annotates the span
of text from X to Y containing exactly “Fred
Center”
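
The definition above can be illustrated with a small plain-Java sketch. LabeledSpan is a hypothetical stand-in, not the UIMA Annotation class; the sample sentence is invented, and the half-open [begin, end) offset convention is an assumption of this sketch.

```java
/** Simplified stand-in for an annotation: a label plus character offsets. Not the UIMA class. */
class LabeledSpan {
    final String label;
    final int begin; // offset of the first covered character
    final int end;   // offset just past the last covered character

    LabeledSpan(String label, int begin, int end) {
        this.label = label;
        this.begin = begin;
        this.end = end;
    }

    /** Returns the region of the document that this span annotates. */
    String coveredText(String document) {
        return document.substring(begin, end);
    }
}

class LabeledSpanDemo {
    /** Annotates the first occurrence of phrase with the given label, or returns null if absent. */
    static LabeledSpan annotate(String document, String label, String phrase) {
        int begin = document.indexOf(phrase);
        if (begin < 0) {
            return null;
        }
        return new LabeledSpan(label, begin, begin + phrase.length());
    }

    public static void main(String[] args) {
        String doc = "Yesterday Fred Center joined the meeting.";
        LabeledSpan person = annotate(doc, "Person", "Fred Center");
        // "Person" annotates the span from X = 10 to Y = 21
        System.out.println(person.label + " [" + person.begin + ", " + person.end + "): "
                + person.coveredText(doc));
    }
}
```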


Apache UIMA - Basic Steps

Domain model definition

Analysis pipeline definition

Arrange components:

Define components draining data from sources

Add and customize analysis components: Patterns,
Dictionaries, RegEx, External services, NLP, etc...

Define components writing information to target
storage

Analysis pipeline(s) execution


Defining domain model within
UIMA using Type Systems

The Type System is where we describe which
metadata we would like to extract

Low representational gap

Like almost everything in UIMA: described (and
generated!) using XML

Multiple Type Systems can be defined for different
purposes


Defining domain model within
UIMA using Type Systems
Define at least one Type in the Type System for each
object in the domain model

It is useful to define more fine-grained Types for the
values of type properties (called Features)

If we want to extract information about articles, we
create an Article type inside the Type System

We’ll also need to create annotations/entities for movies,
actors, directors, etc...

Types usually extend Annotation or TOP


Type System for Articles

How does UIMA extract
metadata?


Apache UIMA - Analysis
Engines

Basic UIMA building blocks

Analyze a document

Infer and record descriptive attributes
(about documents/regions)

Generating analysis results


Apache UIMA - AEs
Analysis Engines are described by a descriptor
(XML)

Can be Primitive (a single AE) or Aggregate (a
pipeline of AEs)

Analysis algorithms can be switched by changing
the descriptor instead of the code

Contain Type System definitions

Define Capabilities
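
The primitive/aggregate distinction can be sketched in plain Java as a composite. Engine, AggregateEngine and the two sample steps are hypothetical names for this illustration, not UIMA classes; in UIMA the composition is declared in the XML descriptor, whereas this sketch wires it in code only for brevity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical engine abstraction: reads the document, appends results to a shared list. */
interface Engine {
    void process(String document, List<String> results);
}

/** Aggregate: a fixed sequence of delegate engines run in order on the same document. */
class AggregateEngine implements Engine {
    private final List<Engine> delegates;

    AggregateEngine(Engine... delegates) {
        this.delegates = Arrays.asList(delegates);
    }

    public void process(String document, List<String> results) {
        for (Engine delegate : delegates) {
            delegate.process(document, results);
        }
    }
}

class PipelineDemo {
    private static final Pattern YEAR = Pattern.compile("\\b(19|20)\\d{2}\\b");

    static List<String> run(String document) {
        // Two primitive engines, each doing one small analysis.
        Engine wordCounter = (doc, out) -> out.add("words=" + doc.trim().split("\\s+").length);
        Engine yearFinder = (doc, out) -> {
            Matcher m = YEAR.matcher(doc);
            while (m.find()) {
                out.add("year=" + m.group());
            }
        };
        // The aggregate is itself an Engine, so callers need not know it is a pipeline.
        Engine pipeline = new AggregateEngine(wordCounter, yearFinder);
        List<String> results = new ArrayList<>();
        pipeline.process(document, results);
        return results;
    }
}
```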


Apache UIMA -
AnalysisComponent API

initialize : Performs (once) any startup tasks
required by this component

process : Process the resource to analyze
generating analysis results (metadata)

destroy : Frees all resources held; called only once,
when the framework has finished using this component
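
A sketch of this lifecycle in plain Java. SimpleComponent is a deliberately simplified stand-in and NOT the real UIMA AnalysisComponent interface, whose methods take framework-specific arguments; CapitalizedWordFinder and LifecycleDemo are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Simplified stand-in for the three lifecycle methods; not the UIMA interface. */
interface SimpleComponent {
    void initialize();
    void process(String document, List<String> results);
    void destroy();
}

/** Hypothetical component that records every capitalized word as a result. */
class CapitalizedWordFinder implements SimpleComponent {
    private Pattern pattern; // resource built once at startup

    public void initialize() {
        pattern = Pattern.compile("\\b[A-Z][a-z]+\\b");
    }

    public void process(String document, List<String> results) {
        Matcher m = pattern.matcher(document);
        while (m.find()) {
            results.add(m.group());
        }
    }

    public void destroy() {
        pattern = null; // release what initialize() acquired
    }
}

class LifecycleDemo {
    /** Plays the role of the framework: drives one component over many documents. */
    static List<String> runAll(SimpleComponent component, List<String> documents) {
        List<String> results = new ArrayList<>();
        component.initialize();                   // once, before any document
        try {
            for (String document : documents) {
                component.process(document, results); // once per document
            }
        } finally {
            component.destroy();                  // once, after the last document
        }
        return results;
    }
}
```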


Apache UIMA -
Annotators
Analysis Engine algorithm

Annotator : A software component
implemented to produce and record
annotations over regions of an artifact
(e.g., text document, audio, and video)

Annotators implement the AnalysisComponent
interface


Apache UIMA - Roles
AnalysisEngine : High level block responsible
for analysis - contains at least one
AnalysisComponent

AnalysisComponent : interface for any
component responsible for analyzing artifacts

Annotator : implementation of
AnalysisComponent responsible for creating
Annotations


Apache UIMA - AEs


Analysis Engines in a
Pipeline

Apache UIMA - Analysis Results

Where do analysis results end up?

How do annotators represent and share their
results?

CAS - Common Analysis Structure

Maintains typed indexes of extracted results
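
The idea of a shared structure with typed indexes can be sketched as a toy class. ToyCas is a hypothetical illustration and not the UIMA CAS API: it only holds the document text plus one list of spans per type name, so that one annotator can read what another has written.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy stand-in for a shared analysis structure; not the UIMA CAS API. */
class ToyCas {
    /** One extracted result: a type name plus the offsets it covers. */
    static final class Entry {
        final String type;
        final int begin;
        final int end;

        Entry(String type, int begin, int end) {
            this.type = type;
            this.begin = begin;
            this.end = end;
        }
    }

    private final String documentText;
    // The "typed index": results are grouped by their type name.
    private final Map<String, List<Entry>> indexByType = new HashMap<>();

    ToyCas(String documentText) {
        this.documentText = documentText;
    }

    String getDocumentText() {
        return documentText;
    }

    /** Records a result so that later components can find it by type. */
    void add(String type, int begin, int end) {
        indexByType.computeIfAbsent(type, k -> new ArrayList<>()).add(new Entry(type, begin, end));
    }

    /** Returns all results of the given type, in insertion order (empty if none). */
    List<Entry> select(String type) {
        return Collections.unmodifiableList(
                indexByType.getOrDefault(type, Collections.emptyList()));
    }

    String coveredText(Entry entry) {
        return documentText.substring(entry.begin, entry.end);
    }
}
```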


Common Analysis Structure

Which algorithms lie
beneath AEs?


Apache UIMA & NLP
NLP (Natural Language Processing) is a
theoretically motivated range of
computational techniques for analyzing and
representing naturally occurring texts at one
or more levels of linguistic analysis for the
purpose of achieving human-like language
processing for a range of tasks or
applications

It’s an AI discipline


Apache UIMA & NLP
“accomplish human-like language processing”

Paraphrase an input text

Translate the text into another language

Answer questions about the contents of
the text

Draw inferences from the text <--


Apache UIMA & NLP

“an NLP-based IR system has the goal of
providing more precise, complete information
in response to a user’s real information
need”

various levels of processing

that’s where we are!


Apache UIMA - First
Approaches

Simplest : Write RegEx and Dictionaries and
mix them together

NLP-like : Tokenize -> Sentence identification
-> PoS Tagging -> Custom (Domain specific)
structures


Sample scenario -
extract actors
Tokenize article text

Identify sentences

Tag PoS

Identify Persons using regular expressions and PoS

Use Person annotations, Tokens’ PoS and Sentences
to extract relations between terms to identify
Persons who are also Actors


Sample scenario -
PersonAnnotator
I have a dictionary of names (simple to find and/or build)

I use a DictionaryAnnotator to extract NameAnnotations

I don’t have a dictionary of surnames

Every time a matching name (a NameAnnotation) is found, we
look for one or more subsequent tokens (to cover persons with
a double name or surname) whose PoS is “undefined” or a
noun (but not a verb) and that start with an uppercase letter

If found, the name + token(s) sequence annotates a
Person (e.g. “Michael J. Fox”)
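
A rough plain-Java sketch of this heuristic, under loud assumptions: there is no PoS tagger here, so the "undefined or noun" check is replaced by the uppercase test alone plus a stop at punctuation; tokens are split on whitespace; PersonFinder and its first-name dictionary are hypothetical and unrelated to the DictionaryAnnotator implementation.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

class PersonFinder {
    private final Set<String> firstNames; // hypothetical dictionary of names

    PersonFinder(Set<String> firstNames) {
        this.firstNames = firstNames;
    }

    /** Returns "name + following capitalized token(s)" sequences found in the text. */
    List<String> findPersons(String text) {
        String[] tokens = text.trim().split("\\s+");
        List<String> persons = new ArrayList<>();
        int i = 0;
        while (i < tokens.length) {
            if (!firstNames.contains(clean(tokens[i])) || endsPhrase(tokens[i])) {
                i++;
                continue;
            }
            StringBuilder person = new StringBuilder(clean(tokens[i]));
            int extra = 0;
            int j = i + 1;
            // Extend over subsequent tokens that start with an uppercase letter.
            while (j < tokens.length && startsUppercase(tokens[j])) {
                person.append(' ').append(clean(tokens[j]));
                extra++;
                j++;
                if (endsPhrase(tokens[j - 1])) {
                    break; // do not run past "Fox." or "Fox," into the next phrase
                }
            }
            if (extra > 0) {
                persons.add(person.toString()); // a name alone is not enough
            }
            i = j;
        }
        return persons;
    }

    private static boolean startsUppercase(String token) {
        return !token.isEmpty() && Character.isUpperCase(token.charAt(0));
    }

    /** Strips surrounding punctuation but keeps initials such as "J." intact. */
    private static String clean(String token) {
        String t = token.replaceAll("^[^\\p{L}]+", "");
        if (t.matches("\\p{Lu}\\.")) {
            return t;
        }
        return t.replaceAll("[^\\p{L}]+$", "");
    }

    /** True if the token carries trailing punctuation (initials excepted). */
    private static boolean endsPhrase(String token) {
        String t = token.replaceAll("^[^\\p{L}]+", "");
        if (t.matches("\\p{Lu}\\.")) {
            return false;
        }
        return t.matches(".*[.,;:!?)]$");
    }

    public static void main(String[] args) {
        PersonFinder finder = new PersonFinder(Collections.singleton("Michael"));
        // prints [Michael J. Fox]
        System.out.println(finder.findPersons("The movie stars Michael J. Fox as a teenager."));
    }
}
```

Without PoS information this version accepts any capitalized token after a name, so it is more exposed to false positives than the approach described on the slide.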


PersonAnnotator sample

Sample scenario
Getting actors can be simple if we know that
Persons who are also actors perform some well-known
actions

e.g.: a Person “stars as” CharacterInTheMovie (which
will eventually be tagged as a Person too) when that
Person is also an Actor

e.g.: if the snippet “CharacterInTheMovie (Person)”
exists, then Person is usually an Actor

Then we can build an ActorAnnotator
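
The two cues above can be sketched with regular expressions in plain Java. ActorFinder is a hypothetical name, and the sketch assumes the Persons have already been extracted and are passed in as plain strings; it works on raw text rather than on annotations.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

class ActorFinder {
    /** Returns the persons that appear in one of the two actor-like contexts. */
    static Set<String> findActors(String text, List<String> persons) {
        Set<String> actors = new LinkedHashSet<>();
        for (String person : persons) {
            String p = Pattern.quote(person);

            // Cue 1: "<Person> stars as ..."
            boolean starsAs = Pattern.compile(p + "\\s+stars\\s+as\\b").matcher(text).find();

            // Cue 2: "<Character> (<Person>)", where the character is another known Person
            boolean inParentheses = false;
            for (String character : persons) {
                if (character.equals(person)) {
                    continue;
                }
                String cue = Pattern.quote(character) + "\\s*\\(\\s*" + p + "\\s*\\)";
                if (Pattern.compile(cue).matcher(text).find()) {
                    inParentheses = true;
                    break;
                }
            }

            if (starsAs || inParentheses) {
                actors.add(person);
            }
        }
        return actors;
    }
}
```

For example, given the text "Michael J. Fox stars as Marty McFly. Doc Brown (Christopher Lloyd) builds the time machine." and all four names as Persons, only "Michael J. Fox" and "Christopher Lloyd" are returned as actors.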


Sample scenario

Apache UIMA
experience
Examples are available under SVN at

http://svn.apache.org/repos/asf/uima/uimaj/trunk/uimaj-examples/

The getting started guides are also a very useful
first contact with UIMA

http://uima.apache.org/documentation.html#getting_started

Subscribe to the users@ and dev@uima.apache.org mailing lists


Apache UIMA - Components

Type Systems

Analysis Engines

CAS

Collection Processing Manager/Engine

Flow Controllers

CAS Consumers

Asynchronous Scaleout

Sandbox Components

Eclipse Plugins

Tools


Apache UIMA - Flow Controllers

A component which implements the
interfaces needed to specify a custom flow
within an Aggregate Analysis Engine

Enabling conditional pipelines
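
A conditional pipeline can be sketched in plain Java as steps guarded by conditions on the results produced so far. ConditionalPipeline, Step and the toy language check are hypothetical and do not reflect the UIMA flow controller interfaces.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

class ConditionalPipeline {
    /** Hypothetical analysis step: reads the document, appends to the shared results. */
    interface Step {
        void process(String document, List<String> results);
    }

    private final List<Predicate<List<String>>> conditions = new ArrayList<>();
    private final List<Step> steps = new ArrayList<>();

    /** Registers a step that runs only if the condition holds on the results so far. */
    ConditionalPipeline add(Predicate<List<String>> condition, Step step) {
        conditions.add(condition);
        steps.add(step);
        return this;
    }

    List<String> run(String document) {
        List<String> results = new ArrayList<>();
        for (int i = 0; i < steps.size(); i++) {
            if (conditions.get(i).test(results)) {
                steps.get(i).process(document, results);
            }
        }
        return results;
    }
}

class ConditionalPipelineDemo {
    static ConditionalPipeline build() {
        return new ConditionalPipeline()
                // Always runs: a toy language guess (assumption: " the " means English).
                .add(found -> true,
                        (doc, found) -> found.add(doc.contains(" the ") ? "lang=en" : "lang=unknown"))
                // Runs only when the previous step detected English.
                .add(found -> found.contains("lang=en"),
                        (doc, found) -> found.add("english-analysis-done"));
    }
}
```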


Apache UIMA - CAS Consumers

Components responsible for taking the
results from the CAS and storing them into a
database, or other storage device


Apache UIMA - Collection Processing
and a bigger picture


Apache UIMA -
Asynchronous Scaleout
Add-on to the base Java framework,
supporting a very flexible scaleout capability
based on JMS (Java Message Service) and
Apache ActiveMQ (a messaging and integration
patterns provider)

A powerful clustering solution, very useful
when the size of the source documents is huge


Apache UIMA - Sandbox Basics

Tokenizer

HMM Tagger

Dictionaries (DictionaryAnnotator,
ConceptMapper)

Snowball

ConfigurableFeatureExtractor


Apache UIMA - External Services

External IE engines exposing web services
can be integrated easily into UIMA:

AlchemyAPI Annotator

OpenCalais Annotator


Apache UIMA - Tika

Apache Tika is a toolkit for detecting and
extracting metadata and structured text
content from various documents using
existing parser libraries. The TikaAnnotator
uses Tika to generate annotations
representing the original markup of a
document, extract its text and metadata


Apache UIMA - Lucas

Very useful to build search engines!

stores CAS data in Lucene indexes

transforms annotation objects of a CAS
into Lucene token streams which are
stored in a Lucene document


Apache UIMA - Tools

JCasGen

PEAR Installer, Merger, Packager

Component Descriptor Editor

CPE Configurator

Java Annotation Viewer

CAS Visual Debugger

Document Analyzer


Apache UIMA
We can aggregate existing components or
write and deploy our new ones

There are lots of repositories for UIMA
containing open source analysis engines, type
systems, etc...

We do, however, need to know our
domain well enough

Please mind the “false positives” issue


References
http://www.apache.org

http://uima.apache.org

http://www.oasis-open.org

http://www.cnlp.org/publications/03NLP.LIS.Encyclopedia.pdf

http://nlp.stanford.edu/

http://www.opencalais.com/gnosis/

http://www.dsi.unive.it/~marin/docs/hmm-it.pdf

http://en.wikipedia.org/wiki/Hidden_Markov_model

