Apache UIMA and Metadata Generation
                             Gestione delle Informazioni su Web - 2009/2010
                                              Tommaso Teofili
                                      tommaso [at] apache [dot] org




Wednesday, 14 April 2010
Agenda
                           Unstructured information management

                           The ASF

                           Apache UIMA

                             Goals

                             Overview

                             Components

                             Usage


UIM?

                           Unstructured Information Management

                           A wide topic: text, audio, video

                             Different (possibly mixed) approaches
                             (NLP, Machine Learning, IR, Ontologies,
                             Automated reasoning, Knowledge Sources)

                           Apache UIMA



Apache Software Foundation

                           Non-profit corporation

                           “...provides organizational, legal, and financial
                           support for a broad range of open source
                           software projects...”

                           “...collaborative and meritocratic development
                           process...”

                           “...pragmatic Apache License...”


Apache UIMA

                           Architectural framework to manage
                           unstructured data (Java, C++)

                           Just graduated as an Apache Top-Level Project

                           Former IBM research project donated to ASF

                           OASIS Standard



Apache UIMA - Goals


                           “Our goal is to support a thriving community
                           of users and developers of UIMA
                           frameworks, tools, and annotators, facilitating
                           the analysis of unstructured content such as
                           text, audio and video”




Apache UIMA - bridging worlds

Apache UIMA - Overview


                           UIMA supports the development, discovery,
                           composition and deployment of multi-modal
                           analytics for the analysis of unstructured
                           information and its integration with search
                           technologies




Apache UIMA - Multimodal Analysis
                           Multimodal analysis is the ability to process
                           a resource from various “points of view”

                           Example: a video stream from which we want
                           to extract subtitles and also automatically
                           recognize the actors involved

                           Here, though, we are mainly interested in text



Sample scenario
                           Content Management System containing free
                           text articles about movies

                           We want such articles to be automatically
                           enriched with metadata contained inside the
                           text (movies, directors, actors/actresses,
                           distribution) and linked to “similar” articles
                           (i.e., dealing with the same movies or actors)

                           So that we can search for “similar” articles


Sample scenario - articles about movies

Sample scenario

                           UIMA can help enrich articles with
                           metadata

                           Think of filling the instance variables of an
                           Article.java object with proper values

                           Then persisting it to a database so we can
                           query for articles dealing with the same actors



Filling Article with metadata

Sample scenario - metadata

UIMA - Annotations and Entities

Apache UIMA - Annotation

                           The association of metadata, such as a label,
                           with a region of text (or another type of artifact).

                           For example, the label “Person” associated with the
                           region of text “Fred Center” constitutes an
                           annotation. We say “Person” annotates the span
                           of text from X to Y containing exactly “Fred
                           Center”
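
The definition above can be sketched in plain Java. This is only an illustration, not the UIMA API: the `Span` record, its field names and the sample sentence are assumptions made for the example (records need Java 16 or later). The offsets X and Y become `begin` and `end`.

```java
// Simplified stand-in for an annotation: a label attached to a text span.
// Not the UIMA API; all names are illustrative.
record Span(String label, int begin, int end) {
    // Returns the region of the document that this span covers.
    String coveredText(String document) {
        return document.substring(begin, end);
    }
}

class AnnotationDemo {
    // Builds a "Person" span over the first occurrence of the given name.
    static Span annotatePerson(String document, String name) {
        int begin = document.indexOf(name);
        if (begin < 0) {
            throw new IllegalArgumentException("name not found: " + name);
        }
        return new Span("Person", begin, begin + name.length());
    }

    public static void main(String[] args) {
        String document = "Yesterday Fred Center joined the cast.";
        Span person = annotatePerson(document, "Fred Center");
        System.out.println(person + " covers \"" + person.coveredText(document) + "\"");
    }
}
```

The label and the offsets are stored separately from the text, so several annotations can overlap on the same region without modifying the document.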




Apache UIMA - Basic Steps

                           Domain model definition

                           Analysis pipeline definition

                           Arrange components:

                               Define components that read data from sources

                               Add and customize analysis components: Patterns,
                               Dictionaries, RegEx, External services, NLP, etc.

                               Define components that write information to target
                               storage

                           Analysis pipeline(s) execution
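
The steps above could be arranged roughly as follows. This is a plain-Java sketch under stated assumptions, not UIMA code: a list stands in for the source, a map stands in for the target storage, and `AnalysisStep` is a hypothetical interface invented for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;

// Illustrative only: plain-Java stand-ins for the three kinds of components
// (source reader, analysis steps, output writer). None of this is UIMA API.
class BasicStepsDemo {
    // An analysis step reads the document text and appends metadata labels.
    interface AnalysisStep extends BiConsumer<String, List<String>> {}

    // Runs every document through all steps and hands the results to the sink.
    static void run(List<String> source,
                    List<AnalysisStep> steps,
                    Map<String, List<String>> sink) {
        for (String document : source) {
            List<String> metadata = new ArrayList<>();
            for (AnalysisStep step : steps) {
                step.accept(document, metadata);
            }
            sink.put(document, metadata);
        }
    }

    public static void main(String[] args) {
        List<String> source = List.of("A review of Alien", "Notes on cooking");
        // A tiny "dictionary" step and a trivial second step.
        AnalysisStep movieDictionary = (text, out) -> {
            if (text.contains("Alien")) out.add("Movie:Alien");
        };
        AnalysisStep lengthTagger = (text, out) -> out.add("Length:" + text.length());
        Map<String, List<String>> sink = new LinkedHashMap<>();
        run(source, List.of(movieDictionary, lengthTagger), sink);
        sink.forEach((doc, meta) -> System.out.println(doc + " -> " + meta));
    }
}
```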

Defining domain model within UIMA using Type Systems

                           The Type System is where we describe the
                           metadata we want to extract

                           Low representational gap

                           Like almost everything in UIMA: described (and
                           generated!) using XML

                           Multiple Type Systems can be defined for different
                           purposes




Defining domain model within UIMA using Type Systems
                           Define at least one Type in the Type System for each
                           object in the domain model

                           It is useful to define finer-grained Types (for the values
                           of type properties, called Features)

                           If we want to extract information about articles, we
                           create an Article type in the Type System

                           We’ll also need to create annotations/entities for movies,
                           actors, directors, etc.

                           Types usually extend Annotation or TOP
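
To make the idea of Types and Features concrete, here is a hand-written plain-Java sketch. It is an assumption-laden illustration: in UIMA the types are declared in an XML descriptor rather than written by hand, and the class and field names below (`PersonMention`, `Article`, `movieTitle`) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hand-written stand-ins for types a Type System might declare.
// In UIMA these would be declared in XML; the names here are hypothetical.

// A type anchored to a region of text (compare: a type extending Annotation).
class PersonMention {
    final int begin;
    final int end;
    final String name;

    PersonMention(int begin, int end, String name) {
        this.begin = begin;
        this.end = end;
        this.name = name;
    }
}

// A type describing the whole document rather than a region (compare: TOP).
class Article {
    String movieTitle;                                        // feature with a simple value
    final List<PersonMention> actors = new ArrayList<>();     // feature whose values are a finer-grained type
    final List<PersonMention> directors = new ArrayList<>();
}

class TypeSystemDemo {
    // Creates a mention covering the first occurrence of name in text.
    static PersonMention mention(String text, String name) {
        int begin = text.indexOf(name);
        if (begin < 0) {
            throw new IllegalArgumentException("name not found: " + name);
        }
        return new PersonMention(begin, begin + name.length(), name);
    }

    public static void main(String[] args) {
        String text = "Alien, directed by Ridley Scott, stars Sigourney Weaver.";
        Article article = new Article();
        article.movieTitle = "Alien";
        article.directors.add(mention(text, "Ridley Scott"));
        article.actors.add(mention(text, "Sigourney Weaver"));
        System.out.println(article.movieTitle + ": " + article.actors.get(0).name);
    }
}
```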


Type System for Articles

How does UIMA extract metadata?



Apache UIMA - Analysis Engines

                           Basic UIMA building blocks

                           Analyze a document

                             Infer and record descriptive attributes
                             (about documents/regions)

                           Generate analysis results




Apache UIMA - AEs
                           Analysis Engines are described by a descriptor
                           (XML)

                           Can be Primitive (a single AE) or Aggregate (a
                           pipeline of AEs)

                           Analysis algorithms can be switched by changing
                           the descriptor instead of the code

                           Contain Type System definitions

                           Define Capabilities


Apache UIMA - AnalysisComponent API

                           initialize: performs (once) any startup tasks
                           required by this component

                           process: processes the resource to analyze,
                           generating analysis results (metadata)

                           destroy: frees all resources held; called only once,
                           when the framework has finished using this component
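
The call order of these three methods can be illustrated with a simplified stand-in. The `LifecycleComponent` interface below is hypothetical and deliberately not the real UIMA interface, whose methods take framework objects as parameters. The sketch only shows that `initialize` and `destroy` run once while `process` runs once per document.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for the three lifecycle methods named above.
// The real UIMA interface has different signatures; this only shows call order.
interface LifecycleComponent {
    void initialize();
    void process(String document, List<String> results);
    void destroy();
}

// Hypothetical component that counts words and records when each method runs.
class WordCounter implements LifecycleComponent {
    final List<String> calls = new ArrayList<>();

    @Override public void initialize() { calls.add("initialize"); }

    @Override public void process(String document, List<String> results) {
        calls.add("process");
        int words = document.isBlank() ? 0 : document.trim().split("\\s+").length;
        results.add("words=" + words);
    }

    @Override public void destroy() { calls.add("destroy"); }
}

class LifecycleDemo {
    // Drives a component the way a framework would: initialize once,
    // process each document, destroy once (even if processing fails).
    static List<String> runAll(LifecycleComponent component, List<String> documents) {
        List<String> results = new ArrayList<>();
        component.initialize();
        try {
            for (String document : documents) {
                component.process(document, results);
            }
        } finally {
            component.destroy();
        }
        return results;
    }

    public static void main(String[] args) {
        WordCounter counter = new WordCounter();
        System.out.println(runAll(counter, List.of("a b c", "hello world")));
        System.out.println(counter.calls);
    }
}
```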




Apache UIMA - Annotators
                           Analysis Engine algorithm

                             Annotator : A software component
                             implemented to produce and record
                             annotations over regions of an artifact
                             (e.g., text document, audio, and video)

                             Annotators implement the AnalysisComponent
                             interface



Apache UIMA - Roles
                           AnalysisEngine : High level block responsible
                           for analysis - contains at least one
                           AnalysisComponent

                           AnalysisComponent : interface for any
                           component responsible for analyzing artifacts

                           Annotator : implementation of
                           AnalysisComponent responsible for creating
                           Annotations


Apache UIMA - AEs

Analysis Engines in a Pipeline

Apache UIMA - Analysis Results


                           Where do analysis results end up?

                           How do annotators represent and share their
                           results?

                           CAS - Common Analysis Structure

                           Maintains typed indexes of extracted results
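
A toy model may help picture what "typed indexes" means. The `SharedAnalysis` class below is an assumption made for illustration and is not the UIMA CAS API: it holds the document text once and lets each component add results under a type name, so later components can look them up (records need Java 16 or later).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of the idea behind the CAS: one shared object holding the
// document text plus results indexed by type. Not the UIMA CAS API.
class SharedAnalysis {
    record Span(String type, int begin, int end) {}

    final String text;
    private final Map<String, List<Span>> indexByType = new HashMap<>();

    SharedAnalysis(String text) { this.text = text; }

    // Records a result under its type so later components can look it up.
    void add(String type, int begin, int end) {
        indexByType.computeIfAbsent(type, t -> new ArrayList<>())
                   .add(new Span(type, begin, end));
    }

    // Returns all results of one type, in insertion order.
    List<Span> select(String type) {
        return List.copyOf(indexByType.getOrDefault(type, List.of()));
    }

    String coveredText(Span span) { return text.substring(span.begin(), span.end()); }
}

class SharedAnalysisDemo {
    public static void main(String[] args) {
        SharedAnalysis cas = new SharedAnalysis("Michael J. Fox stars in the movie.");
        cas.add("Token", 0, 7);     // written by a first component
        cas.add("Person", 0, 14);   // written by a second component
        for (SharedAnalysis.Span p : cas.select("Person")) {
            System.out.println(p.type() + ": " + cas.coveredText(p));
        }
    }
}
```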



Common Analysis Structure

Which algorithms lie beneath AEs?



Apache UIMA & NLP
                           NLP (Natural Language Processing) is a
                           theoretically motivated range of
                           computational techniques for analyzing and
                           representing naturally occurring texts at one
                           or more levels of linguistic analysis for the
                           purpose of achieving human-like language
                           processing for a range of tasks or
                           applications

                           It’s an AI discipline


Apache UIMA & NLP
                           “accomplish human-like language processing”

                             Paraphrase an input text

                             Translate the text into another language

                             Answer questions about the contents of
                             the text

                             Draw inferences from the text   <--

Apache UIMA & NLP

                           “an NLP-based IR system has the goal of
                           providing more precise, complete information
                           in response to a user’s real information
                           need”

                           various levels of processing

                           that’s where we are!



Apache UIMA - First Approaches

                           Simplest : Write RegEx and Dictionaries and
                           mix them together

                           NLP-like : Tokenize -> Sentence identification
                           -> PoS Tagging -> Custom (Domain specific)
                           structures
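
The "simplest" approach can be sketched with the standard `java.util.regex` classes. The year pattern, the dictionary entries and all class names are assumptions chosen for the example. This is not one of the UIMA annotators, only the underlying idea of mixing a regular expression with a dictionary lookup.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of the "simplest" approach: a regular expression plus a dictionary.
// The pattern, the dictionary entries and all names are illustrative.
class RegexDictionaryDemo {
    record Span(String type, int begin, int end) {}

    private static final Pattern YEAR = Pattern.compile("\\b(19|20)\\d{2}\\b");
    private static final Pattern WORD = Pattern.compile("\\p{L}+");

    // Marks four-digit years (1900-2099) with a regular expression.
    static List<Span> findYears(String text) {
        List<Span> spans = new ArrayList<>();
        Matcher m = YEAR.matcher(text);
        while (m.find()) {
            spans.add(new Span("Year", m.start(), m.end()));
        }
        return spans;
    }

    // Marks every word that appears in the dictionary of names.
    static List<Span> findNames(String text, Set<String> dictionary) {
        List<Span> spans = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            if (dictionary.contains(m.group())) {
                spans.add(new Span("Name", m.start(), m.end()));
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        String text = "In 1985 Michael met Christopher.";
        List<Span> all = new ArrayList<>(findYears(text));
        all.addAll(findNames(text, Set.of("Michael", "Christopher")));
        for (Span s : all) {
            System.out.println(s.type() + " -> " + text.substring(s.begin(), s.end()));
        }
    }
}
```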




Analysis Engines in a Pipeline

Sample scenario - extract actors
                           Tokenize article text

                           Identify sentences

                           Tag PoS

                           Identify Persons using regular expressions and PoS

                           Use Person annotations, Tokens’ PoS and Sentences
                           to extract relations between terms to identify
                           Persons who are also Actors



Sample scenario - PersonAnnotator
                           I have a dictionary of names (simple to find and/or build)

                           I use a DictionaryAnnotator to extract NameAnnotations

                           I don’t have a dictionary of surnames

                           Every time a matching name (a NameAnnotation) is found, we
                           look for one or more subsequent tokens (to cover persons with
                           a double name or surname) whose PoS is “undefined” or a noun
                           (but not a verb) and which start with an uppercase letter

                           If found, the name + token(s) sequence annotates a
                           Person (e.g., “Michael J. Fox”)
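
A reduced version of this heuristic can be sketched in plain Java. It is not the annotator shown in the talk, and it makes two loud assumptions: the part-of-speech test is left out entirely, and the following tokens must be separated from the name only by whitespace. The token pattern and all names are illustrative. Without the PoS check, more false positives are to be expected.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of the heuristic described above, NOT the presenter's annotator.
// Assumption: the part-of-speech test is left out; "starts with an uppercase
// letter" is the only check on the tokens that follow a known name.
class PersonHeuristicDemo {
    record Span(String type, int begin, int end) {}

    // A token is either an initial such as "J." or a run of letters.
    private static final Pattern TOKEN = Pattern.compile("\\p{Lu}\\.|\\p{L}+");

    static List<Span> findPersons(String text, Set<String> names) {
        List<int[]> tokens = new ArrayList<>();           // {begin, end} pairs
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(new int[] {m.start(), m.end()});
        }
        List<Span> persons = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            int[] first = tokens.get(i);
            int j = i + 1;
            if (names.contains(text.substring(first[0], first[1]))) {
                // Extend over following capitalized tokens separated only by whitespace.
                while (j < tokens.size()
                        && Character.isUpperCase(text.charAt(tokens.get(j)[0]))
                        && text.substring(tokens.get(j - 1)[1], tokens.get(j)[0]).isBlank()) {
                    j++;
                }
                if (j > i + 1) {
                    persons.add(new Span("Person", first[0], tokens.get(j - 1)[1]));
                }
            }
            i = j; // continue after the tokens already consumed
        }
        return persons;
    }

    public static void main(String[] args) {
        String text = "Yesterday Michael J. Fox met Christopher Lloyd in town.";
        for (Span p : findPersons(text, Set.of("Michael", "Christopher"))) {
            System.out.println(text.substring(p.begin(), p.end()));
        }
    }
}
```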



PersonAnnotator sample

Sample scenario - articles about movies

Sample scenario
                           Getting actors can be simple if we know that
                           Persons who are also actors perform some well-known
                           actions

                           e.g.: a Person who is also an Actor “stars as”
                           CharacterInTheMovie (which will eventually be
                           tagged as a Person too)

                           e.g.: if the snippet “CharacterInTheMovie (Person)”
                           exists, then Person is usually an Actor

                           With these cues we can build an ActorAnnotator
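
The two cues can be checked with simple string tests. The sketch below assumes that Person names were already found by an earlier step and are passed in as plain strings. The class name and the sample text are illustrative, and a real annotator would work on annotations rather than strings.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the two cues above. Assumption: Person names were already found
// by an earlier step and are passed in as plain strings; names are illustrative.
class ActorCueDemo {
    // Returns the persons that match either cue:
    //   "<Person> stars as ..."   or   "<Character> (<Person>)"
    static List<String> findActors(String text, List<String> persons) {
        List<String> actors = new ArrayList<>();
        for (String person : persons) {
            boolean starsAs = text.contains(person + " stars as ");
            boolean inParentheses = text.contains("(" + person + ")");
            if (starsAs || inParentheses) {
                actors.add(person);
            }
        }
        return actors;
    }

    public static void main(String[] args) {
        String text = "Michael J. Fox stars as Marty McFly, who helps Doc Brown "
                + "(Christopher Lloyd). The film was directed by Robert Zemeckis.";
        List<String> persons = List.of("Michael J. Fox", "Marty McFly", "Doc Brown",
                "Christopher Lloyd", "Robert Zemeckis");
        System.out.println(findActors(text, persons));
    }
}
```

Characters and the director are tagged as Persons but match neither cue, so only the two actors are returned.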


Sample scenario

Apache UIMA experience
                           Examples are available under SVN at

                              http://svn.apache.org/repos/asf/uima/uimaj/trunk/uimaj-examples/

                           The getting started guides are also a very useful
                           way to become familiar with UIMA

                              http://uima.apache.org/documentation.html#getting_started

                           Subscribe to the users@ and dev@uima.apache.org mailing lists


Apache UIMA - Components

                           Type Systems

                           Analysis Engines

                           CAS

                           Collection Processing Manager/Engine

                           Flow Controllers

                           CAS Consumers

                           Asynchronous Scaleout

                           Sandbox Components

                           Eclipse Plugins

                           Tools


Apache UIMA - Flow Controllers



                           A component which implements the
                           interfaces needed to specify a custom flow
                           within an Aggregate Analysis Engine

                           Enabling conditional pipelines
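
A conditional pipeline can be pictured as a routing function that looks at the results so far and picks the next step. The sketch below is a plain-Java illustration with hypothetical step names and a deliberately naive language test. It does not use or reproduce UIMA's flow controller interfaces.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Toy conditional pipeline. The step names and routing rule are hypothetical
// and this is not UIMA's flow controller interface.
class ConditionalFlowDemo {
    // Runs steps until the flow function returns null. The flow function sees
    // the results so far and picks the name of the next step to run.
    static List<String> run(String document,
                            Map<String, Function<String, String>> steps,
                            Function<List<String>, String> flow) {
        List<String> results = new ArrayList<>();
        String next = flow.apply(results);
        while (next != null) {
            results.add(steps.get(next).apply(document));
            next = flow.apply(results);
        }
        return results;
    }

    // Routing rule: detect the language first, then run the matching step only.
    static String route(List<String> results) {
        if (results.isEmpty()) return "detectLanguage";
        if (results.size() == 1) {
            return results.get(0).equals("lang=it") ? "italianAnalysis" : "englishAnalysis";
        }
        return null; // done
    }

    // Naive stand-in steps; the language test is only good enough for the demo.
    static Map<String, Function<String, String>> defaultSteps() {
        return Map.of(
                "detectLanguage", doc -> doc.contains(" il ") ? "lang=it" : "lang=en",
                "italianAnalysis", doc -> "analysed-as-italian",
                "englishAnalysis", doc -> "analysed-as-english");
    }

    public static void main(String[] args) {
        System.out.println(run("Ho visto il film ieri", defaultSteps(), ConditionalFlowDemo::route));
        System.out.println(run("I saw the movie", defaultSteps(), ConditionalFlowDemo::route));
    }
}
```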




Apache UIMA - CAS Consumers




                           Components responsible for taking the
                           results from the CAS and storing them into a
                           database, or other storage device
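
For the sample scenario, a consumer could store which articles mention which actors. In this sketch an in-memory map stands in for the database. The `ActorIndexConsumer` class and the article ids are assumptions for illustration, not part of UIMA.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Sketch of a consumer: it takes finished analysis results and stores them.
// Assumption: an in-memory map stands in for the database; names are illustrative.
class ActorIndexConsumer {
    private final Map<String, Set<String>> articlesByActor = new TreeMap<>();

    // Called once per analysed article with the actors found in it.
    void consume(String articleId, List<String> actors) {
        for (String actor : actors) {
            articlesByActor.computeIfAbsent(actor, a -> new TreeSet<>()).add(articleId);
        }
    }

    // Query made possible by the stored metadata: articles mentioning an actor.
    Set<String> articlesWith(String actor) {
        return articlesByActor.getOrDefault(actor, Set.of());
    }
}

class ConsumerDemo {
    public static void main(String[] args) {
        ActorIndexConsumer consumer = new ActorIndexConsumer();
        consumer.consume("article-1", List.of("Michael J. Fox", "Christopher Lloyd"));
        consumer.consume("article-2", List.of("Christopher Lloyd"));
        System.out.println(consumer.articlesWith("Christopher Lloyd"));
    }
}
```

Two articles that share an actor end up under the same key, which is what makes the search for “similar” articles possible.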




Apache UIMA - Collection Processing and a bigger picture




Apache UIMA - Asynchronous Scaleout
                           An add-on to the base Java framework,
                           supporting a very flexible scaleout capability
                           based on JMS (Java Message Service) and
                           Apache ActiveMQ (a messaging and integration
                           patterns provider)

                           A powerful clustering solution, very useful
                           when the size of the source documents is huge



Apache UIMA - Sandbox Basics

                           Tokenizer

                           HMM Tagger

                           Dictionaries (DictionaryAnnotator,
                           ConceptMapper)

                           Snowball

                           ConfigurableFeatureExtractor


Apache UIMA - External Services



                           External IE engines that expose web services
                           can be integrated easily into UIMA:

                             AlchemyAPI Annotator

                             OpenCalais Annotator




Apache UIMA - Tika

                           Apache Tika is a toolkit for detecting and
                           extracting metadata and structured text
                           content from various documents using
                           existing parser libraries. The TikaAnnotator
                           uses Tika to generate annotations
                           representing the original markup of a
                           document, extract its text and metadata



Apache UIMA - Lucas

                           Very useful to build search engines!

                             stores CAS data in Lucene indexes

                             transforms annotation objects of a CAS
                             into Lucene token streams which are
                             stored in a Lucene document




Apache UIMA - Tools

                           JCasGen

                           PEAR Installer, Merger, Packager

                           Component Descriptor Editor

                           CPE Configurator

                           Java Annotation Viewer

                           CAS Visual Debugger

                           Document Analyzer




Apache UIMA
                           We can aggregate existing components or
                           write and deploy our new ones

                           There are lots of repositories for UIMA
                           containing open source analysis engines, type
                           systems, etc...

                           We do, however, need to know our domain
                           well enough

                           Please mind the “false positives” issue

References
                           http://www.apache.org

                           http://uima.apache.org

                           http://www.oasis-open.org

                           http://www.cnlp.org/publications/03NLP.LIS.Encyclopedia.pdf

                           http://nlp.stanford.edu/

                           http://www.opencalais.com/gnosis/

                           http://www.dsi.unive.it/~marin/docs/hmm-it.pdf

                           http://en.wikipedia.org/wiki/Hidden_Markov_model




Apache UIMA and Metadata Generation

  • 1. Apache UIMA and Metadata Generation Gestione delle Informazioni su Web - 2009/2010 Tommaso Teofili tommaso [at] apache [dot] org mercoledì 14 aprile 2010
  • 2. Agenda Unstructured information management The ASF Apache UIMA Goals Overview Components Usage mercoledì 14 aprile 2010
  • 3. UIM ? Unstructured Information Management A wide topic: text, audio, video Different (possibly mixed) approaches (NLP, Machine Learning, IR, Ontologies, Automated reasoning, Knowledge Sources) Apache UIMA mercoledì 14 aprile 2010
  • 4. Apache Software Foundation No profit corporation “...provides organizational, legal, and financial support for a broad range of open source software projects...” “...collaborative and meritocratic development process...” “...pragmatic Apache License...” mercoledì 14 aprile 2010
  • 5. Apache UIMA Architectural framework to manage unstructured data (Java, C++) Just graduated as Apache Top Level Project Former IBM research project donated to ASF OASIS Standard mercoledì 14 aprile 2010
  • 6. Apache UIMA - Goals “Our goal is to support a thriving community of users and developers of UIMA frameworks, tools, and annotators, facilitating the analysis of unstructured content such as text, audio and video” mercoledì 14 aprile 2010
  • 7. Apache UIMA - bridging worlds
  • 8. Apache UIMA - Overview: UIMA supports the development, discovery, composition and deployment of multi-modal analytics for the analysis of unstructured information and its integration with search technologies.
  • 9. Apache UIMA - Multimodal Analysis: Multimodal analysis means the ability to process a resource from various “points of view”. Example: a video stream from which we want to extract subtitles and also automatically recognize the actors involved. Here, though, we are mainly interested in text.
  • 10. Sample scenario: A Content Management System containing free-text articles about movies. We want such articles to be automatically enriched with metadata contained in the text (movies, directors, actors/actresses, distribution) and linked to “similar” articles (i.e., those dealing with the same movies or actors), so that we can search for “similar” articles.
  • 11. Sample scenario - articles about movies
  • 12. Sample scenario: UIMA can help enrich articles with metadata. Think of filling the instance variables of an Article.java object with proper values, then persisting it to a database so that we can query for articles dealing with the same actors.
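The Article idea from slide 12 can be pictured with a small plain-Java sketch. The slides only mention Article.java, so the field names, the in-memory ArticleStore (standing in for the database) and the "shares at least one actor" notion of similarity are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical domain object: field names are assumptions, not taken from the slides.
class Article {
    final String title;
    final Set<String> movies = new LinkedHashSet<>();
    final Set<String> directors = new LinkedHashSet<>();
    final Set<String> actors = new LinkedHashSet<>();

    Article(String title) {
        this.title = title;
    }
}

// In-memory stand-in for the database mentioned on the slide.
class ArticleStore {
    private final List<Article> articles = new ArrayList<>();

    void save(Article article) {
        articles.add(article);
    }

    // Returns every other stored article that shares at least one actor with the query.
    List<Article> similarTo(Article query) {
        List<Article> result = new ArrayList<>();
        for (Article other : articles) {
            if (other != query && !Collections.disjoint(other.actors, query.actors)) {
                result.add(other);
            }
        }
        return result;
    }
}
```

In this sketch the metadata extraction itself is left out: the sets would be filled from the analysis results, and a real system would replace similarTo with a database query.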
  • 13. Filling Article with metadata
  • 14. Sample scenario - metadata
  • 15. UIMA - Annotations and Entities
  • 16. Apache UIMA - Annotation: The association of metadata, such as a label, with a region of text (or another type of artifact). For example, the label “Person” associated with the region of text “Fred Center” constitutes an annotation. We say “Person” annotates the span of text from X to Y containing exactly “Fred Center”.
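The "span from X to Y" wording on slide 16 can be made concrete with a minimal plain-Java sketch. SpanAnnotation and SpanAnnotationDemo are hypothetical names, not UIMA classes; UIMA's own annotation objects expose the same idea through getBegin(), getEnd() and getCoveredText().

```java
// Minimal stand-in for an annotation: a label attached to a character span.
class SpanAnnotation {
    final String label;
    final int begin; // offset of the first covered character (inclusive)
    final int end;   // offset just past the last covered character (exclusive)

    SpanAnnotation(String label, int begin, int end) {
        this.label = label;
        this.begin = begin;
        this.end = end;
    }

    // Returns exactly the region of the document that this annotation covers.
    String coveredText(String document) {
        return document.substring(begin, end);
    }
}

class SpanAnnotationDemo {
    // Annotates the first occurrence of phrase with label; returns null if it is absent.
    static SpanAnnotation annotate(String document, String label, String phrase) {
        int begin = document.indexOf(phrase);
        if (begin < 0) {
            return null;
        }
        return new SpanAnnotation(label, begin, begin + phrase.length());
    }
}
```

For the document "Yesterday Fred Center gave a talk.", annotating "Fred Center" as "Person" gives X = 10 and Y = 21.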
  • 17. Apache UIMA - Basic Steps: (1) Domain model definition. (2) Analysis pipeline definition, arranging components: define components draining data from sources; add and customize analysis components (patterns, dictionaries, RegEx, external services, NLP, etc.); define components outputting information to target storage. (3) Analysis pipeline(s) execution.
  • 18. Defining the domain model within UIMA using Type Systems: The Type System is where we describe which metadata we would like to extract. Low representational gap. Like almost everything in UIMA, it is described (and generated!) using XML. Multiple Type Systems can be defined for different purposes.
  • 19. Defining the domain model within UIMA using Type Systems: Define at least one Type inside the Type System for each object in the domain model. It is useful to define more fine-grained Types (for the values of type properties, called Features). If we want to extract information about articles, we create an Article type inside the Type System. We will also need to create annotations/entities for movies, actors, directors, etc. Types usually extend Annotation or TOP.
  • 20. Type System for Articles
  • 21. How does UIMA extract metadata?
  • 22. Apache UIMA - Analysis Engines: Basic UIMA building blocks. Analyze a document. Infer and record descriptive attributes (about documents/regions). Generate analysis results.
  • 23. Apache UIMA - AEs: Analysis Engines are described by a descriptor (XML). They can be Primitive (a single AE) or Aggregate (a pipeline of AEs). Analysis algorithms can be switched by changing the descriptor instead of the code. Descriptors contain Type System definitions and define Capabilities.
  • 24. Apache UIMA - AnalysisComponent API: initialize: performs (once) any startup tasks required by this component. process: processes the resource to analyze, generating analysis results (metadata). destroy: frees all resources held; called only once, when the framework has finished using this component.
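The initialize / process / destroy lifecycle from slide 24 can be sketched in plain Java as follows. This is a simplified stand-in: LifecycleComponent, CapitalizedWordComponent and LifecycleRunner are hypothetical names, and UIMA's real AnalysisComponent interface works on a UimaContext and a CAS rather than on strings and lists.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for the three lifecycle methods named on the slide.
// Assumption: this is NOT the signature of UIMA's AnalysisComponent interface.
interface LifecycleComponent {
    void initialize();                                   // one-time startup work
    void process(String document, List<String> results); // called once per document
    void destroy();                                      // one-time cleanup
}

// Hypothetical component: records every word starting with an uppercase letter.
class CapitalizedWordComponent implements LifecycleComponent {
    int initializeCalls;
    int processCalls;
    int destroyCalls;

    public void initialize() {
        initializeCalls++;
    }

    public void process(String document, List<String> results) {
        processCalls++;
        for (String word : document.split("\\s+")) {
            if (!word.isEmpty() && Character.isUpperCase(word.charAt(0))) {
                results.add(word);
            }
        }
    }

    public void destroy() {
        destroyCalls++;
    }
}

// Plays the role of the framework: drives the lifecycle in the documented order.
class LifecycleRunner {
    static List<String> run(LifecycleComponent component, List<String> documents) {
        List<String> results = new ArrayList<>();
        component.initialize();
        try {
            for (String document : documents) {
                component.process(document, results);
            }
        } finally {
            component.destroy(); // runs even if processing fails
        }
        return results;
    }
}
```

The point of the sketch is the call order: the component never calls its own lifecycle methods; the surrounding runner (the framework, in UIMA's case) does.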
  • 25. Apache UIMA - Annotators: The algorithm of an Analysis Engine. Annotator: a software component implemented to produce and record annotations over regions of an artifact (e.g., text document, audio, and video). Annotators implement the AnalysisComponent interface.
  • 26. Apache UIMA - Roles: AnalysisEngine: high-level block responsible for analysis; contains at least one AnalysisComponent. AnalysisComponent: interface for any component responsible for analyzing artifacts. Annotator: implementation of AnalysisComponent responsible for creating Annotations.
  • 27. Apache UIMA - AEs
  • 28. Analysis Engines in a Pipeline
  • 29. Apache UIMA - Analysis Results: Where do analysis results end up? How do annotators represent and share their results? In the CAS (Common Analysis Structure), which maintains typed indexes of extracted results.
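The role of the CAS on slide 29, a shared structure holding the document plus results indexed by type, can be illustrated with a toy plain-Java version. MiniCas and MiniCasSteps are hypothetical names and this is not the UIMA CAS API; the sketch only shows how one step can read what an earlier step recorded.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy shared analysis structure: the document text plus spans indexed by type name.
class MiniCas {
    private final String text;
    private final Map<String, List<int[]>> index = new HashMap<>();

    MiniCas(String text) {
        this.text = text;
    }

    String getText() {
        return text;
    }

    // Records a [begin, end) span under the given type name.
    void add(String type, int begin, int end) {
        index.computeIfAbsent(type, k -> new ArrayList<>()).add(new int[] {begin, end});
    }

    // Returns all spans of the given type, or an empty list if none were recorded.
    List<int[]> get(String type) {
        return index.getOrDefault(type, Collections.<int[]>emptyList());
    }

    String covered(int[] span) {
        return text.substring(span[0], span[1]);
    }
}

class MiniCasSteps {
    // First step: records one "Token" span per whitespace-delimited word.
    static void tokenize(MiniCas cas) {
        Matcher matcher = Pattern.compile("\\S+").matcher(cas.getText());
        while (matcher.find()) {
            cas.add("Token", matcher.start(), matcher.end());
        }
    }

    // Second step: reads the "Token" spans left by the first step and
    // records a "Capitalized" span for each token starting in uppercase.
    static void markCapitalized(MiniCas cas) {
        for (int[] token : cas.get("Token")) {
            if (Character.isUpperCase(cas.getText().charAt(token[0]))) {
                cas.add("Capitalized", token[0], token[1]);
            }
        }
    }
}
```

The two steps never call each other: they communicate only through the shared structure, which is the design idea the slide attributes to the CAS.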
  • 31. Which algorithms lie under AEs?
  • 32. Apache UIMA & NLP: NLP (Natural Language Processing) is a theoretically motivated range of computational techniques for analyzing and representing naturally occurring texts at one or more levels of linguistic analysis for the purpose of achieving human-like language processing for a range of tasks or applications. It’s an AI discipline.
  • 33. Apache UIMA & NLP: “accomplish human-like language processing”: paraphrase an input text; translate the text into another language; answer questions about the contents of the text; draw inferences from the text (the item highlighted on the slide).
  • 34. Apache UIMA & NLP: “an NLP-based IR system has the goal of providing more precise, complete information in response to a user’s real information need”. Various levels of processing. That’s where we are!
  • 35. Apache UIMA - First Approaches: Simplest: write RegEx and dictionaries and mix them together. NLP-like: Tokenize -> Sentence identification -> PoS tagging -> Custom (domain-specific) structures.
  • 36. Analysis Engines in a Pipeline
  • 37. Sample scenario - extract actors: Tokenize the article text. Identify sentences. Tag PoS. Identify Persons using regular expressions and PoS. Use Person annotations, Tokens’ PoS and Sentences to extract relations between terms and identify Persons who are also Actors.
  • 38. Sample scenario - PersonAnnotator: I have a dictionary of names (simple to find and/or build) and I use a DictionaryAnnotator to extract NameAnnotations. I don’t have a dictionary of surnames. Every time a matching name (a NameAnnotation) is found, we look for one or more subsequent tokens (to account for persons with a double name or surname) whose PoS is “undefined” or a noun (but not a verb) and which start with an uppercase letter. If found, the name + token(s) sequence annotates a Person (e.g., “Michael J. Fox”).
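The PersonAnnotator rule from slide 38 can be sketched over already tokenized and PoS-tagged input. This is a plain-Java illustration, not the author's annotator and not the UIMA DictionaryAnnotator: TaggedToken and PersonRuleSketch are hypothetical names, the PoS tags are plain strings assumed to come from an earlier tagging step, and the name dictionary is passed in as a set.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// A token together with the PoS tag assigned by an earlier (assumed) tagging step.
class TaggedToken {
    final String text;
    final String pos; // e.g. "noun", "verb", "undefined"

    TaggedToken(String text, String pos) {
        this.text = text;
        this.pos = pos;
    }
}

class PersonRuleSketch {
    // A token may continue a person's name if it starts with an uppercase
    // letter and its PoS is "undefined" or "noun" (so verbs are excluded).
    static boolean isSurnameCandidate(TaggedToken token) {
        boolean posAllowed = "undefined".equals(token.pos) || "noun".equals(token.pos);
        return posAllowed && !token.text.isEmpty() && Character.isUpperCase(token.text.charAt(0));
    }

    // For every token found in the name dictionary, greedily collects the
    // following surname candidates; name + at least one candidate is a Person.
    static List<String> findPersons(List<TaggedToken> tokens, Set<String> names) {
        List<String> persons = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            if (!names.contains(tokens.get(i).text)) {
                continue;
            }
            StringBuilder person = new StringBuilder(tokens.get(i).text);
            int j = i + 1;
            while (j < tokens.size() && isSurnameCandidate(tokens.get(j))) {
                person.append(' ').append(tokens.get(j).text);
                j++;
            }
            if (j > i + 1) {
                persons.add(person.toString());
                i = j - 1; // skip the tokens already consumed by this person
            }
        }
        return persons;
    }
}
```

Because the rule is greedy, a capitalized noun that directly follows a surname would be swallowed into the name, which is one source of the false positives the deck warns about later.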
  • 40. Sample scenario - articles about movies
  • 41. Sample scenario: Getting actors can be simple if we know that Persons who are also actors perform some well-known actions. E.g., a Person who “stars as” a CharacterInTheMovie (which will eventually be tagged as a Person too) is also an Actor. E.g., if the snippet “CharacterInTheMovie (Person)” exists, then Person is usually an Actor. We can then build an ActorAnnotator.
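The two cues from slide 41 can be sketched with regular expressions over the raw text, given the Persons found by an earlier step. This is a plain-Java illustration with hypothetical names (ActorRuleSketch, findActors), not the ActorAnnotator from the talk. It also simplifies the second cue: it only checks that a Person appears in parentheses, without verifying that a character name precedes it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class ActorRuleSketch {
    // Returns the persons that match one of the two cues:
    //   1. "<Person> stars as ..."
    //   2. "... (<Person>)", i.e. the person's name inside parentheses
    static List<String> findActors(String text, List<String> persons) {
        List<String> actors = new ArrayList<>();
        for (String person : persons) {
            String quoted = Pattern.quote(person); // names may contain '.' and other regex characters
            boolean starsAs = Pattern.compile(quoted + "\\s+stars\\s+as\\b").matcher(text).find();
            boolean inParentheses = Pattern.compile("\\(\\s*" + quoted + "\\s*\\)").matcher(text).find();
            if (starsAs || inParentheses) {
                actors.add(person);
            }
        }
        return actors;
    }
}
```

Note that characters such as Marty McFly are Persons too but match neither cue as the subject, so they are not promoted to Actors.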
  • 43. Apache UIMA experience: Some examples are available under SVN at http://svn.apache.org/repos/asf/uima/uimaj/trunk/uimaj-examples/ and the getting started guides are a very useful way to get in touch with UIMA: http://uima.apache.org/documentation.html#getting_started Subscribe to the users@ and dev@uima.apache.org mailing lists.
  • 44. Apache UIMA - Components: Type Systems; Analysis Engines; CAS; Collection Processing Manager/Engine; Flow Controllers; CAS Consumers; Asynchronous Scaleout; Sandbox Components; Eclipse Plugins; Tools.
  • 45. Apache UIMA - Flow Controllers: A component that implements the interfaces needed to specify a custom flow within an Aggregate Analysis Engine, enabling conditional pipelines.
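What "conditional pipeline" means on slide 45 can be illustrated with a plain-Java sketch in which a rule picks the next step after looking at the document's current state. All names here (FlowDoc, FlowStep, FlowRule, ConditionalPipeline, FlowDemo) are hypothetical and this is not UIMA's FlowController API; the toy language check is likewise just an assumption for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Document state shared by the steps, plus a trace of the steps that ran.
class FlowDoc {
    final String text;
    String language = "unknown";
    final List<String> trace = new ArrayList<>();

    FlowDoc(String text) {
        this.text = text;
    }
}

interface FlowStep {
    void process(FlowDoc doc);
}

// Decides which step runs next; lastStep is null at the start, and
// returning null ends the flow.
interface FlowRule {
    String next(FlowDoc doc, String lastStep);
}

class ConditionalPipeline {
    private final Map<String, FlowStep> steps = new HashMap<>();
    private final FlowRule rule;

    ConditionalPipeline(FlowRule rule) {
        this.rule = rule;
    }

    void register(String name, FlowStep step) {
        steps.put(name, step);
    }

    void run(FlowDoc doc) {
        String name = rule.next(doc, null);
        while (name != null) {
            steps.get(name).process(doc);
            doc.trace.add(name);
            name = rule.next(doc, name);
        }
    }
}

class FlowDemo {
    // Runs language detection first, then the English tagger only for English text.
    static ConditionalPipeline build() {
        ConditionalPipeline pipeline = new ConditionalPipeline((doc, last) -> {
            if (last == null) {
                return "language";
            }
            if (last.equals("language") && doc.language.equals("en")) {
                return "englishTagger";
            }
            return null;
        });
        // Toy language check (assumption): English if the text contains " the ".
        pipeline.register("language", doc -> doc.language = doc.text.contains(" the ") ? "en" : "unknown");
        pipeline.register("englishTagger", doc -> { /* placeholder for an English-only step */ });
        return pipeline;
    }
}
```

With a fixed pipeline every step would run on every document; here the routing decision is separated from the steps, so it can change without touching them.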
  • 46. Apache UIMA - CAS Consumers: Components responsible for taking the results from the CAS and storing them in a database or other storage device.
  • 47. Apache UIMA - Collection Processing and a bigger picture
  • 48. Apache UIMA - Asynchronous Scaleout: An add-on to the base Java framework, supporting a very flexible scaleout capability based on JMS (Java Message Service) and Apache ActiveMQ (a messaging and integration patterns provider). A powerful clustering solution, very useful when the size of the source documents is huge.
  • 49. Apache UIMA - Sandbox Basics: Tokenizer; HMM Tagger; Dictionaries (DictionaryAnnotator, ConceptMapper); Snowball; ConfigurableFeatureExtractor.
  • 50. Apache UIMA - External Services: External IE engines exposing web services are easily integrated into UIMA: AlchemyAPI Annotator, OpenCalais Annotator.
  • 51. Apache UIMA - Tika: Apache Tika is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries. The TikaAnnotator uses Tika to generate annotations representing the original markup of a document and to extract its text and metadata.
  • 52. Apache UIMA - Lucas: Very useful for building search engines! Stores CAS data in Lucene indexes: transforms the annotation objects of a CAS into Lucene token streams, which are stored in a Lucene document.
  • 53. Apache UIMA - Tools: JCasGen; PEAR Installer, Merger, Packager; Component Descriptor Editor; CPE Configurator; Java Annotation Viewer; CAS Visual Debugger; Document Analyzer.
  • 54. Apache UIMA: We can aggregate existing components or write and deploy new ones of our own. There are lots of repositories for UIMA containing open source analysis engines, type systems, etc. We do, though, have to know our domain well enough. Please mind the “false positives” issue.
  • 55. References: http://www.apache.org http://uima.apache.org http://www.oasis-open.org http://www.cnlp.org/publications/03NLP.LIS.Encyclopedia.pdf http://nlp.stanford.edu/ http://www.opencalais.com/gnosis/ http://www.dsi.unive.it/~marin/docs/hmm-it.pdf http://en.wikipedia.org/wiki/Hidden_Markov_model