Apache UIMA
   Introduction
Gestione delle Informazioni su Web - 2010/2011
                Tommaso Teofili
        tommaso [at] apache [dot] org
UIM ?

Unstructured Information Management

A wide topic: text, audio, video

  Different (possibly mixed) approaches
  (NLP, Machine Learning, IR, Ontologies,
  Automated reasoning, Knowledge Sources)

Apache UIMA
Apache Software Foundation

  Non-profit corporation

  “...provides organizational, legal, and financial
  support for a broad range of open source
  software projects...”

  “...collaborative and meritocratic development
  process...”

  “...pragmatic Apache License...”
Apache UIMA


Architectural framework to manage
unstructured data (Java, C++, ...)

Former IBM research project donated to ASF

OASIS Standard for unstructured
information management
Apache UIMA - Goals


“Our goal is to support a thriving community
of users and developers of UIMA
frameworks, tools, and annotators, facilitating
the analysis of unstructured content such as
text, audio and video”
Apache UIMA - bridging worlds
Apache UIMA - Overview


 UIMA supports the development, discovery,
 composition and deployment of multi-modal
 analytics for the analysis of unstructured
 information and its integration with search
 technologies
Apache UIMA -
 Multimodal Analysis
Multimodal analysis is the ability to process
a resource from various
“points of view”

Example: a video stream from which we want to
extract subtitles and also automatically
recognize the actors involved

Here, though, we are mainly interested in text
Sample scenario
Content Management System containing free
text articles about movies

We want such articles to be automatically
enriched with metadata contained inside the
text (movies, directors, actors/actresses,
distribution) and linked to “similar” articles
(i.e. dealing with the same movies or actors)

So that we can search for “similar” articles
Sample scenario - articles
      about movies
Sample scenario

UIMA can help enrich articles with
metadata

Think of filling the instance variables of an
Article.java object with proper values

Then persisting it to a database, so that we can
query for articles dealing with the same actors
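
To make the scenario concrete, here is a minimal plain-Java sketch of such an Article object and of the "same actors" lookup. The `Article` record, its field names, and the in-memory list standing in for the database are all assumptions made for illustration; the titles and names are placeholders.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

class ArticleSketch {
    // Hypothetical domain object; the fields mirror the metadata listed in the scenario.
    record Article(String title, Set<String> movies, Set<String> directors, Set<String> actors) {}

    // Returns the articles (other than `target`) that share at least one actor with it.
    static List<Article> sameActors(Article target, List<Article> all) {
        return all.stream()
                .filter(a -> a != target)
                .filter(a -> a.actors().stream().anyMatch(target.actors()::contains))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Placeholder data, standing in for what the analysis would extract.
        Article first = new Article("Review 1", Set.of("Movie X"), Set.of("Director D"),
                Set.of("Actor A", "Actor B"));
        Article second = new Article("Review 2", Set.of("Movie Y"), Set.of("Director E"),
                Set.of("Actor B"));
        Article third = new Article("Review 3", Set.of("Movie Z"), Set.of("Director D"),
                Set.of("Actor C"));
        for (Article similar : sameActors(first, List.of(first, second, third))) {
            System.out.println(similar.title());
        }
    }
}
```

In this sketch only the second article is reported as similar to the first, because it is the only one sharing an actor with it.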
Filling Article with metadata
Sample scenario - metadata
UIMA - Annotations
Apache UIMA -
       Annotation

The association of metadata, such as a label,
with a region of text (or other type of artifact).

For example, the label “Person” associated with a
region of text “Fred Center” constitutes an
annotation. We say “Person” annotates the span
of text from X to Y containing exactly “Fred
Center”
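
The definition above can be sketched in a few lines of plain Java. This is not the UIMA API: the `SpanAnnotation` record and the `annotate` helper are hypothetical names, and the sample sentence is made up. X and Y become the `begin` and `end` offsets of the span.

```java
class AnnotationSketch {
    // Hypothetical sketch: an annotation is a label plus a [begin, end) span over the text.
    record SpanAnnotation(String label, int begin, int end) {
        // The text covered by this annotation within the given document.
        String coveredText(String document) {
            return document.substring(begin, end);
        }
    }

    // Annotates the first occurrence of `phrase` in `document` with `label`.
    static SpanAnnotation annotate(String document, String phrase, String label) {
        int begin = document.indexOf(phrase);
        if (begin < 0) {
            throw new IllegalArgumentException("phrase not found: " + phrase);
        }
        return new SpanAnnotation(label, begin, begin + phrase.length());
    }

    public static void main(String[] args) {
        String document = "Yesterday Fred Center visited the lab.";
        SpanAnnotation person = annotate(document, "Fred Center", "Person");
        System.out.println(person.label() + " annotates [" + person.begin() + ", "
                + person.end() + "): " + person.coveredText(document));
    }
}
```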
Apache UIMA - Basic Steps

  Domain model definition

  Analysis pipeline definition

  Arrange components:

      Define components reading data from sources

      Add and customize analysis components: Patterns,
      Dictionaries, RegEx, External services, NLP, etc...

      Define components outputting information on target
      storages

  Analysis pipeline(s) execution
Defining domain model within
 UIMA using Type Systems

 Type System is the place where we describe which
 metadata we would like to extract

 Low representational gap

 Like almost everything in UIMA: described (and
 generated!) using XML

 Possible to define multiple Type Systems for different
 purposes
How does UIMA extract
     metadata?
Apache UIMA - Analysis
       Engines

 Basic UIMA building blocks

 Analyze a document

   Infer and record descriptive attributes
   (about documents/regions)

 Generating analysis results
Apache UIMA - AEs
Analysis Engines are described by a descriptor
(XML)

Can be Primitive (a single AE) or Aggregate (a
pipeline of AEs)

Analysis algorithms can be switched by changing
the descriptor instead of the code

Contain Type System definitions

Define Capabilities
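
The primitive/aggregate distinction can be illustrated with a small composition sketch in plain Java. The `Engine` interface and the `aggregate` helper are hypothetical and do not reflect the UIMA API or its XML descriptors; the point is only that an aggregate behaves like a single engine that delegates to its parts in order.

```java
import java.util.ArrayList;
import java.util.List;

class AggregateSketch {
    // Hypothetical engine contract: read the text, append results to a shared list.
    interface Engine {
        void process(String text, List<String> results);
    }

    // An aggregate is itself an Engine that runs its parts in the given order.
    static Engine aggregate(List<Engine> parts) {
        return (text, results) -> {
            for (Engine part : parts) {
                part.process(text, results);
            }
        };
    }

    public static void main(String[] args) {
        // Two trivial "primitive" engines, invented for the example.
        Engine length = (text, results) -> results.add("length=" + text.length());
        Engine words = (text, results) -> results.add("words=" + text.trim().split("\\s+").length);

        Engine pipeline = aggregate(List.of(length, words));
        List<String> results = new ArrayList<>();
        pipeline.process("UIMA analyzes text", results);
        System.out.println(results);
    }
}
```

Swapping one part for another changes the analysis without touching the code that runs the pipeline, which is the role the descriptor plays in UIMA.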
Apache UIMA -
AnalysisComponent API

 initialize : Performs (once) any startup tasks
 required by this component

 process : Process the resource to analyze
 generating analysis results (metadata)

 destroy : Frees all resources held; called only once,
 when the framework has finished using this component
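
The lifecycle can be sketched as follows in plain Java. The `Component` interface below is a simplified, hypothetical stand-in: it borrows the three method names from the slide, but its signatures are assumptions and not those of the real UIMA interface. The `run` driver shows the calling order: initialize once, process once per document, destroy once.

```java
import java.util.ArrayList;
import java.util.List;

class LifecycleSketch {
    // Hypothetical stand-in for the three lifecycle methods; signatures are simplified.
    interface Component {
        void initialize();
        void process(String document);
        void destroy();
    }

    // Records the order of lifecycle calls so that it can be inspected.
    static class LoggingComponent implements Component {
        final List<String> calls = new ArrayList<>();
        public void initialize() { calls.add("initialize"); }
        public void process(String document) { calls.add("process:" + document); }
        public void destroy() { calls.add("destroy"); }
    }

    // Driver: initialize once, process each document, destroy once (even on failure).
    static void run(Component component, List<String> documents) {
        component.initialize();
        try {
            for (String document : documents) {
                component.process(document);
            }
        } finally {
            component.destroy();
        }
    }

    public static void main(String[] args) {
        LoggingComponent component = new LoggingComponent();
        run(component, List.of("doc1", "doc2"));
        System.out.println(component.calls);
    }
}
```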
Apache UIMA -
       Annotators
Analysis Engine algorithm

  Annotator : A software component
  implemented to produce and record
  annotations over regions of an artifact
  (e.g., text document, audio, and video)

  Annotators implement AnalysisComponent
  interface
Apache UIMA - Roles
AnalysisEngine : High level block responsible
for analysis - contains at least one
AnalysisComponent

AnalysisComponent : interface for any
component responsible for analyzing artifacts

Annotator : implementation of
AnalysisComponent responsible for creating
Annotations
Apache UIMA - AEs
Analysis Engines in a
      Pipeline
Apache UIMA - Analysis Results


  Where do analysis results end up?

  How do annotators represent and share their
  results?

  CAS - Common Analysis Structure

  Maintains typed indexes of extracted results
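
As a rough mental model only, the idea of a shared structure with typed indexes might look like the plain-Java sketch below. `MiniCas`, `Span`, and the two analysis steps are hypothetical and far simpler than the real CAS; they only show how a later step can reuse the results an earlier step stored, looked up by type.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class MiniCasSketch {
    record Span(String type, int begin, int end) {}

    // Hypothetical, much-simplified stand-in for a CAS: the text under analysis
    // plus the analysis results, indexed by their type name.
    static class MiniCas {
        final String text;
        private final Map<String, List<Span>> index = new LinkedHashMap<>();

        MiniCas(String text) { this.text = text; }

        void add(Span span) {
            index.computeIfAbsent(span.type(), t -> new ArrayList<>()).add(span);
        }

        List<Span> select(String type) {
            return index.getOrDefault(type, List.of());
        }
    }

    // First step: record one "Token" span per whitespace-separated word.
    static void tokenize(MiniCas cas) {
        int i = 0;
        while (i < cas.text.length()) {
            while (i < cas.text.length() && Character.isWhitespace(cas.text.charAt(i))) i++;
            int begin = i;
            while (i < cas.text.length() && !Character.isWhitespace(cas.text.charAt(i))) i++;
            if (i > begin) cas.add(new Span("Token", begin, i));
        }
    }

    // Second step: reuse the tokens stored by the first step instead of re-scanning the text.
    static void markCapitalized(MiniCas cas) {
        for (Span token : cas.select("Token")) {
            if (Character.isUpperCase(cas.text.charAt(token.begin()))) {
                cas.add(new Span("Capitalized", token.begin(), token.end()));
            }
        }
    }

    public static void main(String[] args) {
        MiniCas cas = new MiniCas("Michael stars in the movie");
        tokenize(cas);
        markCapitalized(cas);
        System.out.println(cas.select("Token").size() + " tokens, "
                + cas.select("Capitalized").size() + " capitalized");
    }
}
```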
Common Analysis Structure
Which algorithms lie
    beneath AEs?
Apache UIMA & NLP
NLP (Natural Language Processing) is a
theoretically motivated range of
computational techniques for analyzing and
representing naturally occurring texts at one
or more levels of linguistic analysis for the
purpose of achieving human-like language
processing for a range of tasks or
applications

It’s an AI discipline
Apache UIMA & NLP

“accomplish human-like language processing”

  Paraphrase an input text

  Translate the text into another language

  Answer questions about the contents of
  the text

  Draw inferences from the text
Apache UIMA & NLP


“an NLP-based IR system has the goal of
providing more precise, complete information
in response to a user’s real information
need”

various levels of processing
Apache UIMA -
       Approaches

Simplest : Write RegEx and Dictionaries and
mix them together

NLP-like : Tokenize -> Sentence identification
-> PoS Tagging -> Anaphora resolution ->
Named Entities Recognition -> Coreference
Identification ...
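
A minimal sketch of the "simplest" approach, mixing one regular expression with one dictionary, in plain Java. The year pattern, the role dictionary, and the `Match` record are assumptions chosen for the movie scenario, not part of UIMA.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SimplestApproachSketch {
    record Match(String type, String text, int begin, int end) {}

    // Assumed pattern: a four-digit year between 1900 and 2099.
    static final Pattern YEAR = Pattern.compile("\\b(19|20)\\d{2}\\b");
    // Runs of letters, used to look words up in the dictionary.
    static final Pattern WORD = Pattern.compile("\\p{L}+");
    // Assumed, tiny dictionary of film-related role words.
    static final Set<String> ROLES = Set.of("director", "actor", "actress");

    static List<Match> extract(String text) {
        List<Match> matches = new ArrayList<>();
        // Regex pass: every year-like number.
        Matcher years = YEAR.matcher(text);
        while (years.find()) {
            matches.add(new Match("Year", years.group(), years.start(), years.end()));
        }
        // Dictionary pass: every word found in the role dictionary (case-insensitive).
        Matcher words = WORD.matcher(text);
        while (words.find()) {
            if (ROLES.contains(words.group().toLowerCase(Locale.ROOT))) {
                matches.add(new Match("Role", words.group(), words.start(), words.end()));
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        extract("The director shot the film in 1985.").forEach(System.out::println);
    }
}
```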
Analysis Engines in a
      Pipeline
NLP - Language Identification

  NLP takes advantage of language specific
  syntax, forms, rules and meanings

  Not easy to write language independent
  extraction algorithms

  Often this is the first block of NLP pipelines

  Techniques: Stopwords dictionaries, statistical
  models, etc.
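
The stopword-dictionary technique can be sketched as follows: count how many words of the text appear in each language's stopword list and pick the language with the most hits. The two tiny lists below are assumptions for illustration; real lists hold hundreds of words, and statistical models are usually more robust on short texts.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class LanguageGuessSketch {
    // Assumed, deliberately tiny stopword lists.
    static final Map<String, Set<String>> STOPWORDS = Map.of(
            "en", Set.of("the", "and", "of", "is", "in", "with"),
            "it", Set.of("il", "la", "e", "di", "che", "con"));

    // Returns the language whose stopwords occur most often, or "unknown" if none occur.
    // Ties between languages are broken arbitrarily in this sketch.
    static String guess(String text) {
        String[] words = text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+");
        String best = "unknown";
        int bestHits = 0;
        for (Map.Entry<String, Set<String>> entry : STOPWORDS.entrySet()) {
            int hits = 0;
            for (String word : words) {
                if (entry.getValue().contains(word)) hits++;
            }
            if (hits > bestHits) {
                bestHits = hits;
                best = entry.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(guess("The director of the movie is in town"));
        System.out.println(guess("Il regista e la protagonista del film"));
    }
}
```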
NLP - Tokens and Sentences

  Humans learn words’ meaning in order to
  understand whole context semantics

  Split the target text into words to be able to
  analyze their meaning and role

  Discover sentences to later assign roles to
  each token

  Easier for English, Italian, etc., but what
  about Chinese?
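
For languages that separate words with spaces, a first approximation of both steps is available in the Java standard library through `java.text.BreakIterator`. The sketch below is one possible way to do it, not what UIMA's own tokenizers do; the `segments` helper and the filtering rule (keep only pieces containing a letter or digit) are assumptions.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class SegmentationSketch {
    // Splits text at the boundaries reported by the given iterator,
    // keeping only pieces that contain at least one letter or digit.
    static List<String> segments(BreakIterator iterator, String text) {
        List<String> result = new ArrayList<>();
        iterator.setText(text);
        int start = iterator.first();
        for (int end = iterator.next(); end != BreakIterator.DONE;
                start = end, end = iterator.next()) {
            String piece = text.substring(start, end).trim();
            if (piece.chars().anyMatch(Character::isLetterOrDigit)) {
                result.add(piece);
            }
        }
        return result;
    }

    static List<String> sentences(String text) {
        return segments(BreakIterator.getSentenceInstance(Locale.ENGLISH), text);
    }

    static List<String> tokens(String text) {
        return segments(BreakIterator.getWordInstance(Locale.ENGLISH), text);
    }

    public static void main(String[] args) {
        String text = "The film opens in Rome. It ends in Paris.";
        for (String sentence : sentences(text)) {
            System.out.println(sentence + " -> " + tokens(sentence));
        }
    }
}
```

Text written without spaces between words generally needs dictionary-based or statistical segmentation instead of this kind of boundary rule.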
NLP - PoS Tagging
Assign a “Part of Speech” (noun, adjective,
verb, etc.) to each token generated in the
previous step

Many language/domain specific patterns can
be discovered and exploited just with
PoS-tagged tokens and sentences
NLP - Chunking & Parsing
 Parse sentences into a meaningful set or
 tree of relationships

 Chunks are the sentence building blocks (e.g.
 verbal forms)

 Parse tree highlights the structure of a
 sentence

 Can leverage logic analysis

 [Figure: chunking and parsing]
NLP - Named Entity
       Recognition
Answer the questions: where? when? who?
how often? how much?

Identify key entities in the text

Common techniques: dictionaries, rules,
statistical models
Debugging NER in UIMA
Using UIMA


Define TypeSystem

Define AnalysisEngine descriptor(s)

Implement Annotator(s)

Execute the UIMA pipeline
Sample scenario -
    extract actors
Tokenize article text

Identify sentences

Tag PoS

Identify Persons using regular expressions and PoS

Use Person annotations, Tokens’ PoS and Sentences
to extract relations between terms to identify
Persons who are also Actors
Sample scenario -
     extract persons
I have a dictionary of names (simple to find and/or build)

I use a dictionary based Annotator to extract annotations of
first names (NameAnnotation)

I don’t have a dictionary of surnames

Every time a matching name (a NameAnnotation) is found, we
look for one or more (to cover persons with a double name or
surname) subsequent tokens whose PoS is “undefined” or a
noun (but not a verb) and which start with an uppercase letter

If found, the name + token(s) sequence is annotated as a
Person (e.g. “Michael J. Fox”)
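
The rule described above can be sketched in plain Java over a list of already PoS-tagged tokens. The `Token` record, the tag names ("NOUN", "VERB", "UNDEFINED"), and the small first-name dictionary are assumptions made for the example; in UIMA the same logic would live inside an annotator reading Token and NameAnnotation annotations.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class PersonRuleSketch {
    // A token with its PoS tag; the tag names are assumptions.
    record Token(String text, String pos) {}

    // Assumed first-name dictionary.
    static final Set<String> FIRST_NAMES = Set.of("Michael", "Christopher", "Lea");

    // A token continues a name if its PoS is undefined or a noun and it is capitalized.
    static boolean continuesName(Token token) {
        boolean posOk = token.pos().equals("UNDEFINED") || token.pos().equals("NOUN");
        return posOk && !token.text().isEmpty()
                && Character.isUpperCase(token.text().charAt(0));
    }

    static List<String> persons(List<Token> tokens) {
        List<String> persons = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            if (!FIRST_NAMES.contains(tokens.get(i).text())) {
                i++;
                continue;
            }
            // A dictionary name was found: absorb the following qualifying tokens.
            StringBuilder name = new StringBuilder(tokens.get(i).text());
            int j = i + 1;
            while (j < tokens.size() && continuesName(tokens.get(j))) {
                name.append(' ').append(tokens.get(j).text());
                j++;
            }
            // Only a name followed by at least one qualifying token counts as a Person.
            if (j > i + 1) persons.add(name.toString());
            i = j;
        }
        return persons;
    }

    public static void main(String[] args) {
        List<Token> tokens = List.of(
                new Token("Michael", "NOUN"), new Token("J.", "UNDEFINED"),
                new Token("Fox", "NOUN"), new Token("stars", "VERB"),
                new Token("as", "OTHER"), new Token("Marty", "NOUN"),
                new Token("McFly", "NOUN"));
        System.out.println(persons(tokens));
    }
}
```

Note that "Marty McFly" is not found here, because "Marty" is not in the assumed dictionary; the quality of this rule depends directly on the coverage of the name list.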
From Persons to Actors
 Getting actors can be simple if we know that
 Persons who are also actors perform some well-known
 actions, or that widely used patterns exist

 e.g.: a Person “stars as” CharacterInTheMovie (which
 may itself be tagged as a Person too) when that Person
 is also an Actor

 e.g.: if the snippet “CharacterInTheMovie (Person)”
 exists, then Person is usually an Actor

 We could then build an ActorAnnotator
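
A minimal sketch of what such an ActorAnnotator might check, in plain Java and outside UIMA. It assumes Person spans have already been found, and it tests only the two patterns named above: a Person directly followed by " stars as ", and a Person wrapped in parentheses. The `Person` record and the `find` helper are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class ActorRuleSketch {
    // A previously detected Person, as a [begin, end) span over the text.
    record Person(int begin, int end) {}

    // Helper for the example: locates a known name in the text.
    static Person find(String text, String name) {
        int begin = text.indexOf(name);
        if (begin < 0) throw new IllegalArgumentException("not found: " + name);
        return new Person(begin, begin + name.length());
    }

    static List<String> actors(String text, List<Person> persons) {
        List<String> actors = new ArrayList<>();
        for (Person p : persons) {
            // Pattern 1: Person "stars as" CharacterInTheMovie.
            boolean starsAs = text.startsWith(" stars as ", p.end());
            // Pattern 2: "CharacterInTheMovie (Person)", i.e. the Person is in parentheses.
            boolean inParens = p.begin() > 0 && text.charAt(p.begin() - 1) == '('
                    && p.end() < text.length() && text.charAt(p.end()) == ')';
            if (starsAs || inParens) {
                actors.add(text.substring(p.begin(), p.end()));
            }
        }
        return actors;
    }

    public static void main(String[] args) {
        String text = "Michael J. Fox stars as Marty McFly. "
                + "Doc Brown (Christopher Lloyd) builds the machine.";
        List<Person> persons = List.of(
                find(text, "Michael J. Fox"), find(text, "Marty McFly"),
                find(text, "Doc Brown"), find(text, "Christopher Lloyd"));
        System.out.println(actors(text, persons));
    }
}
```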
1. Define TypeSystem

Define at least one Type in the Type System for each
object in the domain model

It is useful to define more fine-grained Types (for the values
of type properties, called Features)

If we want to extract information about articles, we
create an Article type inside the Type System

We’ll also need to create annotations/entities for movies,
actors, directors, etc.
2. Define AnalysisEngine descriptor

  Define which type system it’s going to use

  Define which capabilities the analysis engine
  has: which annotations it needs in order to work
  and which annotations it may generate

  Define configuration parameters for the
  underlying algorithm

  Define resources needed by the analysis
  engine
3. Implement Annotator
 create a new class extending JCasAnnotator_ImplBase

 implement the process() method that actually does the
 job

    the algorithm implementation is (called) in the
    process() method

 you can use configuration parameters/resources defined
 in the descriptor

 optionally override the initialize() and destroy() methods
DummyPersonAnnotator
4. Execute the UIMA pipeline

  Instantiate the AnalysisEngine with its
  descriptor as a parameter

  Create a CAS which will contain the text to
  be analyzed and the annotations extracted

  Run the AnalysisEngine on the given CAS

  Browse results
Execute a UIMA pipeline
What’s next


UIMA Use cases

Using UIMA in search engines

Hands on code (assignment)
References
http://www.apache.org

http://uima.apache.org

http://www.oasis-open.org

http://uima.apache.org/d/uimaj-2.3.1/index.html

http://uima.apache.org/d/uimaj-2.3.1/overview_and_setup.html#ugr.ovv.eclipse_setup

http://www.manning.com/ingersoll/

https://github.com/tteofili/samplett/tree/master/giw1011

Generative AI for Technical Writer or Information Developers
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 

Apache UIMA Introduction

  • 1. Apache UIMA Introduction Gestione delle Informazioni su Web - 2010/2011 Tommaso Teofili tommaso [at] apache [dot] org
  • 2. UIM ? Unstructured Information Management A wide topic: text, audio, video Different (possibly mixed) approaches (NLP, Machine Learning, IR, Ontologies, Automated reasoning, Knowledge Sources) Apache UIMA
  • 3. Apache Software Foundation Non-profit corporation “...provides organizational, legal, and financial support for a broad range of open source software projects...” “...collaborative and meritocratic development process...” “...pragmatic Apache License...”
  • 4. Apache UIMA Architectural framework to manage unstructured data (Java, C++, ...) Former IBM research project donated to ASF OASIS Standard for unstructured information management
  • 5. Apache UIMA - Goals “Our goal is to support a thriving community of users and developers of UIMA frameworks, tools, and annotators, facilitating the analysis of unstructured content such as text, audio and video”
  • 6. Apache UIMA - bridging worlds
  • 7. Apache UIMA - Overview UIMA supports the development, discovery, composition and deployment of multi-modal analytics for the analysis of unstructured information and its integration with search technologies
  • 8. Apache UIMA - Multimodal Analysis Multimodal Analysis means the ability to process a resource from various “points of view” Sample: a video stream from which we want to extract subtitles and also automatically recognize the actors involved Here, though, we are mainly interested in text...
  • 9. Sample scenario Content Management System containing free text articles about movies We want such articles to be automatically enriched with metadata contained inside the text (movies, directors, actors/actresses, distribution) and linked to “similar” articles (i.e.: dealing with same movies or actors) So that we can search for “similar” articles
  • 10. Sample scenario - articles about movies
  • 11. Sample scenario UIMA can help enrich articles with metadata Think of filling the instance variables of an Article.java object with proper values Then persisting it to a database to query articles dealing with the same actors
  • 13. Sample scenario - metadata
  • 15. Apache UIMA - Annotation The association of metadata, such as a label, with a region of text (or other type of artifact). For example, the label “Person” associated with a region of text “Fred Center” constitutes an annotation. We say “Person” annotates the span of text from X to Y containing exactly “Fred Center”
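The "span from X to Y" idea can be sketched in plain Java. This is only an illustration and not the UIMA Annotation class: SpanAnnotation, annotateFirst and coveredText are hypothetical names, and offsets are assumed to be begin-inclusive and end-exclusive.

```java
/** Hypothetical stand-in for an annotation: a label attached to the span [begin, end) of a text. */
final class SpanAnnotation {
    final String label;
    final int begin; // inclusive character offset (the "X" of the slide)
    final int end;   // exclusive character offset (the "Y" of the slide)

    SpanAnnotation(String label, int begin, int end) {
        if (begin < 0 || end < begin) {
            throw new IllegalArgumentException("invalid span");
        }
        this.label = label;
        this.begin = begin;
        this.end = end;
    }

    /** Returns the region of the document text that this annotation covers. */
    String coveredText(String documentText) {
        return documentText.substring(begin, end);
    }

    /** Annotates the first occurrence of phrase with label, or returns null if the phrase is absent. */
    static SpanAnnotation annotateFirst(String documentText, String phrase, String label) {
        int begin = documentText.indexOf(phrase);
        return begin < 0 ? null : new SpanAnnotation(label, begin, begin + phrase.length());
    }
}
```

Keeping only offsets and a label, and never a copy of the text, means many annotations can overlap on the same document without modifying it.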
  • 16. Apache UIMA - Basic Steps Domain model definition Analysis pipeline definition Arrange components: Define components that read data from sources Add and customize analysis components: Patterns, Dictionaries, RegEx, External services, NLP, etc... Define components that write information to target storage Analysis pipeline(s) execution
  • 17. Defining domain model within UIMA using Type Systems Type System is the place where we describe which metadata we would like to extract Low representational gap Like almost everything in UIMA: described (and generated!) using XML Possible to define multiple Type Systems for different purposes
  • 18. How does UIMA extract metadata?
  • 19. Apache UIMA - Analysis Engines Basic UIMA building blocks Analyze a document Infer and record descriptive attributes (about documents/regions) Generating analysis results
  • 20. Apache UIMA - AEs Analysis Engines are described by a descriptor (XML) Can be Primitive (a single AE) or Aggregate (a pipeline of AEs) Analysis algorithms can be switched by changing the descriptor instead of the code Contain Type System definitions Define Capabilities
  • 21. Apache UIMA - AnalysisComponent API initialize : Performs (once) any startup tasks required by this component process : Process the resource to analyze generating analysis results (metadata) destroy : Frees all resources held, called only once when it is finished using this component
  • 22. Apache UIMA - Annotators Analysis Engine algorithm Annotator : A software component implemented to produce and record annotations over regions of an artifact (e.g., text document, audio, and video) Annotators implement AnalysisComponent interface
  • 23. Apache UIMA - Roles AnalysisEngine : High level block responsible for analysis - contains at least one AnalysisComponent AnalysisComponent : interface for any component responsible for analyzing artifacts Annotator : implementation of AnalysisComponent responsible for creating Annotations
  • 25. Analysis Engines in a Pipeline
  • 26. Apache UIMA - Analysis Results Where do analysis results end up? How do annotators represent and share their results? CAS - Common Analysis Structure Maintains typed indexes of extracted results
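A rough way to picture "typed indexes of extracted results" is a container that holds the document text plus the results grouped by type name. The sketch below is a toy analogue and not the real CAS API: ToyCas, Result, add and select are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy, hypothetical analogue of a CAS: the document text plus results indexed by type name. */
final class ToyCas {
    /** One analysis result: a type name and the span [begin, end) it covers. */
    static final class Result {
        final String type;
        final int begin;
        final int end;

        Result(String type, int begin, int end) {
            this.type = type;
            this.begin = begin;
            this.end = end;
        }
    }

    private final String documentText;
    private final Map<String, List<Result>> indexByType = new HashMap<>();

    ToyCas(String documentText) {
        this.documentText = documentText;
    }

    String getDocumentText() {
        return documentText;
    }

    /** Records a result so that later components can find it by type. */
    void add(Result result) {
        indexByType.computeIfAbsent(result.type, k -> new ArrayList<>()).add(result);
    }

    /** Returns a read-only view of all results of the given type (empty if none were recorded). */
    List<Result> select(String type) {
        return Collections.unmodifiableList(
                indexByType.getOrDefault(type, Collections.<Result>emptyList()));
    }
}
```

Because every component reads from and writes to the same shared structure, a later annotator can build on the results of an earlier one without knowing which component produced them.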
  • 28. Which algorithms lie under AEs?
  • 29. Apache UIMA & NLP NLP (Natural Language Processing) is a theoretically motivated range of computational techniques for analyzing and representing naturally occurring texts at one or more levels of linguistic analysis for the purpose of achieving human-like language processing for a range of tasks or applications It’s an AI discipline
  • 30. Apache UIMA & NLP “accomplish human-like language processing” Paraphrase an input text Translate the text into another language Answer questions about the contents of the text Draw inferences from the text
  • 31. Apache UIMA & NLP “an NLP-based IR system has the goal of providing more precise, complete information in response to a user’s real information need” various levels of processing
  • 32. Apache UIMA - Approaches Simplest : Write RegEx and Dictionaries and mix them together NLP-like : Tokenize -> Sentence identification -> PoS Tagging -> Anaphora resolution -> Named Entities Recognition -> Coreference Identification ...
  • 33. Analysis Engines in a Pipeline
  • 34. NLP - Language Identification NLP takes advantage of language-specific syntax, forms, rules and meanings It is not easy to write language-independent extraction algorithms Often this is the first block of NLP pipelines Techniques: Stopword dictionaries, statistical models, etc.
  • 35. NLP - Tokens and Sentences Humans learn words’ meaning in order to understand whole context semantics Split the target text in words to be able to analyze their meaning and role Discover sentences to later assign roles to each token Easiest for English, Italian & co. but what about Chinese?
  • 36. NLP - PoS Tagging Assign a “Part of Speech” (noun, adjective, verb, etc.) to each token generated in the previous step Many language/domain specific patterns can be discovered and exploited just with PoS-tagged tokens and sentences
  • 37. NLP - Chunking & Parsing Parse sentences into a meaningful set or tree of relationships Chunks are the sentence building blocks (i.e. verbal forms) Parse tree highlights the structure of a sentence Can leverage logic analysis
  • 38. NLP - Named Entities Recognition Answer the questions: where? when? who? how often? how much? Identify key entities in the text Common techniques: dictionaries, rules, statistical models
  • 40. Using UIMA Define TypeSystem Define AnalysisEngine descriptor(s) Implement Annotator(s) Execute the UIMA pipeline
  • 41. Sample scenario - extract actors Tokenize article text Identify sentences Tag PoS Identify Persons using regular expressions and PoS Use Person annotations, Tokens’ PoS and Sentences to extract relations between terms to identify Persons who are also Actors
  • 42. Sample scenario - extract persons I have a dictionary of names (simple to find and/or build) I use a dictionary-based Annotator to extract annotations of first names (NameAnnotation) I don’t have a dictionary of surnames Every time a matching name (a NameAnnotation) is found, we look for one or more subsequent tokens (to allow for persons with a double name or surname) whose PoS is “undefined” or a noun (but not a verb) and which start with an uppercase letter If found, then the name + token(s) sequence annotates a Person (e.g. “Michael J. Fox”)
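The heuristic above can be sketched without UIMA as a scan over PoS-tagged tokens. Everything here is an assumption made for illustration: the PersonHeuristic and Token names, the tag strings "NOUN" and "UNDEFINED", and the choice to pass the first-name dictionary as a plain Set. In a real pipeline the tokens and tags would come from earlier annotators.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of the "first name + capitalized noun/undefined tokens" rule. */
final class PersonHeuristic {
    /** A token with its part-of-speech tag, as earlier pipeline steps would provide it. */
    static final class Token {
        final String text;
        final String pos;

        Token(String text, String pos) {
            this.text = text;
            this.pos = pos;
        }
    }

    /** Returns the person names found by extending each dictionary first name to the right. */
    static List<String> findPersons(List<Token> tokens, Set<String> firstNames) {
        List<String> persons = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            if (!firstNames.contains(tokens.get(i).text)) {
                i++;
                continue;
            }
            StringBuilder name = new StringBuilder(tokens.get(i).text);
            int j = i + 1;
            // Consume following tokens while they look like part of the name.
            while (j < tokens.size() && isNameContinuation(tokens.get(j))) {
                name.append(' ').append(tokens.get(j).text);
                j++;
            }
            // A first name alone is not enough: at least one following token is required.
            if (j > i + 1) {
                persons.add(name.toString());
            }
            i = j;
        }
        return persons;
    }

    /** True for capitalized tokens tagged as noun or undefined (assumed tag names). */
    private static boolean isNameContinuation(Token t) {
        boolean posOk = "NOUN".equals(t.pos) || "UNDEFINED".equals(t.pos);
        return posOk && !t.text.isEmpty() && Character.isUpperCase(t.text.charAt(0));
    }
}
```

The rule is deliberately greedy: it keeps extending the name until it meets a lowercase token or a disallowed tag, which is what lets "Michael J. Fox" come out as a single Person.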
  • 43. From Persons to Actors Getting actors can be simple if we know that Persons who are also actors perform some well-known actions, or if widely used patterns exist e.g.: a Person who “stars as” CharacterInTheMovie (which will eventually be tagged as a Person too) is also an Actor e.g.: if the snippet “CharacterInTheMovie (Person)” exists, then Person is usually an Actor Then we could build an ActorAnnotator
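The two patterns named on this slide can be sketched with regular expressions over the raw text, given the Persons already found. This is a simplified assumption, not the deck's ActorAnnotator: ActorPatterns and findActors are hypothetical names, and the parenthesis rule accepts any "(Person)" without checking that a character name precedes it.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

/** Hypothetical sketch: promote a Person to Actor when one of two textual patterns matches. */
final class ActorPatterns {
    /** Returns the persons that match "Person stars as ..." or "... (Person)", in input order. */
    static Set<String> findActors(String text, List<String> persons) {
        Set<String> actors = new LinkedHashSet<>();
        for (String person : persons) {
            // Quote the name so characters such as '.' are matched literally.
            String quoted = Pattern.quote(person);
            // Pattern 1: the person is immediately followed by "stars as".
            Pattern starsAs = Pattern.compile(quoted + "\\s+stars\\s+as\\b");
            // Pattern 2: the person appears alone inside parentheses.
            Pattern inParens = Pattern.compile("\\(\\s*" + quoted + "\\s*\\)");
            if (starsAs.matcher(text).find() || inParens.matcher(text).find()) {
                actors.add(person);
            }
        }
        return actors;
    }
}
```

Rules like these favour precision over recall: an actor mentioned in any other phrasing is missed, which is one reason to combine them with the sentence and PoS information from earlier steps.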
  • 44. 1. Define TypeSystem Define at least one Type inside the Type System for each object in the domain model It is useful to define more fine-grained Types (for the values of type properties, called Features) If we want to extract information about articles we create an Article type inside the Type System We’ll also need to create annotations/entities for movies, actors, directors, etc...
  • 45. 2. Define AnalysisEngine descriptor Define which type system it’s going to use Define which capabilities the analysis engine has: which annotations it needs in order to work and which annotations it will (possibly) generate Define configuration parameters for the underlying algorithm Define resources needed by the analysis engine
  • 46. 3. Implement Annotator Create a new class extending JCasAnnotator_ImplBase Implement the process() method that actually does the job The algorithm implementation is (called) in the process() method You can use configuration parameters/resources defined in the descriptor Optionally override the initialize() and destroy() methods
  • 48. 4. Execute the UIMA pipeline Instantiate the AnalysisEngine with its descriptor as a parameter Create a CAS which will contain the text to be analyzed and the annotations extracted Run the AnalysisEngine on the given CAS Browse results
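The execution steps above, together with the initialize / process / destroy lifecycle described earlier, can be mimicked in a few lines of plain Java. This is a toy model under stated assumptions and not the UIMA API: ToyAnalysisComponent, ToyDocument and ToyPipeline are hypothetical names, and a simple list of strings stands in for the CAS.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical component contract mirroring the initialize / process / destroy lifecycle. */
interface ToyAnalysisComponent {
    default void initialize() {}

    void process(ToyDocument doc);

    default void destroy() {}
}

/** Toy stand-in for the CAS: the text to analyze plus the results recorded so far. */
final class ToyDocument {
    final String text;
    final List<String> results = new ArrayList<>();

    ToyDocument(String text) {
        this.text = text;
    }
}

/** Runs components in order over one document, honoring the lifecycle. */
final class ToyPipeline {
    private final List<ToyAnalysisComponent> components;

    ToyPipeline(List<ToyAnalysisComponent> components) {
        this.components = new ArrayList<>(components);
    }

    ToyDocument run(String text) {
        ToyDocument doc = new ToyDocument(text); // create the container for text and results
        for (ToyAnalysisComponent c : components) {
            c.initialize(); // one-time startup work
        }
        try {
            for (ToyAnalysisComponent c : components) {
                c.process(doc); // each component sees the results of the previous ones
            }
        } finally {
            for (ToyAnalysisComponent c : components) {
                c.destroy(); // always release resources, even if processing failed
            }
        }
        return doc; // the caller browses doc.results
    }
}
```

For simplicity this sketch initializes and destroys the components on every run; a longer-lived pipeline would normally initialize once and process many documents before destroying.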
  • 49. Execute a UIMA pipeline
  • 50. What’s next UIMA Use cases Using UIMA in search engines Hands on code (assignment)