Information Extraction
                     with UIMA - Use Cases
                         Gestione delle Informazioni su Web - 2009/2010
                                          Tommaso Teofili
                                  tommaso [at] apache [dot] org




Friday, 16 April 2010
Use Cases - Agenda


                         UC1 : Real estate market analysis

                         UC2 : Automatic information extraction
                         from tenders

UC1 : Source


                         An online announcement site for sellers and
                         buyers

                         Wide purpose (cars, RE, hi-fi, etc...)

                         Local scope (Rome and nearby)




UC1 - Goals

                         Are you looking for houses?

                         A specific subcategory of the site is dedicated to
                         real estate

                         I would like to monitor the Rome real estate market to:

                           Make informed decisions

                           Predict how things will go in the (near) future

UC1 - Source
UC1 - Goals
                         I want to build a separate web application to
                         monitor such estate listings

                         I have to use a crawler to periodically and
                         automatically download selected pages from the
                         source

                         The text of the estate listings is unstructured

                         I want to run aggregate queries on structured
                         information

UC1 - Information Extraction


                         I have to write an information extraction
                         engine that populates a relational database
                         with structured information taken from the
                         free text of estate listings

UC1 - Blocks
UC1 - Crawler


                         A specialized crawler extracts data from the
                         source

                         Estate listing data is stored in files, grouped
                         by zone, in a directory on a managed machine

UC1 - Crawler

                         Site navigation is defined with one XML file
                         for each city zone

                         The crawler downloads page fragments twice
                         a week

                         The free text extracted from the estate listings
                         is saved as XML, grouped by zone

UC1 - Crawler Modules
UC1 - navigation definition
UC1 - Crawler

                         Issues:

                           Cookies had to be enabled

                           Some HTTP headers were required

                           Fixed sleep intervals were needed
                           between requests
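
The three issues above can be sketched in a few lines. This is only an illustration, not the crawler used in the project: the header values, the pause length and the function names (make_opener, http_fetch, crawl) are assumptions, since the slides do not say which headers or intervals were needed.

```python
import time
import urllib.request
from http.cookiejar import CookieJar

# Hypothetical header values; the slides do not say which headers were required.
DEFAULT_HEADERS = {
    "User-Agent": "Mozilla/5.0 (compatible; listings-monitor)",
    "Accept-Language": "it-IT,it;q=0.8",
}


def make_opener(headers=DEFAULT_HEADERS):
    """Return an opener that keeps cookies across requests and sends fixed headers."""
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(CookieJar())
    )
    opener.addheaders = list(headers.items())
    return opener


def http_fetch(opener):
    """Adapt an opener into the fetch(url) -> str callable expected by crawl()."""
    def fetch(url):
        with opener.open(url, timeout=30) as response:
            return response.read().decode("utf-8", errors="replace")
    return fetch


def crawl(urls, fetch, pause_seconds=5.0, sleep=time.sleep):
    """Fetch each URL in order, sleeping a fixed interval between requests."""
    pages = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(pause_seconds)  # fixed pause between consecutive requests
        pages[url] = fetch(url)
    return pages
```

A real run would look like crawl(zone_urls, http_fetch(make_opener())), where zone_urls is a hypothetical list of page addresses. Passing fetch and sleep as parameters keeps the loop testable without network access.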



UC1 - Domain


                         EstateListing (Announcement)

                         Zone

                         MagazineNumber (Uscita, the magazine issue)

                         HouseStructure with properties




UC1 - Information Extraction Engine
                         Goal: extract price, zone and telephone
                         number

                         The first version contained a specialized IE
                         engine built on huge regular expressions

                         Hard to maintain and inefficient

                         Extracted little information
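
To give an idea of what the regular expression approach looks like, here is a minimal sketch for two of the three fields. The patterns are illustrative assumptions, not the original expressions (which the slides do not show): the price pattern expects a euro marker followed by dot-separated thousands, and the zone pattern assumes zones are written in capital letters. The telephone number is left out because its format is not given.

```python
import re

# Illustrative patterns only; the original "huge" expressions are not in the slides.
# A price is assumed to be a euro marker followed by dot-separated thousands.
PRICE_RE = re.compile(r"(?:€|\beuro|\be)\s*(\d{1,3}(?:\.\d{3})+)", re.IGNORECASE)
# A zone is assumed to be the first word written entirely in capital letters.
ZONE_RE = re.compile(r"\b[A-Z]{3,}\b")


def extract_price(text):
    """Return the price in euro as an int, or None if no price is found."""
    match = PRICE_RE.search(text)
    if match is None:
        return None
    return int(match.group(1).replace(".", ""))


def extract_zone(text):
    """Return the first all-capitals word, or None if there is none."""
    match = ZONE_RE.search(text)
    return match.group(0) if match else None
```

Even this small version shows the maintenance problem: every new spelling of the price or the zone means another alternative inside the same expression.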



UC1 - IE Engine

                         New requirement: also extract the structure
                         of the house

                         Number of rooms, garage (box), garden(s), outdoor
                         spaces, number of bathrooms, kitchen, etc.

                         Using RegEx again proved hard to
                         maintain and inefficient

UC1 - IE Engine
                         Replace the RegEx-based IE engine with a UIMA-based
                         IE engine to:

                           exploit previous work (RegExs can live inside UIMA
                           too)

                           exploit existing components

                           be able to modify and enhance IE rules easily

                           run much more efficiently

                           extract more information
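
The idea behind the switch can be sketched without UIMA itself. UIMA is a Java framework and none of the names below belong to its API: Document is a stand-in for the shared analysis structure (the CAS), and regex_annotator, dictionary_annotator and run_pipeline are hypothetical helpers. The sketch only shows how existing regular expressions and word lists become small, independent steps that each add annotations to the same document.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Annotation:
    type: str
    begin: int
    end: int
    value: str


@dataclass
class Document:
    """Stand-in for a UIMA CAS: the text plus the annotations added so far."""
    text: str
    annotations: list = field(default_factory=list)

    def add(self, type_, match, value=None):
        self.annotations.append(
            Annotation(type_, match.start(), match.end(), value or match.group(0))
        )


def regex_annotator(type_, pattern):
    """Wrap an existing regular expression as a pipeline step."""
    compiled = re.compile(pattern, re.IGNORECASE)

    def annotate(doc):
        for match in compiled.finditer(doc.text):
            doc.add(type_, match)
    return annotate


def dictionary_annotator(type_, words):
    """Annotate every occurrence of a known word (for example room names)."""
    pattern = r"\b(?:" + "|".join(re.escape(word) for word in words) + r")\b"
    return regex_annotator(type_, pattern)


def run_pipeline(text, annotators):
    """Run each annotator in order over one shared document."""
    doc = Document(text)
    for annotate in annotators:
        annotate(doc)
    return doc


# Hypothetical pipeline: one reused regular expression plus one word list.
PIPELINE = [
    regex_annotator("Price", r"\d{1,3}(?:\.\d{3})+"),
    dictionary_annotator("Room", ["salone", "cucina", "camera", "cameretta", "bagno"]),
]
```

Adding a rule then means appending one more annotator to the list instead of editing a single large expression.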


UC1 - Analysis pipeline
UC1 - TypeSystem
Crawled XML
Sample text


                         “ven 26 Dic APPIA via grottaferrata metro 2°
                         piano assolato ingresso salone americana
                         cucina camera cameretta bagno soppalco
                         posto auto € 295.000”




UC1 - ContentAnnotator

                         Only the estate listings must be extracted
                         from the XML produced by the crawler

                         A simple parser gets each node containing
                         an estate listing (whose text is in turn
                         unstructured)

                         A ContentAnnotation is created over the
                         document
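
A minimal sketch of this step, assuming a crawler output of the form <zone name="..."><listing>...</listing></zone>. The element and attribute names are guesses, because the crawled XML appears in the slides only as an image. ContentAnnotation here is a plain Python class holding offsets, not the UIMA type.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class ContentAnnotation:
    zone: str
    begin: int  # offset of the listing inside the returned document text
    end: int


def annotate_content(xml_text, listing_tag="listing"):
    """Return (document_text, annotations) with one annotation per listing node."""
    root = ET.fromstring(xml_text)
    zone = root.get("name", "")  # hypothetical attribute carrying the zone name
    parts, annotations, offset = [], [], 0
    for node in root.iter(listing_tag):
        text = " ".join((node.text or "").split())  # collapse runs of whitespace
        if not text:
            continue  # skip empty listing nodes
        annotations.append(ContentAnnotation(zone, offset, offset + len(text)))
        parts.append(text)
        offset += len(text) + 1  # +1 for the newline that joins the parts
    return "\n".join(parts), annotations
```

Each annotation marks where one listing starts and ends, so later annotators can work on one listing at a time.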



ContentAnnotation
UC1 - ACAnnotator
UC1 - Entities
ZoneAnnotator - Dictionary & RegEx
ZoneAnnotator - Learning dictionaries
UC1 - ZoneAnnotation
UC1 - Consuming extracted information
                         The previous version of the IE engine
                         produced (again) XML files that had to be
                         parsed to store structured data in the
                         DB

                         With UIMA, a CAS Consumer at the end of
                         the analysis pipeline can automatically write
                         the extracted information to the DB
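
The consumer step can be pictured as a function that receives the extracted values and writes them straight to the database. This sketch uses SQLite from the Python standard library instead of the project's database, and the estate_listing table with its columns is an assumed schema, not the one from the slides.

```python
import sqlite3

# Assumed schema; the real relational schema is not shown in the slides.
SCHEMA = """
CREATE TABLE IF NOT EXISTS estate_listing (
    id INTEGER PRIMARY KEY,
    zone TEXT NOT NULL,
    price INTEGER,
    rooms INTEGER
)
"""


def consume(connection, extracted):
    """Insert one row per analysed listing and return how many were stored."""
    connection.execute(SCHEMA)
    rows = [(item["zone"], item.get("price"), item.get("rooms")) for item in extracted]
    with connection:  # commits on success, rolls back if an insert fails
        connection.executemany(
            "INSERT INTO estate_listing (zone, price, rooms) VALUES (?, ?, ?)", rows
        )
    return len(rows)
```

Compared with the earlier design, there is no intermediate XML file to write and parse again: the values go from the analysis results to the table in one step.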



UC1 - Analyzing real estate market data

                         A simple web app written in Java with Spring
                         Framework modules (Spring Core, DAO, JDBC,
                         MVC) that queries aggregate data in a MySQL DB

                         UI enriched with jQuery
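
As an example of the kind of aggregate query such a web app could run, the sketch below computes the number of listings and the average price per zone. It uses SQLite from the Python standard library rather than Java, Spring and MySQL, and the estate_listing table is an assumed schema.

```python
import sqlite3

# Assumed table layout, used here only to make the query runnable.
TABLE_DDL = """
CREATE TABLE IF NOT EXISTS estate_listing (
    id INTEGER PRIMARY KEY,
    zone TEXT NOT NULL,
    price INTEGER
)
"""


def average_price_by_zone(connection):
    """Return [(zone, listing_count, average_price)] sorted by zone name."""
    query = """
        SELECT zone, COUNT(*), AVG(price)
        FROM estate_listing
        WHERE price IS NOT NULL
        GROUP BY zone
        ORDER BY zone
    """
    return connection.execute(query).fetchall()
```

Queries like this are only possible once the free text has been turned into columns, which is the reason for the extraction step.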




UC1 - Analysis Graphs
UC2 - Monitor of tenders/announcements
                         Monitor various sources that publish
                         announcements and tenders to which interested
                         people and companies can subscribe

                         We want to automate the lengthy process of
                         monitoring such sources and also
                         automatically extract useful common
                         information from the announcements’ text

UC2 - Blocks
Different input texts
UC2 - Crawling
                         Similar to the UC1 crawler, but a Firefox
                         plugin lets us define navigation patterns for
                         the pages of each source

                         We can also mark metadata seen during
                         navigation that carries useful information

                         Again an XML file is generated, so that it
                         can be saved to storage and executed
                         periodically

UC2 - Defining navigation
UC2 - Domain annotations
                         Language           Funding type

                         Abstract           Geographic region

                         Activity           Sector

                         Beneficiary        Subject

                         Budget             Title

                         Expiration date    Tags

UC2 - Domain entities


                         First and most important is an entity that
                         represents the entire tender or
                         announcement

                         The domain annotations eventually fill the
                         properties of that entity

UC2 - Pipeline
UC2 - Simple first

                         Each annotator first checks:

                            whether some metadata was extracted during navigation

                            for the most common pattern used to state the
                            information inside such announcements

                         e.g.: “Budget: 200000$” or “Financial information: ......”

                         Such patterns are assumed to be language independent
                         (although this often does not hold)
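
The "simple first" rule can be sketched as a lookup with a fallback. The function and variable names, the metadata key "budget" and the label list are all assumptions made for illustration. A real setup would likely need one label list per language, which is where the language-independence assumption breaks down.

```python
import re


def label_pattern(labels):
    """Build a 'Label: value' pattern that matches any of the given labels."""
    alternatives = "|".join(re.escape(label) for label in labels)
    return re.compile(
        rf"^\s*(?:{alternatives})\s*:\s*(.+?)\s*$",
        re.IGNORECASE | re.MULTILINE,
    )


# Hypothetical label list for the budget field.
BUDGET_RE = label_pattern(["Budget", "Financial information"])


def find_budget(text, metadata=None):
    """Prefer metadata captured during navigation, then fall back to the text pattern."""
    if metadata and metadata.get("budget"):
        return metadata["budget"]
    match = BUDGET_RE.search(text)
    return match.group(1) if match else None
```

Only when both checks fail does an annotator need to fall back on heavier analysis of the sentence structure.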



UC2 - AbstractAnnotator
                         The abstract is usually in the first part of the
                         document

                         We use a Tokenizer and a Tagger to get Tokens (with
                         PoS tags) and Sentences

                         We use a Dictionary to provide a list of “good”
                         words

                         We search the first sentences of the document
                         for the objectives of the announcement
                         (combining good words and regular expressions)
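
A rough sketch of the same idea, under strong simplifications: a naive regular expression stands in for the Tokenizer and Tagger, PoS tags are ignored, and the list of "good" words is invented for the example because the slides do not list the real dictionary.

```python
import re

# Hypothetical "good" words; the real dictionary is not shown in the slides.
GOOD_WORDS = {"aim", "objective", "objectives", "purpose", "support", "fund"}


def split_sentences(text):
    """Naive sentence splitter standing in for a real tokenizer and tagger."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def find_abstract(text, good_words=GOOD_WORDS, window=3):
    """Return the sentence among the first `window` with the most good words.

    Returns None when none of those sentences contains a good word.
    """
    best, best_score = None, 0
    for sentence in split_sentences(text)[:window]:
        tokens = re.findall(r"[a-z]+", sentence.lower())
        score = sum(token in good_words for token in tokens)
        if score > best_score:
            best, best_score = sentence, score
    return best
```

The window parameter encodes the observation that the abstract is usually near the top of the document.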


UC2 - ExpirationDateAnnotator

                         A DateAnnotator runs first

                         Iterate over the DateAnnotations

                         Get the sentences that contain such DateAnnotations

                         Check whether terms like “deadline” appear in
                         the same sentence as a DateAnnotation
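
The four steps above can be sketched as follows. The date format (dd/mm/yyyy), the cue words other than "deadline" and the naive sentence splitter are assumptions. Span is a plain offset pair, not a UIMA annotation type.

```python
import re
from dataclasses import dataclass


@dataclass
class Span:
    begin: int
    end: int


# Hypothetical date format and cue words; only "deadline" comes from the slides.
DATE_RE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")
CUE_WORDS = ("deadline", "expires", "no later than")


def annotate_dates(text):
    """Step 1: the date annotator runs first and marks every date."""
    return [Span(m.start(), m.end()) for m in DATE_RE.finditer(text)]


def annotate_sentences(text):
    """Naive sentence spans: text up to and including ., ! or ?"""
    return [Span(m.start(), m.end()) for m in re.finditer(r"[^.!?]+[.!?]?", text)]


def find_expiration_dates(text):
    """Return the dates whose enclosing sentence contains a cue word."""
    sentences = annotate_sentences(text)
    results = []
    for date in annotate_dates(text):              # step 2: iterate over dates
        for sentence in sentences:                 # step 3: find the covering sentence
            if sentence.begin <= date.begin and date.end <= sentence.end:
                covered = text[sentence.begin:sentence.end].lower()
                if any(cue in covered for cue in CUE_WORDS):  # step 4: cue check
                    results.append(text[date.begin:date.end])
                break
    return results
```

Dates written with dots would confuse this sentence splitter, which is one reason a real pipeline relies on a proper sentence annotator.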



Date patterns
ExpirationDateAnnotator
GeographicRegionAnnotator
UC2 - ActivityAnnotator
Conclusions on IE
                         UC1 : simple and stable sentence patterns

                         UC2 : multi-language, with much more complex
                         and varied sentence structures and
                         patterns

                         Fine-grained metadata is very important

                         Need to experiment with NLP

                         Need to establish good test cases


venerdì 16 aprile 2010

More Related Content

Viewers also liked

OUTDATED Text Mining 5/5: Information Extraction
OUTDATED Text Mining 5/5: Information ExtractionOUTDATED Text Mining 5/5: Information Extraction
OUTDATED Text Mining 5/5: Information ExtractionFlorian Leitner
 
Textmining Information Extraction
Textmining Information ExtractionTextmining Information Extraction
Textmining Information Extractionguest0edcaf
 
IRE- Algorithm Name Detection in Research Papers
IRE- Algorithm Name Detection in Research PapersIRE- Algorithm Name Detection in Research Papers
IRE- Algorithm Name Detection in Research PapersSriTeja Allaparthi
 
Multimodal Information Extraction: Disease, Date and Location Retrieval
Multimodal Information Extraction: Disease, Date and Location RetrievalMultimodal Information Extraction: Disease, Date and Location Retrieval
Multimodal Information Extraction: Disease, Date and Location RetrievalSvitlana volkova
 
Mining Product Synonyms - Slides
Mining Product Synonyms - SlidesMining Product Synonyms - Slides
Mining Product Synonyms - SlidesAnkush Jain
 
Web Information Extraction Learning based on Probabilistic Graphical Models
Web Information Extraction Learning based on Probabilistic Graphical ModelsWeb Information Extraction Learning based on Probabilistic Graphical Models
Web Information Extraction Learning based on Probabilistic Graphical ModelsGUANBO
 
Group-13 Project 15 Sub event detection on social media
Group-13 Project 15 Sub event detection on social mediaGroup-13 Project 15 Sub event detection on social media
Group-13 Project 15 Sub event detection on social mediaAhmedali Durga
 
System for-health-diagnosis
System for-health-diagnosisSystem for-health-diagnosis
System for-health-diagnosisask2372
 
A survey of_eigenvector_methods_for_web_information_retrieval
A survey of_eigenvector_methods_for_web_information_retrievalA survey of_eigenvector_methods_for_web_information_retrieval
A survey of_eigenvector_methods_for_web_information_retrievalChen Xi
 
Information_retrieval_and_extraction_IIIT
Information_retrieval_and_extraction_IIITInformation_retrieval_and_extraction_IIIT
Information_retrieval_and_extraction_IIITAnkit Sharma
 
Information extraction for Free Text
Information extraction for Free TextInformation extraction for Free Text
Information extraction for Free Textbutest
 
Open Information Extraction 2nd
Open Information Extraction 2ndOpen Information Extraction 2nd
Open Information Extraction 2ndhit_alex
 
Information Retrieval and Extraction
Information Retrieval and ExtractionInformation Retrieval and Extraction
Information Retrieval and ExtractionChristopher Frenz
 
Algorithm Name Detection & Extraction
Algorithm Name Detection & ExtractionAlgorithm Name Detection & Extraction
Algorithm Name Detection & ExtractionDeeksha thakur
 
ATI Courses Professional Development Short Course Remote Sensing Information ...
ATI Courses Professional Development Short Course Remote Sensing Information ...ATI Courses Professional Development Short Course Remote Sensing Information ...
ATI Courses Professional Development Short Course Remote Sensing Information ...Jim Jenkins
 
Information Extraction from the Web - Algorithms and Tools
Information Extraction from the Web - Algorithms and ToolsInformation Extraction from the Web - Algorithms and Tools
Information Extraction from the Web - Algorithms and ToolsBenjamin Habegger
 
Enterprise information extraction: recent developments and open challenges
Enterprise information extraction: recent developments and open challengesEnterprise information extraction: recent developments and open challenges
Enterprise information extraction: recent developments and open challengesYunyao Li
 

Viewers also liked (20)

OUTDATED Text Mining 5/5: Information Extraction
OUTDATED Text Mining 5/5: Information ExtractionOUTDATED Text Mining 5/5: Information Extraction
OUTDATED Text Mining 5/5: Information Extraction
 
Textmining Information Extraction
Textmining Information ExtractionTextmining Information Extraction
Textmining Information Extraction
 
Web Information Retrieval and Mining
Web Information Retrieval and MiningWeb Information Retrieval and Mining
Web Information Retrieval and Mining
 
IRE- Algorithm Name Detection in Research Papers
IRE- Algorithm Name Detection in Research PapersIRE- Algorithm Name Detection in Research Papers
IRE- Algorithm Name Detection in Research Papers
 
Multimodal Information Extraction: Disease, Date and Location Retrieval
Multimodal Information Extraction: Disease, Date and Location RetrievalMultimodal Information Extraction: Disease, Date and Location Retrieval
Multimodal Information Extraction: Disease, Date and Location Retrieval
 
[EN] Capture Indexing & Auto-Classification | DLM Forum Industry Whitepaper 0...
[EN] Capture Indexing & Auto-Classification | DLM Forum Industry Whitepaper 0...[EN] Capture Indexing & Auto-Classification | DLM Forum Industry Whitepaper 0...
[EN] Capture Indexing & Auto-Classification | DLM Forum Industry Whitepaper 0...
 
Mining Product Synonyms - Slides
Mining Product Synonyms - SlidesMining Product Synonyms - Slides
Mining Product Synonyms - Slides
 
Web Information Extraction Learning based on Probabilistic Graphical Models
Web Information Extraction Learning based on Probabilistic Graphical ModelsWeb Information Extraction Learning based on Probabilistic Graphical Models
Web Information Extraction Learning based on Probabilistic Graphical Models
 
Group-13 Project 15 Sub event detection on social media
Group-13 Project 15 Sub event detection on social mediaGroup-13 Project 15 Sub event detection on social media
Group-13 Project 15 Sub event detection on social media
 
System for-health-diagnosis
System for-health-diagnosisSystem for-health-diagnosis
System for-health-diagnosis
 
A survey of_eigenvector_methods_for_web_information_retrieval
A survey of_eigenvector_methods_for_web_information_retrievalA survey of_eigenvector_methods_for_web_information_retrieval
A survey of_eigenvector_methods_for_web_information_retrieval
 
Information_retrieval_and_extraction_IIIT
Information_retrieval_and_extraction_IIITInformation_retrieval_and_extraction_IIIT
Information_retrieval_and_extraction_IIIT
 
Information extraction for Free Text
Information extraction for Free TextInformation extraction for Free Text
Information extraction for Free Text
 
Open Information Extraction 2nd
Open Information Extraction 2ndOpen Information Extraction 2nd
Open Information Extraction 2nd
 
Information Retrieval and Extraction
Information Retrieval and ExtractionInformation Retrieval and Extraction
Information Retrieval and Extraction
 
Algorithm Name Detection & Extraction
Algorithm Name Detection & ExtractionAlgorithm Name Detection & Extraction
Algorithm Name Detection & Extraction
 
ATI Courses Professional Development Short Course Remote Sensing Information ...
ATI Courses Professional Development Short Course Remote Sensing Information ...ATI Courses Professional Development Short Course Remote Sensing Information ...
ATI Courses Professional Development Short Course Remote Sensing Information ...
 
2 13
2 132 13
2 13
 
Information Extraction from the Web - Algorithms and Tools
Information Extraction from the Web - Algorithms and ToolsInformation Extraction from the Web - Algorithms and Tools
Information Extraction from the Web - Algorithms and Tools
 
Enterprise information extraction: recent developments and open challenges
Enterprise information extraction: recent developments and open challengesEnterprise information extraction: recent developments and open challenges
Enterprise information extraction: recent developments and open challenges
 

More from Tommaso Teofili

Affect Enriched Word Embeddings for News IR
Affect Enriched Word Embeddings for News IRAffect Enriched Word Embeddings for News IR
Affect Enriched Word Embeddings for News IRTommaso Teofili
 
Flexible search in Apache Jackrabbit Oak
Flexible search in Apache Jackrabbit OakFlexible search in Apache Jackrabbit Oak
Flexible search in Apache Jackrabbit OakTommaso Teofili
 
Data replication in Sling
Data replication in SlingData replication in Sling
Data replication in SlingTommaso Teofili
 
Search engines in the industry
Search engines in the industrySearch engines in the industry
Search engines in the industryTommaso Teofili
 
Scaling search in Oak with Solr
Scaling search in Oak with Solr Scaling search in Oak with Solr
Scaling search in Oak with Solr Tommaso Teofili
 
Text categorization with Lucene and Solr
Text categorization with Lucene and SolrText categorization with Lucene and Solr
Text categorization with Lucene and SolrTommaso Teofili
 
Machine learning with Apache Hama
Machine learning with Apache HamaMachine learning with Apache Hama
Machine learning with Apache HamaTommaso Teofili
 
Adapting Apache UIMA to OSGi
Adapting Apache UIMA to OSGiAdapting Apache UIMA to OSGi
Adapting Apache UIMA to OSGiTommaso Teofili
 
Domeo, Text Mining, UIMA and Clerezza
Domeo, Text Mining, UIMA and ClerezzaDomeo, Text Mining, UIMA and Clerezza
Domeo, Text Mining, UIMA and ClerezzaTommaso Teofili
 
Natural Language Search in Solr
Natural Language Search in SolrNatural Language Search in Solr
Natural Language Search in SolrTommaso Teofili
 
Apache Solr crash course
Apache Solr crash courseApache Solr crash course
Apache Solr crash courseTommaso Teofili
 
OSS Enterprise Search EU Tour
OSS Enterprise Search EU TourOSS Enterprise Search EU Tour
OSS Enterprise Search EU TourTommaso Teofili
 
Apache Solr - Enterprise search platform
Apache Solr - Enterprise search platformApache Solr - Enterprise search platform
Apache Solr - Enterprise search platformTommaso Teofili
 

More from Tommaso Teofili (14)

Affect Enriched Word Embeddings for News IR
Affect Enriched Word Embeddings for News IRAffect Enriched Word Embeddings for News IR
Affect Enriched Word Embeddings for News IR
 
Flexible search in Apache Jackrabbit Oak
Flexible search in Apache Jackrabbit OakFlexible search in Apache Jackrabbit Oak
Flexible search in Apache Jackrabbit Oak
 
Data replication in Sling
Data replication in SlingData replication in Sling
Data replication in Sling
 
Search engines in the industry
Search engines in the industrySearch engines in the industry
Search engines in the industry
 
Scaling search in Oak with Solr
Scaling search in Oak with Solr Scaling search in Oak with Solr
Scaling search in Oak with Solr
 
Text categorization with Lucene and Solr
Text categorization with Lucene and SolrText categorization with Lucene and Solr
Text categorization with Lucene and Solr
 
Machine learning with Apache Hama
Machine learning with Apache HamaMachine learning with Apache Hama
Machine learning with Apache Hama
 
Adapting Apache UIMA to OSGi
Adapting Apache UIMA to OSGiAdapting Apache UIMA to OSGi
Adapting Apache UIMA to OSGi
 
Oak / Solr integration
Oak / Solr integrationOak / Solr integration
Oak / Solr integration
 
Domeo, Text Mining, UIMA and Clerezza
Domeo, Text Mining, UIMA and ClerezzaDomeo, Text Mining, UIMA and Clerezza
Domeo, Text Mining, UIMA and Clerezza
 
Natural Language Search in Solr
Natural Language Search in SolrNatural Language Search in Solr
Natural Language Search in Solr
 
Apache Solr crash course
Apache Solr crash courseApache Solr crash course
Apache Solr crash course
 
OSS Enterprise Search EU Tour
OSS Enterprise Search EU TourOSS Enterprise Search EU Tour
OSS Enterprise Search EU Tour
 
Apache Solr - Enterprise search platform
Apache Solr - Enterprise search platformApache Solr - Enterprise search platform
Apache Solr - Enterprise search platform
 

Recently uploaded

TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesThousandEyes
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...panagenda
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesKari Kakkonen
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rick Flair
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 

Recently uploaded (20)

TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
Information Extraction with UIMA - Usecases

  • 1. Information Extraction with UIMA - Use Cases. Gestione delle Informazioni su Web - 2009/2010. Tommaso Teofili, tommaso [at] apache [dot] org. Friday, 16 April 2010
  • 2. Use Cases - Agenda. UC1: real estate market analysis. UC2: automatic information extraction from tenders.
  • 3. UC1: Source. An online announcement site for sellers and buyers; wide purpose (cars, real estate, hi-fi, etc.); local scope (Rome and nearby).
  • 4. UC1 - Goals. Are you looking for houses? A specific subcategory of the site is dedicated to real estate. I would like to monitor the Rome real estate market to make smart decisions and predict how things will go in the (near) future.
  • 5. UC1 - Source
  • 6. UC1 - Goals. I want to build a separate web application to monitor such estate listings. I have to use a crawler to periodically and automatically download selected pages from the source. The text of estate listings is unstructured. I want to run aggregate queries on structured information.
  • 7. UC1 - Information Extraction. I have to write an information extraction engine that populates a relational database with structured information taken from the free text of estate listings.
  • 8. UC1 - Blocks
  • 9. UC1 - Crawler. A specialized crawler extracts data from the source. Estate listing data are stored in files, grouped by zone, in a directory on a managed machine.
  • 10. UC1 - Crawler. Site navigation is defined with one XML file for each city zone. The crawler downloads page fragments twice a week. The free text extracted from estate listings is saved as XML, grouped by zone.
  • 11. UC1 - Crawler Modules
  • 12. UC1 - navigation definition
  • 13. UC1 - Crawler. Issues: cookies must be enabled; some HTTP headers are required; fixed sleep intervals are needed between requests.
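The three crawler issues on this slide can be sketched with the Python standard library. This is only an illustration of the idea, not the crawler used in the talk: `make_opener`, `crawl`, the header values and the delay are all hypothetical names and values.

```python
import time
import urllib.request
from http.cookiejar import CookieJar


def make_opener(headers):
    """Build an opener that keeps cookies across requests and always
    sends the given HTTP headers (hypothetical helper)."""
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(CookieJar())
    )
    opener.addheaders = list(headers.items())
    return opener


def crawl(urls, fetch, delay_seconds=5.0, sleep=time.sleep):
    """Fetch each URL in order, pausing a fixed interval between requests.

    `fetch` is any callable taking a URL and returning the page content;
    `sleep` is injectable so the pause can be observed without waiting.
    """
    pages = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # fixed pause before every request but the first
        pages.append(fetch(url))
    return pages
```

A real run would pass something like `lambda url: opener.open(url).read()` as `fetch`, where `opener` comes from `make_opener`; which headers the source site actually required is not stated in the slides.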
  • 14. UC1 - Domain. EstateListing (Announcement); Zone; MagazineNumber (Uscita, the magazine issue); HouseStructure with properties.
  • 15. UC1 - Information Extraction Engine. Goal: extract price, zone and telephone number. The first version contained a specialized IE engine built on huge regular expressions, which was hard to maintain, inefficient, and extracted little information.
  • 16. UC1 - IE Engine. New requirement: also extract the structure of the house (number of rooms, box (garage), garden(s), external spaces, number of bathrooms, kitchen, etc.). Using regular expressions again proved hard to maintain and inefficient.
  • 17. UC1 - IE Engine. Replace the RegEx-based IE engine with a UIMA-based IE engine in order to: exploit previous work (RegExs can live inside UIMA too); exploit existing components; be able to modify and enhance IE rules easily; be much more efficient; extract more information.
  • 18. UC1 - Analysis pipeline
  • 19. UC1 - TypeSystem
  • 20. Crawled XML
  • 21. Sample text: “ven 26 Dic APPIA via grottaferrata metro 2° piano assolato ingresso salone americana cucina camera cameretta bagno soppalco posto auto e 295.000”
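As a rough illustration of what annotating this sample text could look like, here is a minimal sketch that combines a zone dictionary with a price regular expression and records each hit as a typed span, loosely mirroring the begin/end offsets of an annotation. It is plain Python, not the UIMA API; the `Annotation` class, the `ZONES` entries other than APPIA, and the assumption that the "e" before the amount marks a price in euros are all hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class Annotation:
    type: str      # e.g. "Zone" or "Price"
    begin: int     # start offset in the listing text
    end: int       # end offset (exclusive)
    value: object  # normalized value


# Hypothetical zone dictionary; only APPIA comes from the sample text.
ZONES = {"APPIA", "TUSCOLANA", "PRATI"}

# Assumption: the price is written as "e 295.000" (or "euro"/"€"),
# with dots as thousands separators.
PRICE_RE = re.compile(r"(?:^|\s)(?:e|euro|€)\s*(\d{1,3}(?:\.\d{3})+)", re.IGNORECASE)

SAMPLE = ("ven 26 Dic APPIA via grottaferrata metro 2° piano assolato "
          "ingresso salone americana cucina camera cameretta bagno "
          "soppalco posto auto e 295.000")


def annotate(text):
    """Return Zone and Price annotations found in one listing."""
    annotations = []
    for token in re.finditer(r"\S+", text):
        if token.group() in ZONES:
            annotations.append(
                Annotation("Zone", token.start(), token.end(), token.group()))
    price = PRICE_RE.search(text)
    if price:
        amount = int(price.group(1).replace(".", ""))
        annotations.append(
            Annotation("Price", price.start(1), price.end(1), amount))
    return annotations
```

Calling `annotate(SAMPLE)` yields one Zone annotation over "APPIA" and one Price annotation with the value 295000.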
  • 22. UC1 - ContentAnnotator. Only estate listings must be extracted from the XML produced by the crawler. A simple parser gets each node containing an estate listing (whose text is, in turn, unstructured) and creates a ContentAnnotation over the document.
  • 25. UC1 - ACAnnotator
  • 26. UC1 - Entities
  • 27. ZoneAnnotator - Dictionary & RegEx
  • 28. ZoneAnnotator - Learning dictionaries
  • 30. UC1 - Consuming extracted information. The previous version of the IE engine produced (again) XML files that had to be parsed to store structured data in the DB. With UIMA, a CAS Consumer at the end of the analysis pipeline can automatically write the extracted information to the DB.
  • 31. UC1 - Analyzing real estate market data. A simple web app written in Java with Spring framework modules (Spring core, DAO, JDBC, MVC); aggregate queries on a MySQL DB; UI enriched with jQuery.
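The step from extracted information to aggregate queries can be sketched as follows. This is an assumption-laden stand-in: it uses Python's built-in `sqlite3` instead of the MySQL and Spring JDBC stack named in the slides, and the `estate_listing` table, its columns and both function names are hypothetical.

```python
import sqlite3


def store_listings(conn, listings):
    """Persist extracted (zone, price, rooms) tuples, playing the role
    the slides assign to the consumer at the end of the pipeline."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS estate_listing "
        "(zone TEXT, price INTEGER, rooms INTEGER)"
    )
    conn.executemany(
        "INSERT INTO estate_listing (zone, price, rooms) VALUES (?, ?, ?)",
        listings,
    )
    conn.commit()


def average_price_by_zone(conn):
    """Aggregate query: average price and number of listings per zone."""
    rows = conn.execute(
        "SELECT zone, AVG(price), COUNT(*) "
        "FROM estate_listing GROUP BY zone ORDER BY zone"
    )
    return [(zone, average, count) for zone, average, count in rows]
```

With an in-memory database (`sqlite3.connect(":memory:")`) and a few made-up rows, `average_price_by_zone` returns one tuple per zone, which is the kind of result the analysis graphs would be drawn from.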
  • 32. UC1 - Analysis Graphs
  • 33. UC1 - Analysis Graphs
  • 34. UC2 - Monitor of tenders/announcements. Monitor various sources that publish announcements and tenders to which interested people and companies can subscribe. We want to automate the long process of monitoring such sources and also automatically extract useful common information from the text of the announcements.
  • 35. UC2 - Blocks
  • 40. UC2 - Crawling. Similar to the UC1 crawler, but a Firefox plugin lets us define navigation patterns for the pages of each source. We can also define metadata seen during navigation that carries information. Again, an XML file is generated so that it can be saved to storage and executed periodically.
  • 41. UC2 - Defining navigation
  • 42. UC2 - Domain annotations Language Funding type Abstract Geographic region Activity Sector Beneficiary Subject Budget Title Expiration date Tags
  • 43. UC2 - Domain entities. First and most important is an entity that represents the entire tender or announcement. The annotations in the domain eventually fill the properties of that entity.
  • 44. UC2 - Pipeline
  • 45. UC2 - Simple first. Each annotator first checks: whether some metadata was extracted during navigation; the most common pattern used to state the information inside such announcements, e.g. “Budget: 200000$” or “Financial information: ......”. Such patterns are assumed to be language independent (although this is often not true).
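The "simple first" strategy can be sketched as a two-step lookup: use metadata captured during navigation if present, otherwise scan the text for a "Label: value" line. The function name `extract`, the field key and the label list are hypothetical; only the "Budget: 200000$" example comes from the slide.

```python
import re

# One "Label: value" pair per line; labels are capped at 40 characters
# (an arbitrary limit chosen for this sketch).
LABEL_VALUE_RE = re.compile(
    r"^[ \t]*(?P<label>[^:\n]{1,40}):[ \t]*(?P<value>.+?)[ \t]*$",
    re.MULTILINE,
)


def extract(field, labels, text, metadata):
    """Return the value for `field`, trying cheap sources first.

    1. Metadata collected during navigation wins if it has the field.
    2. Otherwise look for a line whose label is one of `labels`.
    """
    if field in metadata:
        return metadata[field]
    wanted = {label.lower() for label in labels}
    for match in LABEL_VALUE_RE.finditer(text):
        if match.group("label").strip().lower() in wanted:
            return match.group("value")
    return None  # a more sophisticated annotator would take over here
```

For a text containing the line `Budget: 200000$`, `extract("budget", ["Budget", "Financial information"], text, {})` returns `"200000$"`. Because the labels are fixed strings, the lookup is only language independent if every source uses the same labels, which is the caveat the slide raises.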
  • 46. UC2 - AbstractAnnotator. The abstract is usually in the first part of the document. We use a Tokenizer and a Tagger to get Tokens (with PoS tags) and Sentences, and a Dictionary to provide a list of “good” words. We then search the first sentences of the document for the objectives of the announcement (mixing good words and regular expressions).
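A much simplified sketch of this heuristic follows. It swaps the Tokenizer and Tagger for naive regular expressions and ignores PoS tags entirely, so it only shows the "good words in the first sentences" idea; `GOOD_WORDS`, `window` and the function names are hypothetical.

```python
import re

# Hypothetical dictionary of "good" words hinting at an objective.
GOOD_WORDS = {"aim", "objective", "objectives", "purpose", "support"}


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def find_abstract(text, good_words=GOOD_WORDS, window=3):
    """Return the first of the opening `window` sentences that contains
    a good word, or None if none of them does."""
    for sentence in split_sentences(text)[:window]:
        tokens = {t.lower() for t in re.findall(r"[A-Za-z]+", sentence)}
        if tokens & good_words:
            return sentence
    return None
```

Restricting the search to the first few sentences encodes the slide's observation that the abstract usually sits at the start of the document.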
  • 47. UC2 - ExpirationDateAnnotator. A DateAnnotator is executed first. Then: iterate over the DateAnnotations; get the sentences wrapping those DateAnnotations; check whether terms like “deadline” appear in the same sentence as a DateAnnotation.
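The steps on this slide can be sketched in plain Python. The date format (dd/mm/yyyy), the naive sentence splitter and every term in `DEADLINE_TERMS` other than "deadline" are assumptions made for the example; the real pipeline would rely on existing date and sentence annotations instead.

```python
import re

DATE_RE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")   # assumed date format
SENTENCE_RE = re.compile(r"[^.!?]+[.!?]?")           # naive sentence spans
DEADLINE_TERMS = ("deadline", "expires", "no later than")  # hypothetical, except "deadline"


def find_expiration_dates(text):
    """Return the dates that share a sentence with a deadline term."""
    sentences = [(m.start(), m.end()) for m in SENTENCE_RE.finditer(text)]
    results = []
    for date in DATE_RE.finditer(text):            # iterate over date spans
        for begin, end in sentences:               # find the wrapping sentence
            if begin <= date.start() and date.end() <= end:
                sentence = text[begin:end].lower()
                if any(term in sentence for term in DEADLINE_TERMS):
                    results.append(date.group())
                break
    return results
```

Given a text with a publication date in one sentence and a deadline in another, only the date in the sentence mentioning the deadline is returned.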
  • 53. Conclusions on IE. UC1: simple and stable sentence patterns. UC2: multi-language, with much more complex and varied sentence structures and patterns. Fine-grained metadata are very important. Need to play with NLP. Need to establish good test cases.