Natural language search in Solr
             Tommaso Teofili, Sourcesense
   t.teofili@sourcesense.com, October 19th 2011
Agenda
▪ An approach to natural language search in Solr
▪ Main points
  • Solr-UIMA integration module
  • Custom Lucene analyzers for UIMA
  • OSS NLP algorithms in Lucene/Solr
  • Orchestrating blocks to build a sample system able to understand natural language queries
▪ Results
My Background
▪ Software engineer at Sourcesense
  • Enterprise search consultant
▪ Member of the Apache Software Foundation
  • UIMA
  • Clerezza
  • Stanbol
  • DirectMemory
  • ...
Google in ‘99
Google today
The Challenge
▪ Improved recall/precision
  • ‘articles about science’ (concepts)
  • ‘movies by K. Spacey’ vs ‘movies with K. Spacey’
▪ Easier experience for non-expert users
  • ‘people working at Google’ - ‘cities near London’
▪ Horizontal domains (e.g. Google)
▪ Vertical domains
Hurdles
▪ understanding the text of documents and user queries
▪ extracting domain-specific and domain-wide entities and concepts
▪ index/search performance
Use Case
▪ search engine for an online movies magazine
▪ Solr based
▪ non-technical users
▪ time / cost
  • Solr 3.x setup: 2 mins
  • NLS setup / tweak: 5 days
▪ expecting
  • improved recall / precision
  • more time (clicks) on site ($)
Online movies magazine
General approach
▪ Natural language processing
▪ Processing documents at indexing time
  • document text analysis
  • write enriched text in (dedicated) fields
  • add custom types / payloads to terms
▪ Processing queries at searching time
  • query analysis
  • higher boosts to entities/concepts
  • in-sentence search
  • ...
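The indexing-time step above (analyze the text, then write the enriched output into dedicated fields) can be sketched as follows. This is a minimal illustration, not the code from the talk: the `GAZETTEER` dictionary is a hypothetical stand-in for a real NER component, and the `entity_<type>` field names are assumptions.

```python
import re

# Hypothetical gazetteer standing in for a real NER component (assumption).
GAZETTEER = {
    "kevin spacey": "person",
    "london": "place",
}


def enrich(doc):
    """Return a copy of doc with one multi-valued field per entity type found."""
    enriched = dict(doc)
    text = doc["text"].lower()
    for surface, entity_type in GAZETTEER.items():
        # Whole-word match of the entity's surface form in the document text.
        if re.search(r"\b" + re.escape(surface) + r"\b", text):
            # "entity_<type>" is an illustrative field name, not a Solr convention.
            enriched.setdefault("entity_" + entity_type, []).append(surface)
    return enriched
```

Keeping entities in their own fields means the query side can target or boost them separately from the plain text field.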
NLP
▪ AI discipline
  • Computers understanding and managing information written in human language
▪ analyze text at various levels
▪ incrementally enrich / give structure
▪ extract concepts and named entities
Technical detail
▪ NLP algorithms plugged via Apache UIMA
▪ Indexing time
  • UpdateProcessor plugin (solr/contrib/uima)
  • Custom tokenizers/filters
▪ Search time
  • Custom QParserPlugin
Why Apache UIMA?
▪ OASIS standard for unstructured information management (UIM)
▪ Apache top-level project (TLP) since March 2010
▪ Deploy pipelines of Analysis Engines (AEs)
▪ AEs wrap NLP algorithms
▪ Scaling capabilities
NLP and OSS
▪ Sentence Split
  • OpenNLP, UIMA Addons, StanfordNLP
▪ PoS tagging
  • OpenNLP, UIMA Addons, StanfordNLP
▪ Chunking/Parsing
  • OpenNLP, StanfordNLP
▪ NER
  • OpenNLP, UIMA Addons, Stanbol, StanfordNLP
▪ Clustering/Classifying
  • Mahout, OpenNLP, StanfordNLP
▪ ...
Solr NLS architecture
UIMA Update Processor
Lucene analysis & UIMA
▪ Type: denotes the lexical type of a token
▪ Payload: a byte array stored at each term position
▪ tokenize / filter tokens covered by a certain annotation type
▪ store UIMA annotations’ features in types / payloads
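The idea of keeping only tokens covered by an annotation type, and carrying an annotation feature as the token's type and payload, can be sketched as below. This is a plain-Python illustration, not Lucene or UIMA API code: `Annotation`, `Token`, the `confidence` feature, and the 4-byte big-endian float encoding are all assumptions made for the example.

```python
import re
import struct
from typing import List, NamedTuple


class Annotation(NamedTuple):
    """Simplified stand-in for a UIMA annotation (hypothetical structure)."""
    begin: int
    end: int
    type: str
    confidence: float


class Token(NamedTuple):
    term: str
    type: str       # lexical type attached to the token
    payload: bytes  # byte array stored with the term position


def tokens_covered_by(text: str, annotations: List[Annotation],
                      wanted_type: str) -> List[Token]:
    """Emit only the tokens whose span lies inside an annotation of wanted_type."""
    tokens = []
    for match in re.finditer(r"\w+", text):
        for ann in annotations:
            covered = ann.begin <= match.start() and match.end() <= ann.end
            if ann.type == wanted_type and covered:
                # Encode the feature as a 4-byte big-endian float (assumed format).
                payload = struct.pack(">f", ann.confidence)
                tokens.append(Token(match.group().lower(), ann.type, payload))
                break
    return tokens
```

For the text "Kevin Spacey stars in Seven" with a single `person` annotation over characters 0 to 12, only `kevin` and `spacey` are emitted, each typed `person` and carrying the encoded confidence.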
UIMA type-aware tokenizer
Solr NLS QParser
▪ analyze user query
▪ extract (and query on) concepts / entities
▪ use types/PoS in the query for
  • boosting terms
  • synonym expansion
▪ search within sentences
▪ faceting / clustering using entities
▪ identify ‘place queries’ and expand Solr spatial queries (for filtering / boosting)
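The entity-boosting step can be sketched as a query rewrite: recognized entities become boosted clauses on a dedicated field, and the remaining words go to the plain text field. This is a hedged illustration rather than the talk's QParserPlugin: the `text` and `entity_<type>` field names, the `entities` lookup, and the default boost are assumptions, and special characters in the query are not escaped.

```python
def build_query(user_query, entities, boost=5.0):
    """Rewrite a user query into Lucene query syntax with boosted entity clauses.

    `entities` maps a lowercase surface form to an entity type; it stands in
    for the output of an NLP analysis of the query (assumption).
    """
    remaining = user_query.lower()
    clauses = []
    for surface, entity_type in entities.items():
        if surface in remaining:
            # Remove the entity from the free text and query it on its own field.
            remaining = remaining.replace(surface, " ")
            clauses.append('entity_%s:"%s"^%s' % (entity_type, surface, boost))
    leftover = remaining.split()
    if leftover:
        clauses.insert(0, "text:(%s)" % " ".join(leftover))
    return " ".join(clauses)
```

With `{"kevin spacey": "person"}` as the entity lookup, the query "movies with Kevin Spacey" becomes `text:(movies with) entity_person:"kevin spacey"^5.0`.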
Scaling architecture
Performance
▪ basic (in memory)
  • slower with NRT indexing
  • search could be significantly impacted
▪ REST (SimpleServer)
  • faster
  • need to explicitly digest results
▪ UIMA-AS
  • fast also with NRT indexing
  • fast search
  • scales nicely with lots of data
DisMax vs NLS
Wrap up
▪ general purpose architecture
▪ generally improved recall / precision
▪ the accuracy of the NLP algorithms makes the difference
▪ lots of OSS alternatives
▪ performance can be kept good
Sources
▪ Resources
  • http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/contrib/uima/
  • https://github.com/tteofili/le11-nls
▪ Links
  • http://wiki.apache.org/solr/SolrUIMA
  • http://googleblog.blogspot.com/2010/01/helping-computers-understand-language.html
Thanks
▪ http://www.sourcesense.com
▪ t.teofili@sourcesense.com
▪ @tteofili
