Introduction to
Apache Lucene/Solr
April 2014 HDSG Meetup
Rahul Jain
@rahuldausa
Who am I?
• Software Engineer @ IVY Comptech, Hyderabad
• 7 years of programming learning experience
• Built a platform to search logs in near real time with a volume of 1 TB/day#
• Worked on a Solr search based SEO/SEM software with 40 billion records/month (topic of next talk?)
• Areas of expertise/interest
– High traffic web applications
– Java/J2EE
– Big data, NoSQL
– Information retrieval, machine learning
# http://www.slideshare.net/lucenerevolution/building-a-near-real-time-search-engine-analytics-for-logs-using-solr
Agenda
• IR Overview
• Basic Concepts
• Lucene
• Solr
• Use-cases
• Solr In Action (demo)
• Q&A
Information Retrieval (IR)
“Information retrieval is the activity of
obtaining information resources (in the
form of documents) relevant to an
information need from a collection of
information resources. Searches can
be based on metadata or on full-text
(or other content-based) indexing”
- Wikipedia
Basic Concepts
• tf (t in d) : term frequency in a document
– measure of how often a term appears in the document
– the number of times term t appears in the currently scored document d
• idf (t) : inverse document frequency
– measure of whether the term is common or rare across all documents, i.e. how often the term appears across the index
– obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient
• boost (index) : boost of the field at index-time
• boost (query) : boost of the field at query-time
Basic Concepts
TF - IDF
TF-IDF = Term Frequency × Inverse Document Frequency
Credit: http://whatisgraphsearch.com/
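The tf and idf definitions above can be sketched in a few lines of Python. This is only an illustration of the textbook formula: the function names and the tiny corpus are made up for the example, and Lucene's own TFIDFSimilarity applies additional damping and normalization factors, so real scores will differ.

```python
import math
from collections import Counter


def tf(term, doc_tokens):
    """Raw term frequency: number of times `term` occurs in one document."""
    return Counter(doc_tokens)[term]


def idf(term, corpus):
    """log(total documents / documents containing the term)."""
    containing = sum(1 for doc in corpus if term in doc)
    if containing == 0:
        return 0.0  # term never appears; avoid division by zero
    return math.log(len(corpus) / containing)


def tf_idf(term, doc_tokens, corpus):
    return tf(term, doc_tokens) * idf(term, corpus)


# Hypothetical three-document corpus, already tokenized.
corpus = [
    "the quick brown fox".split(),
    "the lazy dog".split(),
    "the quick dog jumps".split(),
]

print(tf_idf("the", corpus[0], corpus))  # 0.0, "the" occurs in every document
print(tf_idf("fox", corpus[0], corpus))  # log(3), "fox" occurs in one document only
```

A term that appears everywhere ("the") contributes nothing to the score, while a rare term ("fox") is weighted up.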
Apache Lucene
Apache Lucene
• Fast, high-performance, scalable search/IR library
• Open source
• Initially developed by Doug Cutting (also co-creator of Hadoop)
• Indexing and searching
• Inverted index of documents
• Provides advanced search options such as synonyms, stopwords, and similarity- and proximity-based matching
• http://lucene.apache.org/
Lucene Internals - Inverted Index
Credit: https://developer.apple.com/library/mac/documentation/userexperience/conceptual/SearchKitConcepts/searchKit_basics/searchKit_basics.html
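As a rough illustration of the inverted index idea, the sketch below maps each term to the set of document ids that contain it and answers an AND query by intersecting those sets. The documents, function names, and the lowercase-and-split tokenization are assumptions for the example; Lucene's on-disk index also stores positions, frequencies, and more.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased token to the set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def search(index, *terms):
    """AND query: ids of documents that contain every given term."""
    postings = [index.get(term.lower(), set()) for term in terms]
    return set.intersection(*postings) if postings else set()


# Hypothetical documents keyed by id.
docs = {
    1: "Apache Lucene is a search library",
    2: "Apache Solr is built on Lucene",
    3: "Hadoop stores big data",
}
index = build_inverted_index(docs)

print(search(index, "apache", "lucene"))  # {1, 2}
print(search(index, "solr"))              # {2}
```

Looking up a term is a dictionary access rather than a scan over every document, which is what makes this structure suitable for search.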
Lucene Internals (Contd.)
• Defines a document model
• An index contains documents.
• Each document consists of fields.
• Each field has attributes.
– What is the data type (FieldType)
– How to handle the content (Analyzers, Filters)
– Whether it is stored (stored="true") and/or indexed (indexed="true")
Indexing Pipeline
• Analyzer : creates tokens using a Tokenizer and/or by applying Filters (Token Filters)
• Each field can define an Analyzer for index time, for query time, or for both.
Credit : http://www.slideshare.net/otisg/lucene-introduction
Analysis Process - Tokenizer
WhitespaceAnalyzer
Simplest built-in analyzer
The quick brown fox jumps over the lazy dog.
[The] [quick] [brown] [fox] [jumps] [over] [the] [lazy] [dog.]
Tokens
Analysis Process - Tokenizer
SimpleAnalyzer
Lowercases, splits at non-letter boundaries
The quick brown fox jumps over the lazy dog.
[the] [quick] [brown] [fox] [jumps] [over] [the] [lazy] [dog]
Tokens
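The two analyzers above can be imitated in plain Python to see why their token lists differ. These functions are stand-ins written for this example, not Lucene code; they only reproduce the behavior described on the slides (split on whitespace, versus lowercase and split at non-letters).

```python
import re


def whitespace_analyze(text):
    """Imitates WhitespaceAnalyzer: split on whitespace, change nothing else."""
    return text.split()


def simple_analyze(text):
    """Imitates SimpleAnalyzer: keep runs of letters only, then lowercase."""
    return [token.lower() for token in re.findall(r"[^\W\d_]+", text)]


sentence = "The quick brown fox jumps over the lazy dog."

print(whitespace_analyze(sentence))
# ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog.']
print(simple_analyze(sentence))
# ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
```

Note that only the second version lets a query for "dog" or "the" match regardless of case or trailing punctuation.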
Apache Solr
Apache Solr
• Created by Yonik Seeley for CNET
• Enterprise search platform built on Apache Lucene
• Open source
• Highly reliable, scalable, fault tolerant
• Supports distributed indexing (SolrCloud), replication, and load-balanced querying
• http://lucene.apache.org/solr
High level overview
Source: http://www.slideshare.net/erikhatcher/solr-search-at-the-speed-of-light
Apache Solr - Features
• full-text search
• faceted search (similar to GroupBy clause in RDBMS)
• scalability
– caching
– replication
– distributed search
• near real-time indexing
• geospatial search
• and many more : highlighting, database integration, rich document
(e.g., Word, PDF) handling
How to start
It’s very easy.
1. Start Solr
java -jar start.jar
2. Index your data
java -jar post.jar *.xml
3. Search
http://localhost:8983/solr
Solr APIs
• HTTP GET/POST
• JSON/XML
• Clients
– SolrJ (embedded or HTTP)
– solr-ruby
– python, PHP, solrsharp
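Because Solr is queried over plain HTTP, a client only needs to build a URL. The sketch below assembles a request for the select handler using the standard q, rows, and wt parameters. The core name collection1 and the helper name are assumptions for this example (adjust the base URL to your installation), and the code only constructs the URL without sending it.

```python
from urllib.parse import urlencode


def build_select_url(base_url, query, rows=10, fmt="json"):
    """Build a Solr select URL; q is the query, rows the page size, wt the response format."""
    params = {"q": query, "rows": rows, "wt": fmt}
    return f"{base_url}/select?{urlencode(params)}"


# Assumed local Solr instance with a core named "collection1".
url = build_select_url("http://localhost:8983/solr/collection1", "title:solr")
print(url)
# http://localhost:8983/solr/collection1/select?q=title%3Asolr&rows=10&wt=json
```

The resulting URL can be opened in a browser or fetched with any HTTP client; switching wt to xml requests an XML response instead.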
Solr – schema.xml
• Types with index and query Analyzers - similar to data types
• Fields with name, type and options
• Unique Key : unique identifier of a document, e.g. “id”
• Dynamic Fields : allow Solr to index fields that you did not explicitly define in your schema, e.g. fieldName: *_i or *_txts
• Copy Fields : a mechanism for making copies of fields so that you can apply several distinct field types to a single piece of incoming information. Field ‘a’ populates field ‘b’ with its value before tokenizing, so each can have a different analyzer/filter.
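To make the dynamic field idea concrete, here is a small Python sketch of how a field name could be resolved: explicitly declared fields win, otherwise the name is matched against wildcard patterns such as *_i. The field names, type names, and the first-match rule are assumptions for illustration; Solr's own precedence among overlapping patterns is not modeled here.

```python
from fnmatch import fnmatchcase

# Hypothetical schema: explicit fields plus wildcard (dynamic) patterns.
EXPLICIT_FIELDS = {"id": "string", "title": "text_general"}
DYNAMIC_FIELDS = [("*_i", "int"), ("*_txts", "text_general")]


def resolve_field_type(name):
    """Return the type for a field name, or None if the schema does not cover it."""
    if name in EXPLICIT_FIELDS:
        return EXPLICIT_FIELDS[name]
    for pattern, field_type in DYNAMIC_FIELDS:
        if fnmatchcase(name, pattern):
            return field_type
    return None


print(resolve_field_type("id"))        # string
print(resolve_field_type("price_i"))   # int
print(resolve_field_type("unknown"))   # None
```

A document can therefore carry a field like price_i without any schema change, as long as a matching pattern exists.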
Solr – Content Analysis
• Field Attributes
– Name : name of the field
– Type : data type (FieldType) of the field
– Indexed : should it be indexed (indexed="true/false")
– Stored : should it be stored (stored="true/false")
– Required : is it a mandatory field (required="true/false")
– Multi-Valued : will it contain multiple values, e.g. text: pizza, food (multiValued="true/false")
e.g. <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
Solr – solrconfig.xml
• Data dir: where all index data will be stored
• Index configuration
• Cache configurations
• Request Handler configuration
• Search components, response writers, query
parsers
Query Types
• Single and multi term queries
– e.g. fieldname:value or title:(software engineer)
• +, -, AND, OR, NOT operators
– e.g. title:(software AND engineer)
• Range queries on date or numeric fields
– e.g. timestamp:[* TO NOW] or price:[1 TO 100]
• Boost queries
– e.g. title:Engineer^1.5 OR text:Engineer
• Fuzzy search : a search for words that are similar in spelling
– e.g. roam~0.8 => noam
• Proximity search : a sloppy phrase query. The closer together the two terms appear, the higher the score.
– e.g. "apache lucene"~20 : looks for all documents where the word “apache” occurs within 20 words of “lucene”
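The proximity condition can be approximated in Python by counting how many words sit between two terms. This is a simplified sketch: the function name is made up, term order is ignored, and Lucene actually measures slop as the number of position moves needed to line the terms up, so out-of-order matches cost more there.

```python
def within_slop(tokens, first, second, slop):
    """True if some occurrence of `first` and `second` has at most `slop` words between them."""
    pos_first = [i for i, token in enumerate(tokens) if token == first]
    pos_second = [i for i, token in enumerate(tokens) if token == second]
    return any(abs(a - b) - 1 <= slop for a in pos_first for b in pos_second)


tokens = "apache solr is built on top of lucene".split()

print(within_slop(tokens, "apache", "lucene", 20))  # True, six words in between
print(within_slop(tokens, "apache", "lucene", 5))   # False, the gap is too large
```

With a slop of 0 the check degenerates to an exact adjacent-phrase match, which mirrors how a plain quoted phrase behaves.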
Solr/Lucene Use-cases
• Search
• Analytics
• NoSQL datastore
• Auto-suggestion / Auto-correction
• Recommendation Engine (MoreLikeThis)
• Relevancy Engine (Feedback to other applications)
• Solr as a White-List
• GeoSpatial based Search
Search
• Application
– Eclipse, Hibernate search
• E-Commerce :
– Flipkart.com, Infibeam.com, Buy.com, Netflix.com, ebay.com
• Jobs
– Indeed.com, Simplyhired.com, Naukri.com
• Auto
– AOL.com
• Travel
– Cleartrip.com
• Social Network
– Twitter.com, LinkedIn.com, mylife.com
Source: http://www.quora.com/Which-major-companies-are-using-Solr-for-search
Search (Contd.)
• Search Engine
– Yandex.ru, DuckDuckGo.com
• News Paper
– Guardian.co.uk
• Music/Movies
– Apple.com, Netflix.com
• Events
– Stubhub.com, Eventbrite.com
• Cloud Log Management
– Loggly.com
• Others
– Whitehouse.gov
Faceting
• Grouping results based on field value
• Facet on: field terms, queries, date ranges
• &facet=on&facet.field=job_title&facet.query=salary:[30000 TO 100000]
• http://wiki.apache.org/solr/SimpleFacetParameters
Source: www.career9.com, www.indeed.com
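What the facet parameters compute can be mimicked in Python: facet.field counts matching documents per distinct field value, and facet.query counts documents satisfying an extra condition such as a salary range. The job records and function names below are invented for the example; Solr computes these counts from its index rather than by scanning documents.

```python
from collections import Counter

# Hypothetical search results.
jobs = [
    {"job_title": "Software Engineer", "salary": 45000},
    {"job_title": "Software Engineer", "salary": 120000},
    {"job_title": "Data Analyst", "salary": 38000},
    {"job_title": "Data Analyst", "salary": 25000},
    {"job_title": "Architect", "salary": 100000},
]


def facet_field(docs, field):
    """Like facet.field: number of documents per distinct value of `field`."""
    return Counter(doc[field] for doc in docs)


def facet_query_range(docs, field, low, high):
    """Like facet.query=field:[low TO high]: documents within the inclusive range."""
    return sum(1 for doc in docs if low <= doc[field] <= high)


print(facet_field(jobs, "job_title"))
# Counter({'Software Engineer': 2, 'Data Analyst': 2, 'Architect': 1})
print(facet_query_range(jobs, "salary", 30000, 100000))  # 3
```

These counts are what a search page shows next to each filter, for example "Software Engineer (2)".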
Analytics
• Analytics source : Kibana.org, based on Elasticsearch and Logstash
• Image source : http://semicomplete.com/presentations/logstash-monitorama-2013/#/8
Autosuggestion
Source: www.drupal.org, www.yelp.com
Integration
• Clustering (Solr-Carrot2)
• Named Entity extraction (Solr-UIMA)
• SolrCloud (Solr-Zookeeper)
• Parsing of many Different File Formats (Solr-Tika)
• Machine Learning/Data Mining (Apache Mahout)
• Large scale Indexing (Hadoop)
References
• http://en.wikipedia.org/wiki/Tf%E2%80%93idf
• http://lucene.apache.org/core/4_5_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html
• http://www.quora.com/Which-major-companies-are-using-Solr-for-search
• http://marc.info/?l=solr-user&m=137271228610366&w=2
• http://java.dzone.com/articles/apache-solr-get-started-get
Solr/Lucene Meetup
• Building Big Data Analytics Platforms using Elasticsearch
(Kibana)
• Saturday, April 19, 2014 10:00 AM
• IIIT Hyderabad
• URL: http://www.meetup.com/Hyderabad-Apache-Solr-Lucene-Group/events/150134392/
OR
• Search on Google …
Thanks!
@rahuldausa on twitter and slideshare
http://www.linkedin.com/in/rahuldausa
Find this interesting?
Join us @ http://www.meetup.com/Hyderabad-Apache-Solr-Lucene-Group/
33

More Related Content

What's hot

Rapid Prototyping with Solr
Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with SolrErik Hatcher
 
Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to SolrErik Hatcher
 
Get the most out of Solr search with PHP
Get the most out of Solr search with PHPGet the most out of Solr search with PHP
Get the most out of Solr search with PHPPaul Borgermans
 
Beyond full-text searches with Lucene and Solr
Beyond full-text searches with Lucene and SolrBeyond full-text searches with Lucene and Solr
Beyond full-text searches with Lucene and SolrBertrand Delacretaz
 
20130310 solr tuorial
20130310 solr tuorial20130310 solr tuorial
20130310 solr tuorialChris Huang
 
Intro to Apache Lucene and Solr
Intro to Apache Lucene and SolrIntro to Apache Lucene and Solr
Intro to Apache Lucene and SolrGrant Ingersoll
 
Enterprise Search Using Apache Solr
Enterprise Search Using Apache SolrEnterprise Search Using Apache Solr
Enterprise Search Using Apache Solrsagar chaturvedi
 
Retrieving Information From Solr
Retrieving Information From SolrRetrieving Information From Solr
Retrieving Information From SolrRamzi Alqrainy
 
Building your own search engine with Apache Solr
Building your own search engine with Apache SolrBuilding your own search engine with Apache Solr
Building your own search engine with Apache SolrBiogeeks
 
Solr Recipes Workshop
Solr Recipes WorkshopSolr Recipes Workshop
Solr Recipes WorkshopErik Hatcher
 
Solr Application Development Tutorial
Solr Application Development TutorialSolr Application Development Tutorial
Solr Application Development TutorialErik Hatcher
 
New-Age Search through Apache Solr
New-Age Search through Apache SolrNew-Age Search through Apache Solr
New-Age Search through Apache SolrEdureka!
 
Apache Solr/Lucene Internals by Anatoliy Sokolenko
Apache Solr/Lucene Internals  by Anatoliy SokolenkoApache Solr/Lucene Internals  by Anatoliy Sokolenko
Apache Solr/Lucene Internals by Anatoliy SokolenkoProvectus
 
Lucene's Latest (for Libraries)
Lucene's Latest (for Libraries)Lucene's Latest (for Libraries)
Lucene's Latest (for Libraries)Erik Hatcher
 
Building Intelligent Search Applications with Apache Solr and PHP5
Building Intelligent Search Applications with Apache Solr and PHP5Building Intelligent Search Applications with Apache Solr and PHP5
Building Intelligent Search Applications with Apache Solr and PHP5israelekpo
 
Battle of the giants: Apache Solr vs ElasticSearch
Battle of the giants: Apache Solr vs ElasticSearchBattle of the giants: Apache Solr vs ElasticSearch
Battle of the giants: Apache Solr vs ElasticSearchRafał Kuć
 
Rapid Prototyping with Solr
Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with SolrErik Hatcher
 

What's hot (20)

Rapid Prototyping with Solr
Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with Solr
 
Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to Solr
 
Get the most out of Solr search with PHP
Get the most out of Solr search with PHPGet the most out of Solr search with PHP
Get the most out of Solr search with PHP
 
Beyond full-text searches with Lucene and Solr
Beyond full-text searches with Lucene and SolrBeyond full-text searches with Lucene and Solr
Beyond full-text searches with Lucene and Solr
 
20130310 solr tuorial
20130310 solr tuorial20130310 solr tuorial
20130310 solr tuorial
 
Intro to Apache Lucene and Solr
Intro to Apache Lucene and SolrIntro to Apache Lucene and Solr
Intro to Apache Lucene and Solr
 
Enterprise Search Using Apache Solr
Enterprise Search Using Apache SolrEnterprise Search Using Apache Solr
Enterprise Search Using Apache Solr
 
Retrieving Information From Solr
Retrieving Information From SolrRetrieving Information From Solr
Retrieving Information From Solr
 
Building your own search engine with Apache Solr
Building your own search engine with Apache SolrBuilding your own search engine with Apache Solr
Building your own search engine with Apache Solr
 
Solr Recipes Workshop
Solr Recipes WorkshopSolr Recipes Workshop
Solr Recipes Workshop
 
Solr Application Development Tutorial
Solr Application Development TutorialSolr Application Development Tutorial
Solr Application Development Tutorial
 
New-Age Search through Apache Solr
New-Age Search through Apache SolrNew-Age Search through Apache Solr
New-Age Search through Apache Solr
 
Apache Solr/Lucene Internals by Anatoliy Sokolenko
Apache Solr/Lucene Internals  by Anatoliy SokolenkoApache Solr/Lucene Internals  by Anatoliy Sokolenko
Apache Solr/Lucene Internals by Anatoliy Sokolenko
 
Lucene's Latest (for Libraries)
Lucene's Latest (for Libraries)Lucene's Latest (for Libraries)
Lucene's Latest (for Libraries)
 
Solr Recipes
Solr RecipesSolr Recipes
Solr Recipes
 
Lucene basics
Lucene basicsLucene basics
Lucene basics
 
Building Intelligent Search Applications with Apache Solr and PHP5
Building Intelligent Search Applications with Apache Solr and PHP5Building Intelligent Search Applications with Apache Solr and PHP5
Building Intelligent Search Applications with Apache Solr and PHP5
 
Solr Architecture
Solr ArchitectureSolr Architecture
Solr Architecture
 
Battle of the giants: Apache Solr vs ElasticSearch
Battle of the giants: Apache Solr vs ElasticSearchBattle of the giants: Apache Solr vs ElasticSearch
Battle of the giants: Apache Solr vs ElasticSearch
 
Rapid Prototyping with Solr
Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with Solr
 

Similar to Introduction to Apache Lucene/Solr

Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to SolrErik Hatcher
 
Introduction to Elasticsearch with basics of Lucene
Introduction to Elasticsearch with basics of LuceneIntroduction to Elasticsearch with basics of Lucene
Introduction to Elasticsearch with basics of LuceneRahul Jain
 
Apache Solr - Enterprise search platform
Apache Solr - Enterprise search platformApache Solr - Enterprise search platform
Apache Solr - Enterprise search platformTommaso Teofili
 
Solr Powered Lucene
Solr Powered LuceneSolr Powered Lucene
Solr Powered LuceneErik Hatcher
 
Apache Solr Workshop
Apache Solr WorkshopApache Solr Workshop
Apache Solr WorkshopJSGB
 
Let's Build an Inverted Index: Introduction to Apache Lucene/Solr
Let's Build an Inverted Index: Introduction to Apache Lucene/SolrLet's Build an Inverted Index: Introduction to Apache Lucene/Solr
Let's Build an Inverted Index: Introduction to Apache Lucene/SolrSease
 
Solr search engine with multiple table relation
Solr search engine with multiple table relationSolr search engine with multiple table relation
Solr search engine with multiple table relationJay Bharat
 
Scaling Recommendations, Semantic Search, & Data Analytics with solr
Scaling Recommendations, Semantic Search, & Data Analytics with solrScaling Recommendations, Semantic Search, & Data Analytics with solr
Scaling Recommendations, Semantic Search, & Data Analytics with solrTrey Grainger
 
Intro to Apache Solr for Drupal
Intro to Apache Solr for DrupalIntro to Apache Solr for Drupal
Intro to Apache Solr for DrupalChris Caple
 
Apache Solr-Webinar
Apache Solr-WebinarApache Solr-Webinar
Apache Solr-WebinarEdureka!
 
Solr at zvents 6 years later & still going strong
Solr at zvents   6 years later & still going strongSolr at zvents   6 years later & still going strong
Solr at zvents 6 years later & still going stronglucenerevolution
 
Practical Machine Learning for Smarter Search with Spark+Solr
Practical Machine Learning for Smarter Search with Spark+SolrPractical Machine Learning for Smarter Search with Spark+Solr
Practical Machine Learning for Smarter Search with Spark+SolrJake Mannix
 
Practical Machine Learning for Smarter Search with Solr and Spark
Practical Machine Learning for Smarter Search with Solr and SparkPractical Machine Learning for Smarter Search with Solr and Spark
Practical Machine Learning for Smarter Search with Solr and SparkJake Mannix
 
Hortonworks Technical Workshop - HDP Search
Hortonworks Technical Workshop - HDP Search Hortonworks Technical Workshop - HDP Search
Hortonworks Technical Workshop - HDP Search Hortonworks
 

Similar to Introduction to Apache Lucene/Solr (20)

Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to Solr
 
Introduction to Elasticsearch with basics of Lucene
Introduction to Elasticsearch with basics of LuceneIntroduction to Elasticsearch with basics of Lucene
Introduction to Elasticsearch with basics of Lucene
 
Apache solr
Apache solrApache solr
Apache solr
 
Apache Solr - Enterprise search platform
Apache Solr - Enterprise search platformApache Solr - Enterprise search platform
Apache Solr - Enterprise search platform
 
Solr Powered Lucene
Solr Powered LuceneSolr Powered Lucene
Solr Powered Lucene
 
Solr
SolrSolr
Solr
 
Apache Solr Workshop
Apache Solr WorkshopApache Solr Workshop
Apache Solr Workshop
 
Let's Build an Inverted Index: Introduction to Apache Lucene/Solr
Let's Build an Inverted Index: Introduction to Apache Lucene/SolrLet's Build an Inverted Index: Introduction to Apache Lucene/Solr
Let's Build an Inverted Index: Introduction to Apache Lucene/Solr
 
Solr search engine with multiple table relation
Solr search engine with multiple table relationSolr search engine with multiple table relation
Solr search engine with multiple table relation
 
Scaling Recommendations, Semantic Search, & Data Analytics with solr
Scaling Recommendations, Semantic Search, & Data Analytics with solrScaling Recommendations, Semantic Search, & Data Analytics with solr
Scaling Recommendations, Semantic Search, & Data Analytics with solr
 
Solr 8 interview
Solr 8 interview Solr 8 interview
Solr 8 interview
 
Intro to Apache Solr for Drupal
Intro to Apache Solr for DrupalIntro to Apache Solr for Drupal
Intro to Apache Solr for Drupal
 
Solr 101
Solr 101Solr 101
Solr 101
 
Apache Solr-Webinar
Apache Solr-WebinarApache Solr-Webinar
Apache Solr-Webinar
 
SolrCloud on Hadoop
SolrCloud on HadoopSolrCloud on Hadoop
SolrCloud on Hadoop
 
Solr at zvents 6 years later & still going strong
Solr at zvents   6 years later & still going strongSolr at zvents   6 years later & still going strong
Solr at zvents 6 years later & still going strong
 
Practical Machine Learning for Smarter Search with Spark+Solr
Practical Machine Learning for Smarter Search with Spark+SolrPractical Machine Learning for Smarter Search with Spark+Solr
Practical Machine Learning for Smarter Search with Spark+Solr
 
Practical Machine Learning for Smarter Search with Solr and Spark
Practical Machine Learning for Smarter Search with Solr and SparkPractical Machine Learning for Smarter Search with Solr and Spark
Practical Machine Learning for Smarter Search with Solr and Spark
 
Hortonworks Technical Workshop - HDP Search
Hortonworks Technical Workshop - HDP Search Hortonworks Technical Workshop - HDP Search
Hortonworks Technical Workshop - HDP Search
 
Multilingual searchapi
Multilingual searchapiMultilingual searchapi
Multilingual searchapi
 

More from Rahul Jain

Flipkart Strategy Analysis and Recommendation
Flipkart Strategy Analysis and RecommendationFlipkart Strategy Analysis and Recommendation
Flipkart Strategy Analysis and RecommendationRahul Jain
 
Emerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big DataEmerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big DataRahul Jain
 
Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )Rahul Jain
 
Building a Large Scale SEO/SEM Application with Apache Solr
Building a Large Scale SEO/SEM Application with Apache SolrBuilding a Large Scale SEO/SEM Application with Apache Solr
Building a Large Scale SEO/SEM Application with Apache SolrRahul Jain
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkRahul Jain
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache SparkRahul Jain
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine LearningRahul Jain
 
Introduction to Scala
Introduction to ScalaIntroduction to Scala
Introduction to ScalaRahul Jain
 
What is NoSQL and CAP Theorem
What is NoSQL and CAP TheoremWhat is NoSQL and CAP Theorem
What is NoSQL and CAP TheoremRahul Jain
 
Introduction to Kafka and Zookeeper
Introduction to Kafka and ZookeeperIntroduction to Kafka and Zookeeper
Introduction to Kafka and ZookeeperRahul Jain
 
Hadoop & HDFS for Beginners
Hadoop & HDFS for BeginnersHadoop & HDFS for Beginners
Hadoop & HDFS for BeginnersRahul Jain
 
Hibernate tutorial for beginners
Hibernate tutorial for beginnersHibernate tutorial for beginners
Hibernate tutorial for beginnersRahul Jain
 

More from Rahul Jain (13)

Flipkart Strategy Analysis and Recommendation
Flipkart Strategy Analysis and RecommendationFlipkart Strategy Analysis and Recommendation
Flipkart Strategy Analysis and Recommendation
 
Emerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big DataEmerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big Data
 
Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )
 
Building a Large Scale SEO/SEM Application with Apache Solr
Building a Large Scale SEO/SEM Application with Apache SolrBuilding a Large Scale SEO/SEM Application with Apache Solr
Building a Large Scale SEO/SEM Application with Apache Solr
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache Spark
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
 
Introduction to Scala
Introduction to ScalaIntroduction to Scala
Introduction to Scala
 
What is NoSQL and CAP Theorem
What is NoSQL and CAP TheoremWhat is NoSQL and CAP Theorem
What is NoSQL and CAP Theorem
 
Introduction to Kafka and Zookeeper
Introduction to Kafka and ZookeeperIntroduction to Kafka and Zookeeper
Introduction to Kafka and Zookeeper
 
Apache kafka
Apache kafkaApache kafka
Apache kafka
 
Hadoop & HDFS for Beginners
Hadoop & HDFS for BeginnersHadoop & HDFS for Beginners
Hadoop & HDFS for Beginners
 
Hibernate tutorial for beginners
Hibernate tutorial for beginnersHibernate tutorial for beginners
Hibernate tutorial for beginners
 

Recently uploaded

US Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of ActionUS Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of ActionMebane Rash
 
Introduction-To-Agricultural-Surveillance-Rover.pptx
Introduction-To-Agricultural-Surveillance-Rover.pptxIntroduction-To-Agricultural-Surveillance-Rover.pptx
Introduction-To-Agricultural-Surveillance-Rover.pptxk795866
 
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort serviceGurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort servicejennyeacort
 
Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024hassan khalil
 
Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfg
Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfgUnit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfg
Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfgsaravananr517913
 
welding defects observed during the welding
welding defects observed during the weldingwelding defects observed during the welding
welding defects observed during the weldingMuhammadUzairLiaqat
 
Transport layer issues and challenges - Guide
Transport layer issues and challenges - GuideTransport layer issues and challenges - Guide
Transport layer issues and challenges - GuideGOPINATHS437943
 
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor CatchersTechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catcherssdickerson1
 
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdfCCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdfAsst.prof M.Gokilavani
 
Electronically Controlled suspensions system .pdf
Electronically Controlled suspensions system .pdfElectronically Controlled suspensions system .pdf
Electronically Controlled suspensions system .pdfme23b1001
 
Earthing details of Electrical Substation
Earthing details of Electrical SubstationEarthing details of Electrical Substation
Earthing details of Electrical Substationstephanwindworld
 
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube ExchangerStudy on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube ExchangerAnamika Sarkar
 
Indian Dairy Industry Present Status and.ppt
Indian Dairy Industry Present Status and.pptIndian Dairy Industry Present Status and.ppt
Indian Dairy Industry Present Status and.pptMadan Karki
 
complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...asadnawaz62
 
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfCCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfAsst.prof M.Gokilavani
 
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)Dr SOUNDIRARAJ N
 
Risk Assessment For Installation of Drainage Pipes.pdf
Risk Assessment For Installation of Drainage Pipes.pdfRisk Assessment For Installation of Drainage Pipes.pdf
Risk Assessment For Installation of Drainage Pipes.pdfROCENODodongVILLACER
 
8251 universal synchronous asynchronous receiver transmitter
8251 universal synchronous asynchronous receiver transmitter8251 universal synchronous asynchronous receiver transmitter
8251 universal synchronous asynchronous receiver transmitterShivangiSharma879191
 

Recently uploaded (20)

US Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of ActionUS Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of Action
 
Introduction-To-Agricultural-Surveillance-Rover.pptx
Introduction-To-Agricultural-Surveillance-Rover.pptxIntroduction-To-Agricultural-Surveillance-Rover.pptx
Introduction-To-Agricultural-Surveillance-Rover.pptx
 
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort serviceGurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
 
Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024
 
Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfg
Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfgUnit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfg
Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfg
 
welding defects observed during the welding
welding defects observed during the weldingwelding defects observed during the welding
welding defects observed during the welding
 
Transport layer issues and challenges - Guide
Transport layer issues and challenges - GuideTransport layer issues and challenges - Guide
Transport layer issues and challenges - Guide
 
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor CatchersTechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
 
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdfCCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
 
Electronically Controlled suspensions system .pdf

Introduction to Apache Lucene/Solr

  • 1. Introduction to Apache Lucene/Solr April 2014 HDSG Meetup Rahul Jain @rahuldausa
  • 2. Who am I? • Software Engineer @ IVY Comptech, Hyderabad • 7 years of programming learning experience • Built a platform to search logs in near real time with a volume of 1TB/day# • Worked on a Solr search based SEO/SEM software with 40 billion records/month (Topic of next talk?) • Areas of expertise/interest: high traffic web applications; JAVA/J2EE; Big data, NoSQL; Information Retrieval, Machine learning # http://www.slideshare.net/lucenerevolution/building-a-near-real-time-search-engine-analytics-for-logs-using-solr
  • 3. Agenda • IR Overview • Basic Concepts • Lucene • Solr • Use-cases • Solr In Action (demo) • Q&A
  • 4. Information Retrieval (IR) “Information retrieval is the activity of obtaining information resources (in the form of documents) relevant to an information need from a collection of information resources. Searches can be based on metadata or on full-text (or other content-based) indexing” - Wikipedia
  • 5. Basic Concepts • tf (t in d): term frequency in a document • measure of how often a term appears in the document • the number of times term t appears in the currently scored document d • idf (t): inverse document frequency • measure of whether the term is common or rare across all documents, i.e. how often the term appears across the index • obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient • boost (index): boost of the field at index time • boost (query): boost of the field at query time
  • 6. Basic Concepts: TF-IDF • TF-IDF = Term Frequency × Inverse Document Frequency • Credit: http://whatisgraphsearch.com/
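The tf and idf definitions on slides 5 and 6 can be sketched in a few lines of Python. This is only an illustration of the plain textbook formula described on the slides (raw term count times the logarithm of total documents over documents containing the term); the function name `tf_idf` and the toy corpus are assumptions made for the example, and Lucene's own TFIDFSimilarity applies additional damping, boosts and normalization that are omitted here.

```python
import math
from collections import Counter


def tf_idf(term, doc_tokens, corpus):
    """Textbook tf-idf of `term` for one document (hypothetical helper)."""
    # tf: number of times the term appears in the scored document
    tf = Counter(doc_tokens)[term]
    # df: number of documents in the corpus that contain the term
    df = sum(1 for doc in corpus if term in doc)
    if df == 0:
        return 0.0  # term never occurs in the corpus
    # idf: log of (total documents / documents containing the term)
    idf = math.log(len(corpus) / df)
    return tf * idf


# Toy corpus of already-tokenized documents (illustrative data only)
corpus = [
    ["apache", "lucene", "search", "library"],
    ["apache", "solr", "search", "platform"],
    ["apache", "hadoop", "big", "data"],
]

for term in ("apache", "search", "lucene"):
    print(term, round(tf_idf(term, corpus[0], corpus), 3))
```

A term that occurs in every document ("apache") scores zero, while a term unique to one document ("lucene") scores highest, which is the behaviour the idf bullet describes.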
  • 8. Apache Lucene • Fast, high performance, scalable search/IR library • Open source • Initially developed by Doug Cutting (also the author of Hadoop) • Indexing and searching • Inverted index of documents • Provides advanced search options such as synonyms, stopwords, and similarity-based and proximity search • http://lucene.apache.org/
  • 9. Lucene Internals - Inverted Index • Credit: https://developer.apple.com/library/mac/documentation/userexperience/conceptual/SearchKitConcepts/searchKit_basics/searchKit_basics.html
  • 10. Lucene Internals (Contd.) • Defines a document model • An index contains documents. • Each document consists of fields. • Each field has attributes: – What is the data type (FieldType) – How to handle the content (Analyzers, Filters) – Is it a stored field (stored="true") or an indexed field (indexed="true")
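To make the inverted index and the document/field model from slides 9 and 10 concrete, here is a minimal Python sketch. It is not how Lucene stores its index on disk; it only shows the core idea of mapping each (field, term) pair to the set of documents that contain it. The names `build_inverted_index` and `search`, the lowercase-and-split tokenization, and the sample documents are all assumptions made for this illustration.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map (field, term) -> set of ids of documents containing that term."""
    index = defaultdict(set)
    for doc_id, fields in docs.items():
        for field, text in fields.items():
            # Naive analysis for the sketch: lowercase, split on whitespace
            for token in text.lower().split():
                index[(field, token)].add(doc_id)
    return index


def search(index, field, term):
    """Return the sorted ids of documents whose `field` contains `term`."""
    return sorted(index.get((field, term.lower()), set()))


# Each document is a set of named fields (illustrative data only)
docs = {
    1: {"title": "Apache Lucene", "body": "search library"},
    2: {"title": "Apache Solr", "body": "search platform built on Lucene"},
}

index = build_inverted_index(docs)
print(search(index, "title", "apache"))
print(search(index, "body", "lucene"))
```

Looking a term up is a single dictionary access rather than a scan over every document, which is why search libraries build this structure at index time.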
  • 11. Indexing Pipeline • Analyzer: creates tokens using a Tokenizer and/or by applying Filters (Token Filters) • Each field can define an Analyzer for index time, for query time, or for both. • Credit: http://www.slideshare.net/otisg/lucene-introduction
  • 12. Analysis Process - Tokenizer • WhitespaceAnalyzer: simplest built-in analyzer • Input: The quick brown fox jumps over the lazy dog. • Tokens: [The] [quick] [brown] [fox] [jumps] [over] [the] [lazy] [dog.]
  • 13. Analysis Process - Tokenizer • SimpleAnalyzer: lowercases and splits at non-letter boundaries • Input: The quick brown fox jumps over the lazy dog. • Tokens: [the] [quick] [brown] [fox] [jumps] [over] [the] [lazy] [dog]
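The difference between the two analyzers on slides 12 and 13 can be reproduced with a short Python sketch. These functions only imitate the behaviour described on the slides (split on whitespace versus lowercase and split at non-letters); they are not Lucene's implementations, and the names `whitespace_analyze` and `simple_analyze` are made up for the example.

```python
import re


def whitespace_analyze(text):
    """Imitates WhitespaceAnalyzer: split on whitespace, change nothing else."""
    return text.split()


def simple_analyze(text):
    """Imitates SimpleAnalyzer: lowercase, keep only runs of letters."""
    # [^\W\d_]+ matches one or more letters (no digits, no underscore)
    return re.findall(r"[^\W\d_]+", text.lower())


sentence = "The quick brown fox jumps over the lazy dog."
print(whitespace_analyze(sentence))
print(simple_analyze(sentence))
```

Note how the whitespace version keeps the capital in "The" and the trailing period in "dog.", so a query for "dog" would only match the second token stream.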
  • 15. Apache Solr • Created by Yonik Seeley for CNET • Enterprise search platform built on Apache Lucene • Open source • Highly reliable, scalable, fault tolerant • Supports distributed indexing (SolrCloud), replication, and load-balanced querying • http://lucene.apache.org/solr
  • 16. High level overview Source: http://www.slideshare.net/erikhatcher/solr-search-at-the-speed-of-light
  • 17. Apache Solr - Features • full-text search • faceted search (similar to a GROUP BY clause in an RDBMS) • scalability – caching – replication – distributed search • near real-time indexing • geospatial search • and many more: highlighting, database integration, rich document (e.g., Word, PDF) handling
  • 18. How to start • It’s very easy. 1. Start Solr: java -jar start.jar 2. Index your data: java -jar post.jar *.xml 3. Search: http://localhost:8983/solr
  • 19. Solr APIs • HTTP GET/POST • JSON/XML • Clients – SolrJ (embedded or HTTP) – solr-ruby – python, PHP, solrsharp
  • 20. Solr – schema.xml • Types with index and query Analyzers: similar to data types • Fields with name, type and options • Unique Key: unique identifier of a document, e.g. “id” • Dynamic Fields: allow Solr to index fields that you did not explicitly define in your schema, e.g. field names matching *_i or *_txts • Copy Fields: a mechanism for making copies of fields so that you can apply several distinct field types to a single piece of incoming information. Field ‘a’ populates field ‘b’ with its value before tokenizing, so ‘b’ can use a different analyzer/filter.
  • 21. Solr – Content Analysis • Field Attributes • Name: name of the field • Type: data type (FieldType) of the field • Indexed: should it be indexed (indexed="true/false") • Stored: should it be stored (stored="true/false") • Required: is it a mandatory field (required="true/false") • Multi-Valued: can it contain multiple values, e.g. text: pizza, food (multiValued="true/false") • e.g. <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
  • 22. Solr – solrconfig.xml • Data dir: where all index data will be stored • Index configuration • Cache configurations • Request Handler configuration • Search components, response writers, query parsers
  • 23. Query Types • Single and multi-term queries • e.g. fieldname:value or title: software engineer • +, -, AND, OR, NOT operators • e.g. title: (software AND engineer) • Range queries on date or numeric fields • e.g. timestamp: [ * TO NOW ] or price: [ 1 TO 100 ] • Boost queries • e.g. title:Engineer ^1.5 OR text:Engineer • Fuzzy search: a search for words that are similar in spelling • e.g. roam~0.8 => noam • Proximity search: a sloppy phrase query. The closer together the two terms appear, the higher the score. • e.g. “apache lucene”~20 will look for all documents where the word “apache” occurs within 20 words of “lucene”
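Since Solr is queried over HTTP GET/POST (slide 19), the query strings on slide 23 end up URL-encoded in the `q` parameter of a request. The sketch below only builds such URLs and does not contact a server. The base URL comes from slide 18; the core name `collection1`, the `/select` handler path, and the helper name `build_select_url` are assumptions for this example and depend on how a given Solr instance is configured.

```python
from urllib.parse import urlencode

# Base URL from the "How to start" slide; the core name is an assumption
SOLR_CORE_URL = "http://localhost:8983/solr/collection1"


def build_select_url(query, rows=10, base=SOLR_CORE_URL):
    """Build a search URL with the query URL-encoded (hypothetical helper)."""
    params = {
        "q": query,     # the Lucene/Solr query string
        "rows": rows,   # maximum number of documents to return
        "wt": "json",   # ask for a JSON response
    }
    return base + "/select?" + urlencode(params)


# Query strings modelled on the slide's examples
queries = [
    "title:(software AND engineer)",    # boolean operators
    "price:[1 TO 100]",                 # range query
    "title:Engineer^1.5 OR text:Engineer",  # query-time boost
    '"apache lucene"~20',               # proximity search
]

for q in queries:
    print(build_select_url(q))
```

Encoding matters because characters such as spaces, quotes, brackets and `^` are not safe to paste into a URL unescaped.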
  • 24. Solr/Lucene Use-cases • Search • Analytics • NoSQL datastore • Auto-suggestion / Auto-correction • Recommendation Engine (MoreLikeThis) • Relevancy Engine (Feedback to other applications) • Solr as a White-List • GeoSpatial based Search
  • 25. Search • Application – Eclipse, Hibernate search • E-Commerce – Flipkart.com, Infibeam.com, Buy.com, Netflix.com, ebay.com • Jobs – Indeed.com, Simplyhired.com, Naukri.com • Auto – AOL.com • Travel – Cleartrip.com • Social Network – Twitter.com, LinkedIn.com, mylife.com • Source: http://www.quora.com/Which-major-companies-are-using-Solr-for-search
  • 26. Search (Contd.) • Search Engine – Yandex.ru, DuckDuckGo.com • Newspaper – Guardian.co.uk • Music/Movies – Apple.com, Netflix.com • Events – Stubhub.com, Eventbrite.com • Cloud Log Management – Loggly.com • Others – Whitehouse.gov
  • 27. Faceting • Grouping results based on field value • Facet on: field terms, queries, date ranges • &facet=on&facet.field=job_title&facet.query=salary:[30000 TO 100000] • http://wiki.apache.org/solr/SimpleFacetParameters • Source: www.career9.com, www.indeed.com
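What the facet parameters on slide 27 ask Solr to compute can be imitated over an in-memory list. This sketch is not Solr code: `facet_field` mirrors the idea behind `facet.field` (a count per distinct field value) and `facet_query` mirrors `facet.query` with an inclusive range such as `salary:[30000 TO 100000]`. Both function names and the sample job records are invented for the illustration.

```python
from collections import Counter


def facet_field(docs, field):
    """Count matching documents per distinct value of `field`."""
    return Counter(doc[field] for doc in docs if field in doc)


def facet_query(docs, field, low, high):
    """Count documents whose `field` lies in the inclusive range [low, high]."""
    return sum(1 for doc in docs if field in doc and low <= doc[field] <= high)


# Pretend these are the documents matched by the main query (illustrative)
jobs = [
    {"job_title": "Software Engineer", "salary": 90000},
    {"job_title": "Software Engineer", "salary": 120000},
    {"job_title": "Data Analyst", "salary": 45000},
    {"job_title": "Data Analyst", "salary": 25000},
    {"job_title": "Architect", "salary": 100000},
]

print(facet_field(jobs, "job_title"))
print(facet_query(jobs, "salary", 30000, 100000))
```

The per-value counts are what job sites typically render as clickable filters next to the result list.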
  • 28. Analytics • Analytics source: Kibana.org, based on Elasticsearch and Logstash • Image source: http://semicomplete.com/presentations/logstash-monitorama-2013/#/8
  • 30. Integration • Clustering (Solr-Carrot2) • Named Entity extraction (Solr-UIMA) • SolrCloud (Solr-Zookeeper) • Parsing of many different file formats (Solr-Tika) • Machine Learning/Data Mining (Apache Mahout) • Large scale indexing (Hadoop)
  • 31. References • http://en.wikipedia.org/wiki/Tf%E2%80%93idf • http://lucene.apache.org/core/4_5_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html • http://www.quora.com/Which-major-companies-are-using-Solr-for-search • http://marc.info/?l=solr-user&m=137271228610366&w=2 • http://java.dzone.com/articles/apache-solr-get-started-get
  • 32. Solr/Lucene Meetup • Building Big Data Analytics Platforms using Elasticsearch (Kibana) • Saturday, April 19, 2014 10:00 AM • IIIT Hyderabad • URL: http://www.meetup.com/Hyderabad-Apache-Solr-Lucene-Group/events/150134392/ OR • Search on Google …
  • 33. Thanks! @rahuldausa on Twitter and SlideShare • http://www.linkedin.com/in/rahuldausa • Find this interesting? Join us @ http://www.meetup.com/Hyderabad-Apache-Solr-Lucene-Group/