ADVANCED SEARCH 
WITH 
SOLR + DJANGO-HAYSTACK 
MARCEL CHASTAIN 
LA DJANGO – 2014-09-30
WHAT WE’LL COVER 
1. THE PITCH: 
The Problem With Search 
The Solution(s) 
Overall Architecture of System with Django/Solr/Haystack 
2. THE GOOD STUFF: 
Indexing Data for Search 
Querying the Search Index 
Advanced Search Methods 
Resources
THE PITCH 
OR, “WHY ANY OF THIS MATTERS”
THE PROBLEM 
1. Sites with stored information are 
ONLY as useful as they are at 
retrieving and displaying that 
information
THE PROBLEM 
2. Users have high expectations of 
search (thanks, Google)
• Spelling Suggestions:
• Hit Highlighting:
• “Related Searches” 
• Distance/GeoSpatial Search
• Faceting:
THE PROBLEM 
3. Good search involves lots of 
challenges
• Stemming: 
User Searches For                Word “Stem”
“argue”, “argues”, “argued”      “argu”
“argument”, “arguments”          “argument”
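The stemming example above can be approximated with simple suffix stripping. This is only a rough sketch: `toy_stem` and its suffix list are hypothetical names invented for illustration, and the stemmers that ship with Solr are considerably more careful.

```python
# Toy suffix-stripping stemmer (illustrative only; not Solr's algorithm).
SUFFIXES = ("es", "ed", "s", "e")  # assumed, deliberately minimal suffix list


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


if __name__ == "__main__":
    for term in ("argue", "argues", "argued", "argument", "arguments"):
        print(term, "->", toy_stem(term))
```

Because the query and the indexed text pass through the same stemmer, a search for "argued" can match a document that only contains "argues".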
And more..! 
• Synonyms 
• Acronyms 
• Non-ASCII characters 
• Stop words (“and”, “to”, “a”) 
• Calculating relevance 
• Performance with millions/billions(!) of documents
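Two of the challenges listed above, non-ASCII characters and stop words, are usually handled by an analysis step that runs on both the indexed text and the query. A minimal sketch in plain Python follows; the `analyze` function is a hypothetical stand-in for that step, not how Solr implements it, and the stop-word set contains only the three words named on the slide.

```python
import re
import unicodedata

STOP_WORDS = {"and", "to", "a"}  # the three stop words named on the slide


def analyze(text):
    """Lowercase, strip accents, split into tokens, and drop stop words."""
    # NFKD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    unaccented = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Anything outside a-z and 0-9 acts as a separator in this sketch,
    # including letters that have no decomposition.
    tokens = re.findall(r"[a-z0-9]+", unaccented.lower())
    return [token for token in tokens if token not in STOP_WORDS]


if __name__ == "__main__":
    print(analyze("A trip to Café Zoë and back"))
```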
THE SOLUTION 
“Information Retrieval Systems” 
a.k.a Search Engines
SOLR 
THE BACKEND
WHAT IS SOLR? 
Open-source enterprise search 
Java-based 
Created in 2004 
Built on Apache Lucene 
Most popular enterprise search engine 
Apache 2.0 License 
Built for millions or billions of documents
WHAT DOES IT DO? 
• Full-text search 
• Hit highlighting 
• Faceted search 
• Clustering/replication/sharding 
• Database integration 
• Rich document (word, pdf, etc) handling 
• Geospatial search 
• Spelling corrections/suggestions 
• … loads and loads more
WHO USES SOLR?
HOW CAN WE USE IT 
WITH DJANGO? 
Haystack 
From the homepage: 
(http://haystacksearch.org/)
LOOK FAMILIAR? 
Query style 
Declarative search index definitions
THE GOOD 
STUFF 
INSTALLING, CONFIGURING & USING 
SOLR/HAYSTACK
WHO DOES WHAT 
Solr: 
• Provides API for submitting to & querying from index 
• Stores actual index data 
• Manages fields/data types in xml config (‘schema.xml’) 
Haystack: 
• Manages connection(s) to solr 
• Provides familiar API for querying 
• Uses templates and declarative search index definitions 
• Helps generate solr xml config 
• Management commands to index content 
• Generic views/forms for common search use-cases 
• Hooks into signals to keep data up-to-date
PART 1: 
LET’S MAKE AN INDEX
0. GITHUB REPO 
git clone https://github.com/marcelchastain/haystackdemo
1. SETUP SOLR 
(from github repo root) 
./solr_download.sh 
(or, manually) 
wget http://apache.mirrors.pair.com/lucene/solr/4.10.1/solr-4.10.1.tgz 
tar -xzvf solr-4.10.1.tgz 
ln -s ./solr-4.10.1 ./solr 
The one file to care about: 
• solr/example/solr/collection1/conf/schema.xml 
Stores field definitions and data types. Frequently updated during 
development
2. RUN SOLR 
(from github repo root) 
./solr_start.sh 
(or, manually) 
cd solr/example && java -jar start.jar 
Requires java 1.7+. To install on debian/ubuntu: 
sudo apt-get install openjdk-7-jre-headless
3. INSTALL HAYSTACK 
(CWD haystackdemo/) 
apt-get install python-pip python-virtualenv 
virtualenv env && source env/bin/activate 
(from github repo root) 
pip install -r requirements.txt 
(or, manually) 
pip install Django==1.6.7 django-haystack
4. HAYSTACK SETTINGS 
INSTALLED_APPS = [
    # 'django.contrib.admin', etc.
    'haystack',
    # then your usual apps
    'myapp',
]
HAYSTACK_CONNECTIONS = {
    'default': {
        'ENGINE': 'haystack.backends.solr_backend.SolrEngine',
        'URL': 'http://127.0.0.1:8983/solr',
    },
}
HAYSTACK_SIGNAL_PROCESSOR = 'haystack.signals.RealtimeSignalProcessor'
5. THE MODEL(S)
6. SYNCDB & INITIAL DATA 
(CWD haystackdemo/demo/) 
./manage.py syncdb 
./manage.py loaddata restaurants
7. DEFINE SEARCH INDEX 
myapp/search_indexes.py
7.5 BOOSTING FIELD 
RELEVANCE 
Some fields are simply more relevant! 
(Note: changes to field boosts require reindex)
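To see why a boost changes ranking, here is a toy scoring function in plain Python. It is a sketch under stated assumptions: the `score` function, the boost values, and the sample documents are all hypothetical, and Solr's real relevance formula is far more involved. In Haystack itself the boost is set with the `boost` argument on an index field.

```python
FIELD_BOOSTS = {"title": 2.0, "body": 1.0}  # hypothetical boost values


def score(document, term, boosts=FIELD_BOOSTS):
    """Sum per-field term counts, each multiplied by that field's boost."""
    term = term.lower()
    total = 0.0
    for field, boost in boosts.items():
        words = document.get(field, "").lower().split()
        total += boost * words.count(term)
    return total


# Hypothetical documents: one mentions "pizza" in the body, one in the title.
docs = [
    {"title": "Salad bar", "body": "Also serves pizza"},
    {"title": "Pizza place", "body": "Great pasta"},
]
ranked = sorted(docs, key=lambda doc: score(doc, "pizza"), reverse=True)

if __name__ == "__main__":
    for doc in ranked:
        print(score(doc, "pizza"), doc["title"])
```

Both documents mention the term once, but the title match counts double, so "Pizza place" ranks first.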
8. CREATE A TEMPLATE 
FOR INDEXED TEXT 
templates/search/indexes/myapp/note_text.txt
9. UPDATE SOLR SCHEMA 
(CWD: haystackdemo/demo/) 
./manage.py build_solr_schema > ../solr/example/solr/collection1/conf/schema.xml 
Which adds: 
*Restart Solr for the changes to take effect
10. REBUILD INDEX 
(CWD haystackdemo/demo/) 
$ ./manage.py update_index 
Indexing 6 notes
PART 2: 
LET’S GET TO QUERYIN’
SIMPLE 
SEARCHQUERYSETS
GREAT, WHAT ABOUT 
FROM A BROWSER?
EASY MODE 
Full-document search 
urls.py 
templates/search/search.html
HAYSTACK COMPONENTS TO 
EXTEND 
• haystack.forms.SearchForm 
django form with extendable .search() method. Define additional 
fields on the form, then incorporate them in the .search() 
method’s logic 
• haystack.views.SearchView 
Class-based view made to be flexible for common search cases
PART 3: FEATURES
HIT HIGHLIGHTING 
Instead of referring to a context variable directly, use the {% highlight %} tag
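Conceptually, hit highlighting wraps each matched query word in markup. The plain-Python sketch below shows the idea; the `highlight` function is hypothetical and is not the code behind the template tag, and the `span` element with a `highlighted` class is assumed here to resemble the tag's default output.

```python
import re


def highlight(text, query, css_class="highlighted"):
    """Wrap each case-insensitive occurrence of a query word in a <span>."""
    words = [re.escape(word) for word in query.split()]
    if not words:
        return text
    pattern = re.compile(r"\b(" + "|".join(words) + r")\b", re.IGNORECASE)
    # Note: real templates must also HTML-escape `text`; omitted for brevity.
    return pattern.sub(
        lambda match: '<span class="%s">%s</span>' % (css_class, match.group(0)),
        text,
    )


if __name__ == "__main__":
    print(highlight("Solr powers search", "solr"))
```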
SPELLING 
SUGGESTIONS 
Update connection’s settings dictionary + reindex 
Use spelling_suggestion() method
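A sketch of the settings change, assuming the connection from step 4 is reused unchanged apart from the added `INCLUDE_SPELLING` key described in the Haystack documentation. Solr's own spellcheck configuration may also need attention and is not shown here.

```python
# settings.py (sketch): the connection from step 4 with spelling enabled.
HAYSTACK_CONNECTIONS = {
    "default": {
        "ENGINE": "haystack.backends.solr_backend.SolrEngine",
        "URL": "http://127.0.0.1:8983/solr",
        # Ask Haystack to index spelling data; takes effect after a reindex.
        "INCLUDE_SPELLING": True,
    },
}
```

After reindexing, calling `spelling_suggestion()` on a SearchQuerySet should return the suggested query, if the backend has one.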
AUTOCOMPLETE 
Create another search index field using EdgeNgramField + reindex 
Use the .autocomplete() method on a SearchQuerySet
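An edge n-gram field works by indexing every prefix of each word, so a partial query such as "hay" matches directly. The plain-Python sketch below illustrates the idea only: `edge_ngrams` and `build_prefix_index` are hypothetical helpers, and the minimum and maximum gram sizes are assumptions rather than confirmed defaults.

```python
def edge_ngrams(word, min_size=2, max_size=15):
    """Return the prefixes of `word` between min_size and max_size characters."""
    word = word.lower()
    longest = min(len(word), max_size)
    return [word[:size] for size in range(min_size, longest + 1)]


def build_prefix_index(titles):
    """Map each edge n-gram to the titles containing a word with that prefix."""
    index = {}
    for title in titles:
        for word in title.split():
            for gram in edge_ngrams(word):
                index.setdefault(gram, set()).add(title)
    return index


# Hypothetical titles to autocomplete against.
index = build_prefix_index(["Solr in Action", "Django Haystack"])
matches = sorted(index.get("hay", set()))

if __name__ == "__main__":
    print(edge_ngrams("solr"))
    print(matches)
```

The cost is a larger index, since every word is stored once per prefix.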
FACETING 
Add faceting to search index definition 
Regenerate schema.xml and reindex content 
./manage.py build_solr_schema > ../solr/example/solr/collection1/conf/schema.xml 
./manage.py update_index
FACETING 
From a shell:
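What a facet count represents can be shown in plain Python: for one field, count how many results carry each value. This is a conceptual sketch, not the shell session from the slide; `count_facet` and the sample results are hypothetical. In Haystack the equivalent calls are `facet()` and `facet_counts()` on a SearchQuerySet.

```python
from collections import Counter


def count_facet(results, field):
    """Count how many results carry each value of `field`, most common first."""
    counts = Counter(result[field] for result in results if field in result)
    return counts.most_common()


# Hypothetical search results with a facetable "cuisine" field.
results = [
    {"name": "Luigi's", "cuisine": "italian"},
    {"name": "Taco Stop", "cuisine": "mexican"},
    {"name": "Roma", "cuisine": "italian"},
]

if __name__ == "__main__":
    print(count_facet(results, "cuisine"))
```

Each (value, count) pair can then be rendered as a link that narrows the search to that value.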
RESOURCES 
LET’S SAVE YOU A GOOGLE TRIP
RESOURCES 
Solr in Action ($45) 
Apr 2014 
Haystack Documentation 
http://django-haystack.readthedocs.org/ 
IRC (freenode): 
#django 
#haystack 
#solr
