What is the best full text search engine for Python?
Andrii Soldatenko
23-24 April 2016
@a_soldatenko
Agenda:
• Who am I?
• What is full text search?
• PostgreSQL FTS / Elastic / Whoosh
• Pros and Cons
• What’s next?
Andrii Soldatenko
• Backend Python Developer at
• CTO in Persollo.com
• Speaker at many PyCons and Python meetups
• blogger at https://asoldatenko.com
Preface
Text Search
➜ cpython time ack OrderedDict
ack OrderedDict 2.53s user 0.22s system 94% cpu 2.915 total
➜ cpython time pt OrderedDict
pt OrderedDict 0.14s user 0.12s system 406% cpu 0.064 total
➜ cpython time pss OrderedDict
pss OrderedDict 1.08s user 0.14s system 88% cpu 1.370 total
➜ cpython time grep -r -i 'OrderedDict' .
grep -r -i 'OrderedDict' 2.70s user 0.13s system 94% cpu 2.998 total
Full text search
Search index
Simple sentences
1. The quick brown fox jumped over the lazy dog
2. Quick brown foxes leap over lazy dogs in summer
Inverted index
Inverted index: normalization
Before normalization:
Term      Doc_1   Doc_2
-------------------------
Quick   |       |  X
The     |   X   |
brown   |   X   |  X
dog     |   X   |
dogs    |       |  X
fox     |   X   |
foxes   |       |  X
in      |       |  X
jumped  |   X   |
lazy    |   X   |  X
leap    |       |  X
over    |   X   |  X
quick   |   X   |
summer  |       |  X
the     |   X   |
-------------------------
After normalization (lowercasing, stemming, synonyms):
Term      Doc_1   Doc_2
-------------------------
brown   |   X   |  X
dog     |   X   |  X
fox     |   X   |  X
in      |       |  X
jump    |   X   |  X
lazy    |   X   |  X
over    |   X   |  X
quick   |   X   |  X
summer  |       |  X
the     |   X   |
-------------------------
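A minimal Python sketch of how a normalized inverted index like the one above could be built from the two sample sentences. The `STEMS` and `SYNONYMS` maps are hypothetical toy rules written only for these sentences; a real engine would use a proper stemmer and a synonym dictionary.

```python
import re
from collections import defaultdict

# Toy normalization rules (assumptions for this example only).
STEMS = {"foxes": "fox", "dogs": "dog", "jumped": "jump"}
SYNONYMS = {"leap": "jump"}


def normalize(token):
    """Lowercase a token, then apply the toy stem and synonym maps."""
    token = token.lower()
    token = STEMS.get(token, token)
    return SYNONYMS.get(token, token)


def build_inverted_index(docs):
    """Map each normalized term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in re.findall(r"\w+", text):
            index[normalize(token)].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every term of the query."""
    postings = [index.get(normalize(t), set()) for t in query.split()]
    return set.intersection(*postings) if postings else set()


docs = {
    1: "The quick brown fox jumped over the lazy dog",
    2: "Quick brown foxes leap over lazy dogs in summer",
}
index = build_inverted_index(docs)
print(sorted(index))                  # the normalized terms
print(search(index, "Quick FOXES"))   # documents containing both terms
```

Because the query goes through the same `normalize` step as the documents, "Quick FOXES" matches both sentences even though neither contains that exact spelling.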
Search Engines
PostgreSQL
Full Text Search
support from version 8.3
PostgreSQL
Full Text Search
SELECT to_tsvector('text') @@ to_tsquery('query');
Simple is better than complex. - by import this
Do PostgreSQL FTS without index
SELECT 'python conference ukraine 2016'::tsvector @@ 'python & ukraine'::tsquery;
 ?column?
----------
 t
(1 row)
Do PostgreSQL FTS with index
CREATE INDEX name ON table USING GIN (column);
CREATE INDEX name ON table USING GIST (column);
PostgreSQL FTS: Ranking Search Results
• ts_rank() -> float4: ranks documents based on the frequency of their matching lexemes
• ts_rank_cd() -> float4: computes the cover density ranking for the given document vector and query
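A hedged sketch of how a ranked query might be issued from Python through any DB-API cursor that uses the `%s` parameter style (psycopg2 does). The table `page` and its `description` column are hypothetical names; `ts_rank`, `to_tsvector`, `plainto_tsquery` and `@@` are the PostgreSQL functions and operator.

```python
# Hypothetical schema: table `page` with columns `id` and `description`.
RANKED_SEARCH_SQL = """
SELECT id,
       ts_rank(to_tsvector('english', description),
               plainto_tsquery('english', %s)) AS rank
FROM page
WHERE to_tsvector('english', description) @@ plainto_tsquery('english', %s)
ORDER BY rank DESC
LIMIT %s
"""


def ranked_search(cursor, text, limit=10):
    """Run the ranked query on a DB-API cursor and return (id, rank) rows."""
    # The search text is passed as a parameter, never interpolated into SQL.
    cursor.execute(RANKED_SEARCH_SQL, (text, text, limit))
    return cursor.fetchall()
```

With psycopg2 this would be called as `ranked_search(connection.cursor(), "python ukraine")`. Swapping `ts_rank` for `ts_rank_cd` in the SQL switches to cover density ranking.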
PostgreSQL FTS
Highlighting Results
SELECT ts_headline('english',
'python conference ukraine 2016',
to_tsquery('python & 2016'));
ts_headline
----------------------------------------------
<b>python</b> conference ukraine <b>2016</b>
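For comparison, a rough pure-Python sketch of the same highlighting idea. It is not how `ts_headline` is implemented: it only matches whole words literally and does no stemming. The function name `headline` and its parameters are made up for this example.

```python
import re


def headline(text, terms, start="<b>", stop="</b>"):
    """Wrap whole-word, case-insensitive matches of `terms` in markers."""
    if not terms:
        return text
    pattern = r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b"
    return re.sub(
        pattern,
        lambda match: start + match.group(0) + stop,
        text,
        flags=re.IGNORECASE,
    )


print(headline("python conference ukraine 2016", ["python", "2016"]))
# → <b>python</b> conference ukraine <b>2016</b>
```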
Stop Words
postgresql/9.5.2/share/postgresql/tsearch_data/english.stop
PostgreSQL FTS
Stop Words
SELECT to_tsvector('in the list of stop words');
to_tsvector
----------------------------
'list':3 'stop':5 'word':6
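The output above keeps the original word positions (3, 5, 6) even though the stop words are dropped. A small Python sketch of that behaviour follows; the stop-word set and the `naive_stem` helper are illustrative assumptions, not PostgreSQL's dictionary or stemmer.

```python
import re

# A tiny stop-word set for illustration; PostgreSQL reads its list from english.stop.
STOP_WORDS = {"in", "the", "of", "a", "an", "and"}


def naive_stem(word):
    """Strip a trailing plural 's'. A crude stand-in for a real stemmer."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def to_vector(text):
    """Return {lexeme: [positions]}, dropping stop words but counting their positions."""
    vector = {}
    tokens = re.findall(r"\w+", text.lower())
    for position, token in enumerate(tokens, start=1):
        if token in STOP_WORDS:
            continue
        vector.setdefault(naive_stem(token), []).append(position)
    return vector


print(to_vector("in the list of stop words"))
# → {'list': [3], 'stop': [5], 'word': [6]}
```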
PG FTS and Python
• Django 1.10 django.contrib.postgres.search (36 hours ago)
• djorm-ext-pgfulltext
• sqlalchemy
PostgreSQL FTS integration with Django ORM
https://github.com/linuxlewis/djorm-ext-pgfulltext
from djorm_pgfulltext.models import SearchManager
from djorm_pgfulltext.fields import VectorField
from django.db import models


class Page(models.Model):
    name = models.CharField(max_length=200)
    description = models.TextField()

    search_index = VectorField()

    objects = SearchManager(
        fields=('name', 'description'),
        config='pg_catalog.english',  # this is default
        search_field='search_index',  # this is default
        auto_update_search_field=True
    )
To search, use the search method of the manager
https://github.com/linuxlewis/djorm-ext-pgfulltext
>>> Page.objects.search("documentation & about")
[<Page: Page: Home page>]
>>> Page.objects.search("about | documentation | django | home", raw=True)
[<Page: Page: Home page>, <Page: Page: About>, <Page: Page: Navigation>]
Django 1.10
>>> Entry.objects.filter(body_text__search='recipe')
[<Entry: Cheese on Toast recipes>, <Entry: Pizza recipes>]
>>> from django.contrib.postgres.search import SearchVector
>>> Entry.objects.annotate(
...     search=SearchVector('blog__tagline', 'body_text'),
... ).filter(search='cheese')
[
    <Entry: Cheese on Toast recipes>,
    <Entry: Pizza Recipes>,
    <Entry: Dairy farming in Argentina>,
]
https://github.com/django/django/commit/2d877da
Pros and Cons
Pros:
• Quick implementation
• No additional dependencies
Cons:
• Indexes must be managed manually
• Depends on PostgreSQL
• No analytics data
• No query DSL, only `&` and `|` queries
• Difficult to manage stop words
ElasticSearch
Who uses ElasticSearch?
ElasticSearch:
Quick Intro
Relational DB | Databases | Tables | Rows      | Columns
ElasticSearch | Indices   | Types  | Documents | Fields
ElasticSearch:
Locks
•Pessimistic concurrency control
•Optimistic concurrency control
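The slide only names the two strategies. Elasticsearch takes the optimistic approach, keeping a version number per document. The following is a toy in-memory sketch of that idea, not Elasticsearch code; `VersionedStore` and `VersionConflict` are hypothetical names.

```python
class VersionConflict(Exception):
    """Raised when a document changed since the caller last read it."""


class VersionedStore:
    """In-memory store keeping a version number next to each document."""

    def __init__(self):
        self._docs = {}  # doc_id -> (version, body)

    def get(self, doc_id):
        """Return the (version, body) pair for a document."""
        return self._docs[doc_id]

    def put(self, doc_id, body, expected_version=None):
        """Write a document; fail if it no longer has the expected version."""
        current_version = self._docs.get(doc_id, (0, None))[0]
        if expected_version is not None and expected_version != current_version:
            raise VersionConflict(
                f"expected version {expected_version}, found {current_version}"
            )
        self._docs[doc_id] = (current_version + 1, body)
        return current_version + 1


store = VersionedStore()
store.put("talk-1", {"user": "andrii"})             # creates version 1
version, body = store.get("talk-1")
store.put("talk-1", {"user": "Andrii"}, version)    # succeeds, version 2
# A second writer still holding version 1 would now get VersionConflict.
```

No lock is held between the read and the write; a conflicting writer simply fails and can re-read and retry, which is the trade-off that distinguishes the optimistic strategy from the pessimistic one.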
ElasticSearch and
Python
• elasticsearch-py
• elasticsearch-dsl-py by Honza Kral
• elasticsearch-py-async by Honza Kral
ElasticSearch:
FTS
$ curl -XGET 'http://localhost:9200/pyconua/talk/_search' -d '
{
    "query": {
        "match": {
            "user": "Andrii"
        }
    }
}'
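The same `match` query can be prepared from Python with only the standard library. This is a sketch that builds the request without sending it; `build_match_request` is a hypothetical helper, and the index and type names come from the curl example. It uses POST, which Elasticsearch accepts for search bodies as well as GET.

```python
import json
import urllib.request


def build_match_request(base_url, index, doc_type, field, text):
    """Build (but do not send) a search request with a `match` query."""
    body = {"query": {"match": {field: text}}}
    return urllib.request.Request(
        url=f"{base_url}/{index}/{doc_type}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_match_request(
    "http://localhost:9200", "pyconua", "talk", "user", "Andrii"
)

# To execute it against a running node:
# with urllib.request.urlopen(request) as response:
#     hits = json.load(response)["hits"]["hits"]
```

In practice the elasticsearch-py client listed earlier wraps this plumbing; the raw version is shown only to make the wire format visible.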
ES: Create Index
$ curl -XPUT 'http://localhost:9200/twitter/' -d '{
    "settings" : {
        "index" : {
            "number_of_shards" : 3,
            "number_of_replicas" : 2
        }
    }
}'
ES: Add json to Index
$ curl -XPUT 'http://localhost:9200/pyconua/talk/1' -d '{
    "user" : "andrii",
    "description" : "Full text search"
}'
ES: Stopwords
$ curl -XPUT 'http://localhost:9200/pyconua' -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english": {
          "type": "english",
          "stopwords_path": "stopwords/english.txt"
        }
      }
    }
  }
}'
ES: Highlight
$ curl -XGET 'http://localhost:9200/pyconua/talk/_search' -d '{
    "query" : {...},
    "highlight" : {
        "pre_tags" : ["<tag1>"],
        "post_tags" : ["</tag1>"],
        "fields" : {
            "_all" : {}
        }
    }
}'
ES: Relevance
$ curl -XGET 'http://localhost:9200/_search?explain' -d '
{
    "query" : { "match" : { "user" : "andrii" }}
}'
"_explanation": {
  "description": "weight(tweet:honeymoon in 0) [PerFieldSimilarity], result of:",
  "value": 0.076713204,
  "details": [...]
}
Whoosh
• Pure-Python
• Whoosh was created by Matt Chaput.
• Pluggable scoring algorithm (including BM25F)
• more info at video from PyCon US 2013
Whoosh: Stop words
import os
import textwrap

# Print each file in ./stopwords as a Python frozenset literal.
for name in os.listdir("stopwords"):
    with open(os.path.join("stopwords", name)) as f:
        words = " ".join(line.strip() for line in f)
    print('"%s": frozenset(u"""' % name)
    print(textwrap.fill(words, 72))
    print('""".split())')
http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/src/backend/snowball/stopwords/
Whoosh: Highlight
results = pycon.search(myquery)
for hit in results:
    print(hit["title"])
    # Assume "content" field is stored
    print(hit.highlights("content"))
Whoosh: Ranking search results
• Pluggable scoring algorithm
• including BM25F
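To make the scoring idea concrete, here is a compact sketch of plain Okapi BM25 in Python. It is not Whoosh's implementation, and it is the single-field formula rather than BM25F, which additionally weights fields. The parameter values `k1=1.2` and `b=0.75` are common textbook defaults, and the function assumes a non-empty list of already tokenized documents.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(doc) for doc in docs) / n_docs
    # Document frequency: in how many documents each query term occurs.
    df = {term: sum(1 for doc in docs if term in doc) for term in query_terms}
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: long documents are penalized via b.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    ["python", "search", "python"],
    ["elastic", "search"],
    ["whoosh"],
]
print(bm25_scores(["python"], docs))
print(bm25_scores(["search"], docs))
```

For the query "search", the shorter second document scores higher than the first even though both contain the term once, which shows the effect of length normalization.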
Haystack
Adding search functionality to Simple Model
$ cat myapp/models.py
from django.db import models
from django.contrib.auth.models import User


class Page(models.Model):
    user = models.ForeignKey(User)
    name = models.CharField(max_length=200)
    description = models.TextField()

    def __unicode__(self):
        return self.name
Haystack: Installation
$ pip install django-haystack
$ cat settings.py
INSTALLED_APPS = [
    'django.contrib.admin',
    'django.contrib.auth',
    'django.contrib.contenttypes',
    'django.contrib.sessions',
    'django.contrib.sites',
    # Added.
    'haystack',
    # Then your usual apps...
    'blog',
]
Haystack: Installation
$ pip install elasticsearch
$ cat settings.py
...
HAYSTACK_CONNECTIONS = {
    'default': {
        'ENGINE': 'haystack.backends.elasticsearch_backend.ElasticsearchSearchEngine',
        'URL': 'http://127.0.0.1:9200/',
        'INDEX_NAME': 'haystack',
    },
}
...
Haystack:
Creating SearchIndexes
$ cat myapp/search_indexes.py
import datetime

from haystack import indexes
from myapp.models import Note


# Assumes a Note model with `user` and `pub_date` attributes.
class PageIndex(indexes.SearchIndex, indexes.Indexable):
    text = indexes.CharField(document=True, use_template=True)
    author = indexes.CharField(model_attr='user')
    pub_date = indexes.DateTimeField(model_attr='pub_date')

    def get_model(self):
        return Note

    def index_queryset(self, using=None):
        """Used when the entire index for model is updated."""
        return self.get_model().objects.filter(
            pub_date__lte=datetime.datetime.now())
Haystack:
SearchQuerySet API
from haystack.query import SearchQuerySet
from haystack.inputs import Raw

all_results = SearchQuerySet().all()
hello_results = SearchQuerySet().filter(content='hello')
unfriendly_results = (SearchQuerySet()
                      .exclude(content='hello')
                      .filter(content='world'))

# To send unescaped data:
sqs = SearchQuerySet().filter(title=Raw(trusted_query))
Keeping data in sync
# Update everything.
./manage.py update_index --settings=settings.prod
# Update everything with lots of information about what's going on.
./manage.py update_index --settings=settings.prod --verbosity=2
# Update everything, cleaning up after deleted models.
./manage.py update_index --remove --settings=settings.prod
# Update everything changed in the last 2 hours.
./manage.py update_index --age=2 --settings=settings.prod
# Update everything between Dec. 1, 2011 & Dec 31, 2011
./manage.py update_index --start='2011-12-01T00:00:00' --end='2011-12-31T23:59:59' --settings=settings.prod
Signals
from django.db import models
from haystack.signals import BaseSignalProcessor


class RealtimeSignalProcessor(BaseSignalProcessor):
    """
    Allows for observing when saves/deletes fire & automatically updates the
    search engine appropriately.
    """
    def setup(self):
        # Naive (listen to all model saves).
        models.signals.post_save.connect(self.handle_save)
        models.signals.post_delete.connect(self.handle_delete)
        # Efficient would be going through all backends & collecting all models
        # being used, then hooking up signals only for those.

    def teardown(self):
        # Naive (listen to all model saves).
        models.signals.post_save.disconnect(self.handle_save)
        models.signals.post_delete.disconnect(self.handle_delete)
        # Efficient would be going through all backends & collecting all models
        # being used, then disconnecting signals only for those.
Haystack:
Pros and Cons
Pros:
• easy to set up
• looks like Django ORM but for searches
• search engine independent
• supports 4 engines (Elastic, Solr, Xapian, Whoosh)
Cons:
• poor SearchQuerySet API
• difficult to manage stop words
• loses performance because of the extra layer
• model-based
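Results
Engine        | Python clients                                                 | Python 3 | Django support
---------------------------------------------------------------------------------------------------------------------------------
ElasticSearch | elasticsearch-py, elasticsearch-dsl-py, elasticsearch-py-async | YES      | haystack + elasticstack
PostgreSQL    | psycopg2                                                       | YES      | djorm-ext-pgfulltext, django.contrib.postgres
Whoosh        | Whoosh                                                         | YES      | support using haystack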
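Results
Engine        | Indexes      | Without indexes
----------------------------------------------
ElasticSearch | PUT /index/  | No support
PostgreSQL    | GIN/GIST     | to_tsvector()
Whoosh        | index folder | No support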
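Results
Engine        | Ranking / relevance | Configure stopwords | Highlight search results
------------------------------------------------------------------------------------
ElasticSearch | TF/IDF              | YES                 | YES
PostgreSQL    | cd_rank             | YES                 | YES
Whoosh        | Okapi BM25          | YES                 | YES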
Results
Engine        | Synonyms   | Scale
-----------------------------------------
ElasticSearch | YES        | YES
PostgreSQL    | NO SUPPORT | I’m not sure
Whoosh        | NO SUPPORT | NO
Final Thoughts
Questions?
Thank You
andrii.soldatenko@toptal.com
@a_soldatenko
https://asoldatenko.com
We are hiring
https://www.toptal.com/#connect-fantastic-computer-engineers

PyCon UA 2016
