The document discusses full text search in Python. It begins with an introduction to the speaker and covers information explosion and text search tools like grep. It then explains search indexes and inverted indexes using examples. The document discusses normalization in indexes and search in databases like PostgreSQL. It describes operators for textual data types in PostgreSQL for matching strings and regular expressions.
1. Dive into full text search with Python
Andrii Soldatenko
18-19 September 2015
@a_soldatenko
2. About me:
• Lead QA Automation Engineer at
• Backend Python Developer at
• Speaker at PyCon Ukraine 2014
• Speaker at PyCon Belarus 2015
• @a_soldatenko
8. Simple sentences
1. The quick brown fox jumped over the lazy dog
2. Quick brown foxes leap over lazy dogs in summer
9. Inverted index
Term      Doc_1  Doc_2
-------------------------
Quick   |       |  X
The     |   X   |
brown   |   X   |  X
dog     |   X   |
dogs    |       |  X
fox     |   X   |
foxes   |       |  X
in      |       |  X
jumped  |   X   |
lazy    |   X   |  X
leap    |       |  X
over    |   X   |  X
quick   |   X   |
summer  |       |  X
the     |   X   |
-------------------------
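A minimal Python sketch of how such an index could be built from the two sentences, assuming naive whitespace tokenization and no normalization (the function name `build_inverted_index` is illustrative, not taken from the talk):

```python
from collections import defaultdict

# The two example sentences, keyed by document id.
docs = {
    1: "The quick brown fox jumped over the lazy dog",
    2: "Quick brown foxes leap over lazy dogs in summer",
}


def build_inverted_index(documents):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        # Naive whitespace tokenization: "Quick" and "quick" stay distinct terms.
        for term in text.split():
            index[term].add(doc_id)
    return index


index = build_inverted_index(docs)
for term in sorted(index):
    print(term, sorted(index[term]))
```

Because nothing is normalized here, `Quick` and `quick` end up as separate entries, just as in the table.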
10. Inverted index
Term      Doc_1  Doc_2
-------------------------
brown   |   X   |  X
quick   |   X   |
-------------------------
Total   |   2   |  1
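The Total row can be reproduced by counting, for each document, how many query terms it contains. The sketch below assumes the query is `quick brown` and uses a hand-written two-term index; `count_matches` is a hypothetical helper name:

```python
# Only the two index rows relevant to the query "quick brown".
index = {
    "brown": {1, 2},
    "quick": {1},
}


def count_matches(index, query):
    """Count, per document id, how many of the query terms it contains."""
    totals = {}
    for term in query.split():
        for doc_id in index.get(term, set()):
            totals[doc_id] = totals.get(doc_id, 0) + 1
    return totals


totals = count_matches(index, "quick brown")
print(totals)
```

Both documents match, but the first one matches more terms, which is a very simple notion of relevance.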
11. Inverted index: normalization
Term      Doc_1  Doc_2
-------------------------
brown   |   X   |  X
dog     |   X   |  X
fox     |   X   |  X
in      |       |  X
jump    |   X   |  X
lazy    |   X   |  X
over    |   X   |  X
quick   |   X   |  X
summer  |       |  X
the     |   X   |
-------------------------
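Normalization can be sketched in Python as lowercasing followed by a lookup that maps word variants to a common form. The `NORMAL_FORMS` mapping below is a hand-written assumption standing in for a real stemmer and synonym dictionary; it is not how any particular search engine does it:

```python
from collections import defaultdict

docs = {
    1: "The quick brown fox jumped over the lazy dog",
    2: "Quick brown foxes leap over lazy dogs in summer",
}

# Hypothetical mapping: plurals and tenses collapse to a root form,
# and "leap" is treated as a synonym of "jump".
NORMAL_FORMS = {"foxes": "fox", "dogs": "dog", "jumped": "jump", "leap": "jump"}


def normalize(token):
    """Lowercase the token, then map it to its normal form if one is known."""
    token = token.lower()
    return NORMAL_FORMS.get(token, token)


def build_normalized_index(documents):
    """Build an inverted index over normalized terms."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in text.split():
            index[normalize(token)].add(doc_id)
    return index


index = build_normalized_index(docs)
for term in sorted(index):
    print(term, sorted(index[term]))
```

The same `normalize` step has to be applied to query terms, otherwise a search for `Quick` would no longer find anything. Under this sketch `the` maps to Doc_1 only, since the second sentence does not contain it.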
16. Full text search in PostgreSQL
1. Creating tokens
2. Converting tokens into Lexemes
3. Storing preprocessed documents
17. Full text search in PostgreSQL
27 built-in configurations for 10 languages
Support of user-defined FTS configurations
Pluggable dictionaries, parsers
Inverted indexes
18. functions to convert normal text to tsvector
explain SELECT 'a fat cat sat on a mat and ate a fat rat'::tsvector @@ 'cat & rat'::tsquery;
                QUERY PLAN
------------------------------------------
 Result  (cost=0.00..0.01 rows=1 width=0)
(1 row)

explain SELECT 'fat & cow'::tsquery @@ 'a fat cat sat on a mat and ate a fat rat'::tsvector; -- false
                QUERY PLAN
------------------------------------------
 Result  (cost=0.00..0.01 rows=1 width=0)
(1 row)
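What the `@@` match operator decides in these two queries can be imitated in a few lines of Python. This is only a toy model under stated assumptions: the document is treated as a plain set of whitespace-separated words, and only `&` (AND) queries are handled. The helper names are hypothetical and this is not how PostgreSQL implements matching:

```python
def to_lexeme_set(text):
    """Stand-in for a ::tsvector cast: a set of whitespace-separated words."""
    return set(text.split())


def matches_all(document, query):
    """Evaluate a query whose terms are joined by '&' (AND only)."""
    lexemes = to_lexeme_set(document)
    terms = [term.strip() for term in query.split("&")]
    return all(term in lexemes for term in terms)


document = "a fat cat sat on a mat and ate a fat rat"
print(matches_all(document, "cat & rat"))
print(matches_all(document, "fat & cow"))
```

The first query matches because both `cat` and `rat` occur in the document; the second does not because `cow` is missing.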
19. PostgreSQL: index management
CREATE FUNCTION notes_vector_update() RETURNS TRIGGER AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        new.search_index = to_tsvector('pg_catalog.english', COALESCE(NEW.name, ''));
    END IF;
    IF TG_OP = 'UPDATE' THEN
        IF NEW.name <> OLD.name THEN
            new.search_index = to_tsvector('pg_catalog.english', COALESCE(NEW.name, ''));
        END IF;
    END IF;
    RETURN NEW;
END
$$ LANGUAGE 'plpgsql';
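The function above does nothing until it is attached to a table as a row-level `BEFORE` trigger, and the `search_index` column is usually given a GIN index. A hedged sketch of how that could be done from Python through any DB-API cursor (for example psycopg2); the table name `notes`, the trigger and index names, and the helper `install_search_index` are all assumptions, not taken from the slides:

```python
# Hypothetical table name `notes`; adjust to the table that owns search_index.
ATTACH_TRIGGER_SQL = """
CREATE TRIGGER notes_vector_update_trigger
BEFORE INSERT OR UPDATE ON notes
FOR EACH ROW EXECUTE PROCEDURE notes_vector_update();
"""

# A GIN index lets PostgreSQL answer @@ queries without scanning every row.
CREATE_INDEX_SQL = """
CREATE INDEX notes_search_index_gin ON notes USING gin(search_index);
"""


def install_search_index(cursor):
    """Run both statements through a DB-API 2.0 cursor."""
    cursor.execute(ATTACH_TRIGGER_SQL)
    cursor.execute(CREATE_INDEX_SQL)
```

The trigger has to fire `BEFORE` the write, because the function modifies `NEW` and returns it.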
20. PostgreSQL: stopwords
SELECT to_tsvector('english','in the list of stop words');
        to_tsvector
----------------------------
 'list':3 'stop':5 'word':6

/usr/pgsql-9.3/share/tsearch_data/english.stop
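The output shows two things: stop words are dropped but still consume a position, and the remaining words are stemmed (`words` becomes `word`). A rough Python imitation, assuming a three-word stop list and a crude "strip trailing s" rule in place of the Snowball stemmer that the `english` configuration uses; `STOP_WORDS`, `naive_stem`, and `to_vector` are illustrative names:

```python
# Tiny illustrative subset, not the contents of the real english.stop file.
STOP_WORDS = {"in", "the", "of"}


def naive_stem(word):
    """Strip a trailing 's' from longer words; a crude stand-in for a stemmer."""
    if word.endswith("s") and len(word) > 3:
        return word[:-1]
    return word


def to_vector(text):
    """Return {lexeme: [positions]} with 1-based positions, skipping stop words."""
    vector = {}
    for position, word in enumerate(text.lower().split(), start=1):
        if word in STOP_WORDS:
            continue  # dropped, but the position counter still advances
        vector.setdefault(naive_stem(word), []).append(position)
    return vector


vector = to_vector("in the list of stop words")
print(vector)
```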
22. Malcolm Tredinnick's Advice on Writing SQL in Django:
“If you need to write advanced SQL you should write it.
I would balance that by cautioning against
overuse of the raw() and extra() methods.”
23. PostgreSQL full-text search integration with django orm
https://github.com/linuxlewis/djorm-ext-pgfulltext
from djorm_pgfulltext.models import SearchManager
from djorm_pgfulltext.fields import VectorField
from django.db import models

class Page(models.Model):
    name = models.CharField(max_length=200)
    description = models.TextField()

    search_index = VectorField()

    objects = SearchManager(
        fields = ('name', 'description'),
        config = 'pg_catalog.english', # this is default
        search_field = 'search_index', # this is default
        auto_update_search_field = True
    )
24. For search just use search method of the manager
https://github.com/linuxlewis/djorm-ext-pgfulltext
>>> Page.objects.search("documentation & about")
[<Page: Page: Home page>]
>>> Page.objects.search("about | documentation | django | home", raw=True)
[<Page: Page: Home page>, <Page: Page: About>, <Page: Page: Navigation>]
25. Second way
class Page(models.Model):
    name = models.CharField(max_length=200)
    description = models.TextField()

    objects = SearchManager(fields=None, search_field=None)

>>> Page.objects.search("documentation & about", fields=('name', 'description'))
[<Page: Page: Home page>]
>>> Page.objects.search("about | documentation | django | home", raw=True, fields=('name', 'description'))
[<Page: Page: Home page>, <Page: Page: About>, <Page: Page: Navigation>]
26. Pros and Cons
Pros:
• Quick implementation
• No extra dependencies
Cons:
• Indexes must be managed manually
• Not as flexible as dedicated search engines
• Tied to PostgreSQL
• No analytics data
• No query DSL, only `&` and `|` queries
• Difficult to manage stop words
35. Adding search functionality to Simple Model
$ cat myapp/models.py
from django.db import models
from django.contrib.auth.models import User

class Page(models.Model):
    user = models.ForeignKey(User)
    name = models.CharField(max_length=200)
    description = models.TextField()

    def __unicode__(self):
        return self.name
38. Haystack: Creating SearchIndexes
$ cat myapp/search_indexes.py
import datetime
from haystack import indexes
from myapp.models import Page

class PageIndex(indexes.SearchIndex, indexes.Indexable):
    text = indexes.CharField(document=True, use_template=True)
    author = indexes.CharField(model_attr='user')
    # Assumes Page has a pub_date field; the model shown on slide 35 does not define one.
    pub_date = indexes.DateTimeField(model_attr='pub_date')

    def get_model(self):
        return Page

    def index_queryset(self, using=None):
        """Used when the entire index for model is updated."""
        return self.get_model().objects.filter(pub_date__lte=datetime.datetime.now())
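A search index class only takes effect once Haystack knows which backend to write to, which is configured in the Django settings. The fragment below is a minimal sketch that assumes the Whoosh backend and a hypothetical `whoosh_index` directory under the working directory; `haystack` also has to be listed in `INSTALLED_APPS`:

```python
import os

# Assumption: the project root is the current working directory.
BASE_DIR = os.path.abspath(".")

# Settings fragment: one default connection using the file-based Whoosh engine.
HAYSTACK_CONNECTIONS = {
    "default": {
        "ENGINE": "haystack.backends.whoosh_backend.WhooshEngine",
        "PATH": os.path.join(BASE_DIR, "whoosh_index"),  # hypothetical directory name
    },
}
```

Switching to another supported engine should only require changing this setting, which is the "search engine independent" property the talk attributes to Haystack.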
39. Haystack: SearchQuerySet API
from haystack.query import SearchQuerySet
from haystack.inputs import Raw

all_results = SearchQuerySet().all()
hello_results = SearchQuerySet().filter(content='hello')
unfriendly_results = SearchQuerySet().exclude(content='hello').filter(content='world')

# To send unescaped data:
sqs = SearchQuerySet().filter(title=Raw(trusted_query))
40. Keeping data in sync
# Update everything.
./manage.py update_index --settings=settings.prod

# Update everything with lots of information about what's going on.
./manage.py update_index --settings=settings.prod --verbosity=2

# Update everything, cleaning up after deleted models.
./manage.py update_index --remove --settings=settings.prod

# Update everything changed in the last 2 hours.
./manage.py update_index --age=2 --settings=settings.prod

# Update everything between Dec. 1, 2011 & Dec 31, 2011
./manage.py update_index --start='2011-12-01T00:00:00' --end='2011-12-31T23:59:59' --settings=settings.prod
41. Signals
from django.db import models
from haystack.signals import BaseSignalProcessor

class RealtimeSignalProcessor(BaseSignalProcessor):
    """
    Allows for observing when saves/deletes fire & automatically updates the
    search engine appropriately.
    """
    def setup(self):
        # Naive (listen to all model saves).
        models.signals.post_save.connect(self.handle_save)
        models.signals.post_delete.connect(self.handle_delete)
        # Efficient would be going through all backends & collecting all models
        # being used, then hooking up signals only for those.

    def teardown(self):
        # Naive (listen to all model saves).
        models.signals.post_save.disconnect(self.handle_save)
        models.signals.post_delete.disconnect(self.handle_delete)
        # Efficient would be going through all backends & collecting all models
        # being used, then disconnecting signals only for those.
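Haystack ships this processor itself, so in most projects it is enabled through a setting rather than copied into the codebase. A one-line settings sketch, assuming the built-in class is what you want:

```python
# Settings fragment: update the search index on every model save and delete.
HAYSTACK_SIGNAL_PROCESSOR = "haystack.signals.RealtimeSignalProcessor"
```

The trade-off is that every save now also triggers a write to the search backend, so for write-heavy models a periodic `update_index` run may be the better fit.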
42. Haystack: Pros and Cons
Pros:
• Easy to set up
• Looks like the Django ORM, but for searches
• Search engine independent
• Supports 4 engines (Elasticsearch, Solr, Xapian, Whoosh)
Cons:
• Poor SearchQuerySet API
• Difficult to manage stop words
• Loses performance because of the extra layer
• Model-based
43. Future FTS and Roadmap Django 1.9
• PostgreSQL Full Text Search (Marc Tamlyn)
https://github.com/django/django/pull/4726
• Custom indexes (Marc Tamlyn)
• etc.