Elasticsearch - SEARCH & ANALYZE DATA IN REAL TIME

ELASTICSEARCH
SEARCH & ANALYZE DATA IN REAL TIME*
Piotr Pelczar • github • stackoverflow
Wrocław 2017, Eurobank
freeimages.com v 1.2

AGENDA
You will find out:
• purpose
• how data is stored and searched
• features + 3rd party
• architecture
• use case in production

AGENDA
You will not find out:
• production ready configuration: HA, repl, sharding
• monitoring
• production internal bottlenecks / failure recovery
• ELK stack – elasticsearch + logstash + kibana
• comparison of ES and Solr/Sphinx

PURPOSE
• NoSQL
• Database for full-text search
• More reads than writes
every document update is a creation of a new one
• No transactions – BASE instead of ACID

FULL-TEXT SEARCH
• full-text search (FTS) refers to techniques for efficiently
searching data similar to natural language text
• search is performed using full text index
• under the hood ES is powered by Apache Lucene

FULL-TEXT SEARCH
FTS is available in Oracle, MsSQL, MySQL, but...
• KILLER FEATURE: ES lets you customize the process
of building the full-text index
• features like:
– autocomplete
– "Did you mean?" based on Levenshtein distance
– indexing one field in a several ways
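
A minimal sketch of the idea behind "Did you mean?", assuming a plain in-memory vocabulary. The helper names `levenshtein` and `did_you_mean` and the `max_distance` threshold are illustrative; this is not how Elasticsearch implements its suggesters.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or keep)
        prev = curr
    return prev[-1]


def did_you_mean(word, vocabulary, max_distance=2):
    """Return the closest known term, or None if nothing is close enough.

    Assumes a non-empty vocabulary.
    """
    distance, best = min((levenshtein(word, v), v) for v in vocabulary)
    return best if distance <= max_distance else None
```

For example, `did_you_mean("elasticsaerch", ["elasticsearch", "lucene"])` picks `"elasticsearch"`, because the swapped letters cost two substitutions.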

ES IS POWERED BY LUCENE
Features added:
• clustering
• sharding (horizontal scaling)
• replication (copy of shards)
• versioning
• non-full-text indices
• REST API
https://pl.pinterest.com/pin/528328600014757803/

BASE vs ACID
• Atomicity
• Consistency
• Isolation
• Durability
• Basically Available
• Soft state
• Eventual consistency
http://wallpaperswide.com/domino_effect-wallpapers.html

TERMINOLOGY
• Cluster
• Node
• Index (collection of docs) <-> (?) tablespace/database
• Type (partition of index) <-> (?) table/collection
• Document (JSON format)
• Shard & Replicas
https://wallpaperscraft.com/download/london_philharmonic_orchestra_scene_show_play_conductor_8925/2560x1080

DOCUMENT-ORIENTED
AND SCHEMA-FREE
• data is stored as documents
• documents are unstructured *
* by default, but it is possible to require a strict or partly strict schema in the type
definition in the index
• all fields are indexed in full-text by default *
* this behaviour is fully configurable, data type can be changed or
the field can be ignored in full-text index
http://www.shximai.com/education-wallpapers.html

RANKING FORMULA
BM25 similarity function
https://www.slideshare.net/Hadoop_Summit/t-435p212leauv2
1. http://ipl.cs.aueb.gr/stougiannis/bm25_2.html
2. https://en.wikipedia.org/wiki/Tf%E2%80%93idf
q - query
t – term
d - document
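
The formula itself was an image on the slide. As a rough sketch, the classic BM25 score for a query q, its terms t and a document d could be computed as below. The defaults `k1=1.2` and `b=0.75` and the Lucene-style IDF are assumptions of this sketch, and documents are assumed to be already tokenized lists of terms.

```python
import math
from collections import Counter


def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Score one document (list of terms) against a query, given the whole corpus."""
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    tf = Counter(doc_terms)
    score = 0.0
    for t in query_terms:
        df = sum(1 for d in corpus if t in d)  # number of documents containing t
        if df == 0 or tf[t] == 0:
            continue
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        # term frequency saturates (k1) and is normalized by document length (b)
        length_norm = k1 * (1 - b + b * len(doc_terms) / avg_len)
        score += idf * tf[t] * (k1 + 1) / (tf[t] + length_norm)
    return score
```

Rare terms get a higher IDF, so a match on a rare term contributes more than a match on a common one.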

ANALYZERS
Analysis - the process of converting text into tokens or
terms which are added to the inverted index for
searching. Analysis is performed by an analyzer.
• Index time analyser
• Search time analyser
http://wallpaperswide.com/glasses_and_book-wallpapers.html

ANALYZERS
1. Tokenizing a block into terms
2. Normalizing (reducing) terms into root form
• Every field in a document type can have its own analyser
• Fields can be indexed by several analysers
(multi-fields)
https://www.elastic.co/guide/en/elasticsearch/reference/current/multi-fields.html#_multi_fields_with_multiple_analyzers

ANALYZERS
1. Character Filters
– html and entities
– trimming
2. Tokenizer
3. Token Filters
– stopwords
– Stemmer (root form)
– Phonetic, n-grams
– Synonym
– Pattern capture
Pipeline: Input -> Character filter -> Tokenizer -> Token filter -> Index
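
The three stages above can be sketched as a toy pipeline feeding an inverted index. This is only an illustration: the regular expressions, the tiny `STOPWORDS` set and all function names are assumptions, not Elasticsearch or Lucene code.

```python
import html
import re
from collections import defaultdict

STOPWORDS = {"a", "the", "is"}  # illustrative stop list


def char_filter(text):
    """1. Character filter: strip HTML tags, decode entities, trim."""
    return html.unescape(re.sub(r"<[^>]+>", " ", text)).strip()


def tokenize(text):
    """2. Tokenizer: split the text into word tokens."""
    return re.findall(r"\w+", text)


def token_filters(tokens):
    """3. Token filters: lowercase, then drop stopwords."""
    lowered = (t.lower() for t in tokens)
    return [t for t in lowered if t not in STOPWORDS]


def analyze(text):
    return token_filters(tokenize(char_filter(text)))


def build_inverted_index(docs):
    """Map every term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index
```

Running the same `analyze` at search time is what makes a query for "Fox" hit documents indexed with "fox".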

ANALYZERS
Polish analyser – Stempel
sudo bin/plugin install analysis-stempel

CUSTOM ANALYSERS
PUT /index_name
{
"settings": {
"analysis": {
"analyzer": {
"polskie_slowa": {
"type": "custom",
"tokenizer": "standard",
"filter": ["lowercase", "stopwords_polska", "polish_stem", "asciifolding"],
"char_filter": ["html_strip"]
}
},
"filter": {
"stopwords_polska": {
"type": "stop",
"stopwords": ["a", "aby", ...]
}
}
}
}
// ...

TESTING ANALYSER
GET /_analyze
{
"analyzer": "standard",
"text": "Text to analyze"
}
{
"tokens": [
{
"token": "text",
"start_offset": 0,
"end_offset": 4,
"type": "<ALPHANUM>",
"position": 1
},
// ..

INSERT
PUT /website/blog/123
{
"title": "My first blog entry",
"text": "Just trying this out...",
"date": "2014/01/01"
}
Custom ID - PUT
POST /website/blog/
Auto ID - POST

INSERT
By default:
• There is no document schema
• All fields are indexed in full-text index
• The type of a previously unknown field
is determined by the first value encountered
(dynamic mapping)
https://www.elastic.co/guide/en/elasticsearch/guide/current/dynamic-mapping.html
https://studyinthestates.dhs.gov/2017/01/new-interim-final-guidance-open-for-comment-until-feb-27

INSERT – DYNAMIC MAPPING
dynamic
- true (fields out of schema are allowed)
- false (new fields are just ignored)
- strict (exception, when unknown field)
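
The three settings can be mimicked with a small simulation. It is a sketch only: `apply_dynamic` is a hypothetical helper, and using the Python type name as the "mapped type" is a simplification of real dynamic mapping.

```python
def apply_dynamic(mapping, doc, dynamic="true"):
    """Return (indexed_fields, new_mapping) for one incoming document."""
    new_mapping = dict(mapping)
    indexed = {}
    for field, value in doc.items():
        if field in new_mapping:
            indexed[field] = value
        elif dynamic == "true":
            # unknown field: its type is taken from the first value seen
            new_mapping[field] = type(value).__name__
            indexed[field] = value
        elif dynamic == "false":
            continue  # unknown field is ignored by the index
        elif dynamic == "strict":
            raise ValueError(f"unknown field: {field}")
        else:
            raise ValueError(f"unsupported dynamic setting: {dynamic}")
    return indexed, new_mapping
```

With `{"title": "str"}` as the mapping, a document carrying an extra `likes` field extends the mapping under `"true"`, loses the field under `"false"`, and is rejected under `"strict"`.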

INSERT – DYNAMIC MAPPING
PUT /my_index
{
"mappings": {
"my_type": {
"dynamic": "strict",
"properties": {
"title": { "type": "string"},
"stash": {
"type": "object",
"dynamic": true
}
}
}
}
}

MAPPING
Data types:
• Core
– numeric, date, boolean, binary, string (keyword, fulltext)
• Complex
– array, object, nested (array of objects)
– geo
• Multi-fields
– e.g. date as raw date and full-text value (movie title YEAR), or field
with multiple analysers
• Specialized
– ip, completion, …

MAPPING
"mappings": {
"news": {
"properties": {
"title": {
"type": "string",
"analyzer": "polskie_slowa",
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
}
}
},
"date_published": {
"type": "date"
},
multifield

MAPPING
"mappings": {
"news": {
"properties": {
"stories": {
"type": "nested",
"properties": {
"id": {
"type": "long"
},
"title": {
"type": "string",
"analyzer": "polskie_slowa"
}
}
}

MAPPING
"mappings": {
"news": {
"properties": {
"media": {
"type": "object",
"properties": {
"gallery_has": {
"type": "boolean"
},
"video_has": {
"type": "boolean"
},
"poll_has": {
"type": "boolean"
},

GET
GET /website/blog/123
{
"_index" : "website",
"_type" : "blog",
"_id" : "123",
"_version" : 1,
"found" : true,
"_source" : {
"date": "2014/01/01"
}
}
Status code: 200, 404

DELETE
DELETE /website/blog/123

UPDATE
PUT /website/blog/123
{
"date": "2014/01/01"
}
POST /website/blog/123/_update
{
"doc": {
"title": "My first blog entry"
}
}

UPDATE – IMMUTABLE DOCS
• Documents are immutable, because Lucene
segments are immutable
• A new version of the document is created
• The old version is marked in the .del file
• The previous version still matches queries, but is
filtered out of the search results at runtime
until the cleanup process (segment merge)
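
A toy model of the points above, assuming a single append-only segment. The class `SegmentedStore` and its fields are hypothetical names used only to show the mechanics of "update = new version + deletion marker".

```python
class SegmentedStore:
    def __init__(self):
        self.segment = []     # append-only list of (doc_id, version, source)
        self.deleted = set()  # positions marked as deleted (the ".del" role)
        self.latest = {}      # doc_id -> position of the live version

    def index(self, doc_id, source):
        """Create or update: always appends, never modifies in place."""
        old_pos = self.latest.get(doc_id)
        version = 1
        if old_pos is not None:
            self.deleted.add(old_pos)  # mark the old version only
            version = self.segment[old_pos][1] + 1
        self.segment.append((doc_id, version, source))
        self.latest[doc_id] = len(self.segment) - 1
        return version

    def search(self, predicate):
        """Old versions are still stored, but filtered out at query time."""
        return [entry for pos, entry in enumerate(self.segment)
                if pos not in self.deleted and predicate(entry[2])]

    def merge(self):
        """Cleanup: physically drop the entries marked as deleted."""
        live = [e for pos, e in enumerate(self.segment) if pos not in self.deleted]
        self.segment = live
        self.deleted = set()
        self.latest = {e[0]: pos for pos, e in enumerate(live)}
```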

URI SEARCH
GET /index/_search?q=user:kimchy
GET /index/type1,type2/_search?q=user:kimchy
GET /index1,index2/type/_search?q=user:kimchy
GET /_all/type/_search?q=user:kimchy
GET /_search?q=user:kimchy

URI SEARCH
{
"timed_out": false,
"took": 62,
"hits":{
"total" : 1,
"max_score": 1.3862944,
"hits" : [
{
"_index" : "twitter",
"_type" : "tweet",
"_id" : "0",
"_score": 1.3862944,
"_source" : {
"user" : "kimchy",
"date" : "2009-11-15T14:12:12",
"message" : "trying out Elasticsearch",
"likes": 0
}

TERM QUERY
GET /index/type/_search
{
"query" : {
"term" : { "user" : "kimchy" }
}
}

SEARCH
Search segments:
• must-match
• should-match (scoring)
• fuzzy-query (Levenshtein distance)
• filter (without scoring, very fast)
• limit/offset

TERM QUERY & BOOST
"query": {
"bool": {
"should" : [
{
"match": { "title": { "query": "myśliwy", "boost": 1 } }
}
],
"filter": {
"and" : [
{
"term": {
"media.gallery_has": false
}
}
]
}

HIGHLIGHTING
GET /_search
{
"query" : {
"match": { "content": "kimchy" }
},
"highlight" : {
"fields" : {
"content" : {}
}
}
}

SUGGESTING
POST music/_search?pretty
{
"suggest": {
"song-suggest" : {
"prefix" : "nir",
"completion" : {
"field" : "suggest"
}
}
}
}

SUGGESTING
"suggest": {
"song-suggest" : [ {
"text" : "nir",
"offset" : 0,
"length" : 3,
"options" : [ {
"text" : "Nirvana",
"_index": "music",
"_type": "song",
"_id": "1",
"_score": 1.0,
"_source": {
"suggest": ["Nevermind", "Nirvana"]
}
} ]

SCORE FUNCTIONS
• Weight
• Field Value factor
• Decay functions
https://www.elastic.co/guide/en/elasticsearch/guide/current/decay-functions.html

SCORE FUNCTIONS - DECAY
Model functions:
• Gauss
• Exp
• Linear
• Multivalue
– 2d
– 3d
– n …

SCORE FUNCTIONS - DECAY
"function_score": {
"query": { ... },
"score_mode": "multiply", // how the function scores are combined with each other
"boost_mode": "multiply", // how the combined function score affects the original score
"functions": [
{
"gauss": { // or "exp" / "linear"
"date_published": {
"origin": " ... ",
"offset": " ... ",
"scale": " ... "
}
}
}
]
}
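
To show what `origin`, `offset` and `scale` do, here is a sketch of the Gauss decay curve as described in the Elasticsearch documentation. The `decay=0.5` default and the purely numeric inputs are assumptions of this example; real date fields would first be converted to numbers.

```python
import math


def gauss_decay(value, origin, scale, offset=0.0, decay=0.5):
    """Score multiplier in (0, 1]: 1.0 near the origin, `decay` at distance offset + scale."""
    distance = max(0.0, abs(value - origin) - offset)
    sigma_sq = -scale ** 2 / (2.0 * math.log(decay))
    return math.exp(-distance ** 2 / (2.0 * sigma_sq))
```

Within `offset` of the origin the multiplier stays at 1.0; one `scale` beyond that it has dropped to `decay`, and it keeps falling smoothly after that.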

CHANGE
https://www.entrepreneur.com/article/269669

REINDEX
POST _reindex
{
"source": {
"index": "twitter"
},
"dest": {
"index": "new_twitter"
}
}
Every index change requires an index rebuild (reindex):
• Change of data types, schema
• Analyser modification
• Shard key, number of shards (no way to rebalance data)

REINDEX
POST _reindex
{
"source": {
"index": "twitter",
"type": "tweet",
"query": {
"term": {
"user": "kimchy"
}
}
},
"dest": {
"index": "new_twitter"
}
}
2x more space is needed, but it is possible to
run the query with limit and offset and delete data from
the old index in the meantime.
The application has to query both indices/aliases during
the reindex process and filter duplicates at runtime.

Index Aliases and Zero Downtime
POST /_aliases
{
"actions" : [
{ "remove" : { "index" : "news_v1",
"alias" : "news_view" } },
{ "add" : { "index" : "news_v2",
"alias" : "news_view" } }
]
}
https://www.elastic.co/guide/en/elasticsearch/guide/current/index-aliases.html
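
A small sketch of how an application could build the request body above when switching versions. The helper name `swap_alias_actions` is hypothetical; it only produces the JSON, and sending it to `POST /_aliases` is left to whatever HTTP client is in use.

```python
import json


def swap_alias_actions(alias, old_index, new_index):
    """Body for POST /_aliases that moves `alias` from old_index to new_index."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


body = json.dumps(swap_alias_actions("news_view", "news_v1", "news_v2"))
```

Both actions travel in one request, which is why the application reading through `news_view` never sees a moment without an index behind the alias.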

* REAL TIME, NEAR REALTIME (NRT)
This means there is a slight latency
(normally one second) from the time you index a
document until the time it becomes searchable.
http://all-free-download.com/wallpapers/animals/horse_racing_wallpaper_horses_animals_wallpaper_382.html

NEAR REALTIME (NRT)
1. Data have to be analysed
2. There are caveats with Lucene segments

NRT AND BASE
There is no global lock in the system
-> append-only, immutable
-> no transactions
-> much better response time and availability => PROFIT!
-> data may not be visible immediately
Stale data, but any data...
http://wallpaperswide.com/domino_effect-wallpapers.html

NEAR REALTIME (NRT)
• New index data is written to new Lucene Segments periodically
• Creating a new Lucene Segment is called a refresh
• 1 sec by default - NRT
• INSERT -> an immediate search may not return the new document yet (GET by ID is realtime by default)

NEAR REALTIME (NRT)
• Lucene allows opening a new Segment and searching it
without a Commit
• A Commit makes a Lucene Segment immutable
• A document is always added to a new Segment that temporarily
resides in memory (file system cache) and is searchable
• Saving the Segment requires an fsync, which is expensive
• Data is not stored on disk immediately, but there is no global
lock (BASE)

PERSISTENCE - TRANSLOG
1. New documents are stored in the translog and in an in-memory buffer
(at this point they are not searchable)
2. During the periodic refresh they are copied to a new
Segment that also resides in memory
– from now on, the documents are searchable
3. When a Segment Commit occurs (fsync'ed to disk), the
document is removed from the translog
4. Commit plus translog cleanup is called a flush; it runs
periodically or when the translog grows too big
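
The four steps can be modelled with a few lists. This is a deliberately simplified sketch: the class `ShardModel` and its attributes are hypothetical, and real segments, fsync and crash recovery are not modelled.

```python
class ShardModel:
    def __init__(self):
        self.translog = []         # operations kept until the next flush
        self.buffer = []           # in-memory buffer, not searchable
        self.memory_segments = []  # refreshed: searchable, not yet committed
        self.disk_segments = []    # committed (fsync'ed) segments

    def index(self, doc):
        # step 1: write to the translog and to the in-memory buffer
        self.translog.append(doc)
        self.buffer.append(doc)

    def refresh(self):
        # step 2: the buffer becomes a new searchable in-memory segment
        if self.buffer:
            self.memory_segments.append(list(self.buffer))
            self.buffer.clear()

    def flush(self):
        # steps 3 and 4: commit the segments, then clean up the translog
        self.refresh()
        self.disk_segments.extend(self.memory_segments)
        self.memory_segments.clear()
        self.translog.clear()

    def search(self):
        return [d for seg in self.disk_segments + self.memory_segments for d in seg]
```

A document indexed just before a crash is missing from every segment but still present in the translog, which is what makes replaying it possible.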
http://sixthjudicialdistrict.sleekup.com/wp-content/uploads/2015/08/Judge-holding-gavel.jpg

PERSISTENCE - TRANSLOG
1. Translog is configurable per index
– the translog is fsync'ed every 5 sec by default
– and after every insert/update/delete/index request
– after a bulk insert the translog is committed once, so bulk requests are worth using
2. Can be configured as async with interval:
PUT /my_index/_settings
{
"index.translog.durability": "async",
"index.translog.sync_interval": "5s"
}

PER-OPERATION PERSISTENCE

SCALING
sharding and replication

NODES IN CLUSTER
• Among the copies of a shard there is exactly one primary
• A shard copy on any node can become the primary

CRUD IN CLUSTER
• A request can be handled by any node;
that node coordinates the request
• CRUD is performed on Primary Shard first and
replicated to Replicas

CONSISTENCY
• Quorum by default (majority of Shard copies)
– floor( (primary + number_of_replicas) / 2 ) + 1
– Can be:
• Quorum
• One
• All
• Defining a timeout is recommended (1 min by default)
– Elasticsearch waits until all the required responses
arrive. On timeout, the application has to decide
what to do
https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-index_.html#index-wait-for-active-shards
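
The quorum formula from this slide, worked out for the three consistency levels. The function name `required_copies` is illustrative.

```python
def required_copies(consistency, number_of_replicas, primary=1):
    """How many shard copies must be active before a write proceeds."""
    total = primary + number_of_replicas
    if consistency == "one":
        return 1
    if consistency == "all":
        return total
    if consistency == "quorum":
        return total // 2 + 1  # floor((primary + number_of_replicas) / 2) + 1
    raise ValueError(f"unknown consistency level: {consistency}")
```

With 2 replicas (3 copies in total) a quorum is 2 copies; with 4 replicas it is 3.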

READ FROM CLUSTER
• Any node can receive the request and coordinate it
• Results will be fetched from nodes
By default, the coordinating node chooses a different shard copy on every
request in order to balance reads (round robin)
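
Round-robin selection of a shard copy can be sketched with `itertools.cycle`. The class name `CopySelector` and the node names are made up for the example.

```python
import itertools


class CopySelector:
    """Hands out shard copies (primary or replicas) in turn."""

    def __init__(self, copies):
        self._cycle = itertools.cycle(copies)

    def next_copy(self):
        return next(self._cycle)


selector = CopySelector(["node1", "node2", "node3"])
picks = [selector.next_copy() for _ in range(4)]
```

Here `picks` ends up as `["node1", "node2", "node3", "node1"]`: each read goes to the next copy, then the cycle starts over.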

NODES IN CLUSTER
• Every node knows where the data lives (information
about the shard key), so it can route the request
for the client this approach is transparent: it can talk to
any node it wants
• If the client keeps connections to multiple nodes, there is
no Single Point of Failure
• Round-robin, to distribute the load
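
How a node can derive the shard from the shard key, as a sketch: the shard number is a hash of the routing value (the document ID by default) modulo the number of primary shards. `zlib.crc32` is only a stand-in chosen for this example; Elasticsearch uses its own hash function.

```python
import zlib


def shard_for(routing: str, number_of_primary_shards: int) -> int:
    """Pick the primary shard for a routing value (document ID by default)."""
    # crc32 is an assumption of this sketch, any stable hash shows the idea
    return zlib.crc32(routing.encode("utf-8")) % number_of_primary_shards
```

Because the result depends on the number of primary shards, changing that number would send existing documents to different shards, which is why it requires a reindex.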

DISTRIBUTED READ
QUERY PHASE / FETCH PHASE
https://www.elastic.co/guide/en/elasticsearch/guide/current/distributed-search.html
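
A sketch of the two phases, assuming every shard is a plain dict of `doc_id -> (score, source)` with scores already computed. All names are illustrative; real shards score documents during the query phase.

```python
import heapq


def query_phase(shard, k):
    """Each shard returns only (score, doc_id) pairs for its local top k."""
    return heapq.nlargest(k, ((score, doc_id) for doc_id, (score, _) in shard.items()))


def distributed_search(shards, k):
    # query phase: gather lightweight candidates from every shard
    candidates = []
    for shard_no, shard in enumerate(shards):
        for score, doc_id in query_phase(shard, k):
            candidates.append((score, doc_id, shard_no))
    # the coordinating node merges them into the global top k
    top = heapq.nlargest(k, candidates)
    # fetch phase: load full documents only for the winners
    return [(doc_id, score, shards[shard_no][doc_id][1])
            for score, doc_id, shard_no in top]
```

The point of the split is that full `_source` documents travel over the network only for the final top k hits, not for every candidate.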

CONNECTIONS
• REST API
– connection should be kept alive (choose proper library)
• Native client
– Binary protocol
– Designed for inter-node communication

DOCUMENT VERSIONING
• Optimistic concurrency tool
• Version types
– Internal (1, 2, 3 …)
– external or external_gt
– external_gte
POST /index/type?version=TIMESTAMP
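
The version checks can be sketched as an in-memory store. `VersionedStore` and `VersionConflict` are hypothetical names, and the comparison rules follow the version types listed above: internal must match the current version, external must be greater, external_gte greater or equal.

```python
class VersionConflict(Exception):
    pass


class VersionedStore:
    def __init__(self):
        self.docs = {}  # doc_id -> (version, source)

    def put(self, doc_id, source, version=None, version_type="internal"):
        current = self.docs.get(doc_id, (0, None))[0]
        if version_type == "internal":
            # the caller states which version it has seen; a mismatch means a lost race
            if version is not None and version != current:
                raise VersionConflict(f"expected {version}, found {current}")
            new_version = current + 1
        elif version_type in ("external", "external_gte"):
            if version is None:
                raise ValueError("external versioning needs an explicit version")
            stale = version <= current if version_type == "external" else version < current
            if stale:
                raise VersionConflict(f"version {version} is not newer than {current}")
            new_version = version
        else:
            raise ValueError(f"unknown version type: {version_type}")
        self.docs[doc_id] = (new_version, source)
        return new_version
```

With external versioning a timestamp works as the version, as in the request above: an older write arriving late is rejected instead of overwriting newer data.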

Search in whole ElasticSearch dataset personalized for specific user
(based on neo4j graph relations)
https://neo4j.com/developer/elastic-search/

Thanks! Q&A
http://doineedbackup.com/customer-acquisition-channels-for-saas/
