This document discusses Elasticsearch plugins, including:
- The different types of plugins like analysis, scripting, and site plugins
- Integration points in Elasticsearch that can be extended by plugins
- Examples of plugins like custom analyzers and percolators
- Considerations for writing plugins like maintenance, versioning, and testing
- How to package and install plugins within Elasticsearch
7. Harry Potter and the Goblet of Fire
Step 1: Tokenization
Tokenizer: Harry | Potter | and | the | Goblet | of | Fire
Step 2: Filtering
Lower case filter: harry | potter | and | the | goblet | of | fire
Stop-words filter: harry | potter | goblet | fire
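The two steps on this slide can be sketched in a few lines of Python. This is only an illustration of the idea, not Lucene's implementation: the letter-based tokenizer and the three-word stop list are assumptions chosen to reproduce the slide's example.

```python
import re

# Illustrative stop-word list (an assumption); real analyzers ship their own.
STOP_WORDS = {"and", "the", "of"}

def tokenize(text):
    """Step 1: split the text into tokens on runs of non-letter characters."""
    return re.findall(r"[A-Za-z]+", text)

def lowercase_filter(tokens):
    """Step 2a: normalize every token to lower case."""
    return [t.lower() for t in tokens]

def stop_filter(tokens):
    """Step 2b: drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]

def analyze(text):
    """Run the whole chain: tokenizer, then each filter in order."""
    return stop_filter(lowercase_filter(tokenize(text)))

print(analyze("Harry Potter and the Goblet of Fire"))
# ['harry', 'potter', 'goblet', 'fire']
```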
9. Harry Potter and the Goblet of Fire
Indexing: Harry Potter and the Goblet of Fire
Tokenizer: Harry | Potter | and | the | Goblet | of | Fire
Lower case filter: harry | potter | and | the | goblet | of | fire
Stop-words filter: harry | potter | goblet | fire
Query: Potter
Tokenizer: Potter
Lower case filter: potter
Stop-words filter: potter
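Why the same chain runs on both the indexing side and the query side can be shown with a toy inverted index. This is a hedged sketch in plain Python, not how Lucene stores terms; the whitespace tokenizer, the stop list, and the function names are assumptions made for the example.

```python
from collections import defaultdict

STOP_WORDS = {"and", "the", "of"}  # illustrative list, an assumption

def analyze(text):
    """Tokenize on whitespace, lower-case, then remove stop words."""
    tokens = [t.lower() for t in text.split()]
    return [t for t in tokens if t not in STOP_WORDS]

def build_index(docs):
    """Map every analyzed term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Analyze the query, then return ids of documents matching every term."""
    terms = analyze(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

index = build_index({7: "Harry Potter and the Goblet of Fire"})
print(index.get("Potter"))      # None: the raw query text is not an indexed term
print(search(index, "Potter"))  # {7}: the analyzed term "potter" matches
```

The raw lookup misses because only the lower-cased term was indexed; running the query through the same chain produces a term that matches.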
21. Controlling shard allocation
• Filtering built in
– By tags, groups, racks, IPs
– Black list / white list
• Total shards per node
• Disk based
• EXPERT: Roll your own by implementing AllocationDecider
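The built-in options on this slide are driven by settings rather than code. The sketch below builds the request bodies in Python so they can be inspected; the setting names are the ones I believe Elasticsearch documents for allocation filtering, per-node shard limits, and disk thresholds, but they are worth checking against the version in use. The IP addresses, the `rack` attribute name, and all values are hypothetical.

```python
import json

def cluster_allocation_settings(excluded_ips, low_watermark):
    """Body for the cluster settings endpoint (_cluster/settings).

    Black-lists nodes by IP and sets the low disk watermark.
    """
    return {
        "transient": {
            "cluster.routing.allocation.exclude._ip": ",".join(excluded_ips),
            "cluster.routing.allocation.disk.watermark.low": low_watermark,
        }
    }

def index_allocation_settings(rack, max_shards_per_node):
    """Body for an index settings endpoint (<index>/_settings).

    White-lists nodes by a custom attribute (assumed here to be "rack")
    and caps how many shards of the index one node may hold.
    """
    return {
        "index.routing.allocation.include.rack": rack,
        "index.routing.allocation.total_shards_per_node": max_shards_per_node,
    }

cluster_body = cluster_allocation_settings(["10.0.0.1", "10.0.0.2"], "85%")
index_body = index_allocation_settings("rack1", 2)
print(json.dumps(cluster_body, indent=2))
print(json.dumps(index_body, indent=2))
```

Anything these settings cannot express is where a custom AllocationDecider comes in, which is the EXPERT route named above.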
23. Transports
• Exposes the Elasticsearch RESTful API over protocols other than HTTP
– Apache Thrift
– Memcached
– Servlet
– Redis
– ZeroMQ
27. Discovery
• Default is Zen discovery
– Unicast: I know who my nodes are
– Multicast: Auto discovery for nodes
• Multicast discovery support for cloud environments
– AWS
– Azure
– Google Compute
• ProTip: Unicast in production unless you know what you're doing
• ZooKeeper plugin
28. Snapshot / restore repositories
• File system
• AWS S3
• HDFS
• Azure
• Roll your own (e.g. Glacier)
33. Writing your own plugin: Gotchas
• Maintenance: the deeper you go in the API, the harder it is to keep it up to date
• Versioning and installation on (large) clusters
– Though this can be solved using Puppet, Docker et al.
• Auxiliary data (like dictionaries etc.)
• Testing & Debugging
34. Code: Writing your own plugin
• JAR file with bootstrap code:
• Embed this as es-plugin.properties:
plugin=org.elasticsearch.plugin.example.ExamplePlugin
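To make the packaging concrete, the sketch below assembles a JAR-shaped archive containing the `es-plugin.properties` descriptor and reads the plugin class back out. It only illustrates the layout: a real plugin JAR comes out of a Java build and also contains the compiled classes. Placing the descriptor at the archive root is an assumption, and the helper names are hypothetical.

```python
import io
import zipfile

def build_plugin_jar(plugin_class, class_files=None):
    """Return the bytes of a zip archive (a JAR is a zip) with the descriptor.

    class_files is an optional {path: bytes} mapping standing in for the
    compiled classes a real build would add.
    """
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as jar:
        # Assumption: the descriptor sits at the root of the archive.
        jar.writestr("es-plugin.properties", "plugin=" + plugin_class + "\n")
        for path, data in (class_files or {}).items():
            jar.writestr(path, data)
    return buf.getvalue()

def read_plugin_class(jar_bytes):
    """Read the descriptor back and return the configured plugin class name."""
    with zipfile.ZipFile(io.BytesIO(jar_bytes)) as jar:
        text = jar.read("es-plugin.properties").decode("utf-8")
    for line in text.splitlines():
        key, _, value = line.partition("=")
        if key.strip() == "plugin":
            return value.strip()
    return None

jar_bytes = build_plugin_jar("org.elasticsearch.plugin.example.ExamplePlugin")
print(read_plugin_class(jar_bytes))
```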
About me: freelancer, consultant, Lucene.NET committer
Rationale: Elasticsearch can do search, aggregations, percolation and at scale
Sometimes we need more than that
This talk: a bird's-eye view. Covering a lot of ground here.
From experience
Skim over EXPERT features
Elasticsearch in a nutshell:
REST, JSON wrapping Lucene
Cluster forming and cluster metadata
Server distributes Lucene shards (replication, sharding, multi-tenancy)
Not so interesting since ES discourages the use of them. There are lighter implementations
Only applicable for query_string queries
Have to be done via code, example will follow
Analysis chain very important for indexing
Some queries will still go through the analysis chain (Match family etc)
What is the analysis chain?
Splitting words
Query term should match the indexed term.
Term query is the most basic unit.
Stop words obsolete => common words query
Term query is the most basic building block of a query. Term match is what we need to have
Analysis chain should generally match in both ends
There are scenarios where they differ on purpose
This is why you can set search_analyzer & index_analyzer
Importance of proper tokenization
Discussion: on what characters should we tokenize?
The curious case of email addresses
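The email-address case is easy to reproduce. The sketch below compares two simplified tokenizers written in plain Python; they stand in for the general idea of "split on non-letters" versus "split on whitespace" and are not Lucene's tokenizers. The sample sentence and address are made up.

```python
import re

def letter_tokenizer(text):
    """Split on anything that is not a letter."""
    return re.findall(r"[A-Za-z]+", text)

def whitespace_tokenizer(text):
    """Split on whitespace only, keeping punctuation inside tokens."""
    return text.split()

text = "Contact john.doe@example.com today"

print(letter_tokenizer(text))
# ['Contact', 'john', 'doe', 'example', 'com', 'today']

print(whitespace_tokenizer(text))
# ['Contact', 'john.doe@example.com', 'today']
```

With the first tokenizer a search for the full address can only match its fragments, and a search for "com" matches every address. Elasticsearch ships a `uax_url_email` tokenizer for this case; whether it or a custom analyzer fits depends on the data.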
This is why you probably want to roll your own analyzer if you are doing a lot of FTS
To finalize my case
Some basic analyzers shipping with Lucene
What happens when you try to
From code: custom analyzers and token filters that you can use
Hebrew is a tough language to tackle
HebMorph - Open-source solution (AGPL3)
Requires auxiliary files
Powered by MVEL: Java-like syntax
Other languages include Groovy, JS, Python
You _could_ implement your own scripting engine
Dynamic scripting disabled by default
Scripts need to be loaded from disk
Function score query: look up Britta's talk
Similarity: replacement for TF/IDF.
Out of the box: BM25, DFR, IB and more
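To give a feel for what swapping the similarity changes, the sketch below compares a simplified TF/IDF term weight with the BM25 term weight. Both are reduced versions for illustration: the TF/IDF variant omits length norms and query normalization, and the default `k1` and `b` values are the commonly quoted ones, so treat the numbers as indicative rather than as what Lucene would return.

```python
import math

def tfidf_term_score(freq, n_docs, doc_freq):
    """Simplified TF/IDF: sqrt(tf) * idf, with no length normalization."""
    idf = 1.0 + math.log(n_docs / (doc_freq + 1.0))
    return math.sqrt(freq) * idf

def bm25_term_score(freq, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """BM25 term weight: the tf part saturates as freq grows."""
    idf = math.log(1.0 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    length_norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return idf * freq * (k1 + 1.0) / (freq + length_norm)

# How much does a term repeated 1000 times outscore a single occurrence?
tfidf_ratio = tfidf_term_score(1000, 10000, 50) / tfidf_term_score(1, 10000, 50)
bm25_ratio = (bm25_term_score(1000, 100, 100, 10000, 50)
              / bm25_term_score(1, 100, 100, 10000, 50))
print(round(tfidf_ratio, 1), round(bm25_ratio, 2))
```

The TF/IDF weight keeps growing with the square root of the term frequency, while the BM25 ratio stays below `k1 + 1` however often the term repeats.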
EXPERT EXPERT EXPERT
EXPERT ONLY
Lucene 4.0 feature
Can provide performance boosts for searches and aggregations
Zoom out
In the integration point
Stats, management, …
Some of the built-in features
Roll your own if needed
Con: requires tons of testing, multiple deciders are at play
Thanks to Found
A way to expose new plugin functionality to consumers not using Java
Or leverage the HTTP server capabilities of ES for your requirements
Parsing request, performing action, creating and sending response
Better query filtering for performance (fewer queries)
Highlighting
More logs + custom logs
Various other optimizations
A la significant terms facet
We could have done this client-side only.
This would have been linear in time
We made this sharded
Java client code
Debugging
A static website that can be served using ES HTTP server
Multicast: the more the merrier problem
ZooKeeper plugin: not up to date, not official
Aphyr's findings re partial partitions
The idea behind them
Why rivers are obsolete: node comes down, backlog
Always prefer push over pull
Official guidance is not to use them going forward
Plugin names specify the folder name under /plugins
Node info API can provide
Don't be that guy: ignore the urge to write custom stuff
The defaults are good + A lot can be done w/ scripting
Basically, when you really need custom distributed behavior
Or a REST endpoint exposed cluster-wide
Or EXPERT FEATURES
Aux data: open ES ticket for enabling analyzers to read docs
JAR (has to be JVM code)
Boilerplate setup and code
Modules; AnalysisBinderProcessor; TransportActions; RestActions;
Everything in Elasticsearch is implemented as an Action
Client / server reuse of request/response classes, when in Java