Second Galway Data Meetup, 29th April 2015
Elasticsearch was originally developed for searching flat documents. However, real-world data is inherently more complex (e.g., nested JSON data, relational data, interconnected documents and entities), so Elasticsearch has quickly evolved to support more advanced search scenarios. In this presentation, we will review existing features and plugins that support such scenarios, discuss their advantages and disadvantages, and work out which one is most appropriate for a particular scenario.
2. ● CTO, SIREn Solutions
– Search, Big Data, Knowledge Graph
● Lucene / Solr Contributor
– E.g., Cross Data Center Replication
– Lucene Revolution 2013, 2014
– Lucene In Action, 2nd Edition
● Author of the SIREn plugin
Introducing myself
3. ● Open source search systems
– Lucene, Solr, Elasticsearch
● Document-based model
– Flat key-value model
– Originally developed for searching full-text documents
Background
firstname : John
lastname : Smith
title : Mr, Dr
4. Background
● Data is usually more complex
– Nested objects
● XML, JSON
● E.g., US patents
– Relations
● RDBMS, RDF, graphs, documents with links to entities or other documents
Article
{
"firstName": "John",
"lastName": "Smith",
"age": 25,
"address" : {
"street" : "21 2nd Street",
"city" : "New York",
"state" : "NY"
},
"phoneNumber" : [
{ "type" : "home", "number" : "212 555-1234" },
{ "type" : "fax", "number" : "646 555-4567" }
]
}
Person
Company
6. name : Elastic
funding_rounds.round_code : A
funding_rounds.founded_year : 2012
funding_rounds.round_code : B
funding_rounds.founded_year : 2013
funding_rounds.investments.name : Benchmark
funding_rounds.investments.name : Data Collective
funding_rounds.investments.name : Index Ventures
● Pros:
– Relatively easy
– Fast
● Cons:
– Loss of precision (false positives)
– Index-time data materialisation
– Data duplication (child)
– Not optimal for updates
Common solutions
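The loss of precision listed above can be illustrated with a minimal Python sketch. It assumes a simple flattening function (`flatten`) and a naive matcher (`matches`), both hypothetical helpers written for this illustration rather than anything Elasticsearch exposes. Once the nested funding rounds are collapsed into multi-valued fields, the link between a round code and its year is gone.

```python
def flatten(obj, prefix=""):
    """Collapse a nested JSON-like object into a flat, multi-valued field map."""
    fields = {}
    for key, value in obj.items():
        path = f"{prefix}{key}"
        values = value if isinstance(value, list) else [value]
        for item in values:
            if isinstance(item, dict):
                # Nested object: merge its flattened fields under a dotted path.
                for sub_path, sub_values in flatten(item, path + ".").items():
                    fields.setdefault(sub_path, []).extend(sub_values)
            else:
                fields.setdefault(path, []).append(item)
    return fields


def matches(doc, criteria):
    """True if every requested value occurs in its field (object boundaries ignored)."""
    return all(value in doc.get(field, []) for field, value in criteria.items())


company = {
    "name": "Elastic",
    "funding_rounds": [
        {"round_code": "A", "founded_year": 2012},
        {"round_code": "B", "founded_year": 2013},
    ],
}

flat = flatten(company)

# Round A took place in 2012, yet the flat document also matches "A and 2013".
false_positive = matches(
    flat,
    {"funding_rounds.round_code": "A", "funding_rounds.founded_year": 2013},
)
print(flat)
print("false positive:", false_positive)
```

The query "round A in 2013" should not match, but it does, because both values are present somewhere in the flattened document.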
7. name : Elastic
f_r.round_code : A
f_r.founded_year : 2012
f_r.inv.name : Benchmark

name : Elastic
f_r.round_code : A
f_r.founded_year : 2012
f_r.inv.name : Data Collective

name : Elastic
f_r.round_code : B
f_r.founded_year : 2013
f_r.inv.name : Benchmark

name : Elastic
f_r.round_code : B
f_r.founded_year : 2013
f_r.inv.name : Index Ventures
● Pros:
– Relatively easy
– No loss of precision
● Cons:
– Index-time data materialisation
– Combinatorial explosion
– Duplicate results: query-time grouping is necessary
– Data duplication (parent and child)
– Not optimal for updates
Common solutions
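A rough Python sketch of this second approach, assuming a hypothetical `expand` helper and an in-memory list standing in for the index: each (funding round, investor) pair becomes its own flat document, so precision is preserved, but the parent fields are repeated and one company can match several times.

```python
def expand(company):
    """Emit one flat document per (funding round, investor) pair."""
    docs = []
    for funding_round in company["funding_rounds"]:
        for investment in funding_round["investments"]:
            docs.append({
                "name": company["name"],  # parent data duplicated in every document
                "f_r.round_code": funding_round["round_code"],
                "f_r.founded_year": funding_round["founded_year"],
                "f_r.inv.name": investment["name"],
            })
    return docs


company = {
    "name": "Elastic",
    "funding_rounds": [
        {"round_code": "A", "founded_year": 2012,
         "investments": [{"name": "Benchmark"}, {"name": "Data Collective"}]},
        {"round_code": "B", "founded_year": 2013,
         "investments": [{"name": "Benchmark"}, {"name": "Index Ventures"}]},
    ],
}

docs = expand(company)

# Query: companies in which Benchmark invested. Two flat documents match,
# so the hits are grouped by company name to report each company once.
hits = [d for d in docs if d["f_r.inv.name"] == "Benchmark"]
grouped = sorted({d["name"] for d in hits})
print(len(docs), len(hits), grouped)
```

The number of documents grows with the product of the nesting levels, which is where the combinatorial explosion comes from.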
8. ● Lucene's BlockJoin
– Feature to provide relational search
– “Nested” type in Elasticsearch
● Model
– One (flat) document per record
– Joins computed at index time
– Related documents are indexed in the same “block”
{
  "company": {
    "properties" : {
      "funding_rounds" : {
        "type" : "nested",
        "properties" : {
          "investments" : {
            "type" : "nested"
          }
        }
      }
    }
  }
}
Index-time join
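With a mapping like the one above, a search that must match both conditions inside the same funding round would typically be written as a `nested` query. The sketch below only builds the request body as a Python dictionary; the field names come from the earlier slides, the helper name `nested_round_query` is made up for this example, and the body has not been run against a cluster.

```python
import json


def nested_round_query(round_code, year):
    """Build a request body matching both values within one nested funding round."""
    return {
        "query": {
            "nested": {
                "path": "funding_rounds",  # the field mapped with "type": "nested"
                "query": {
                    "bool": {
                        "must": [
                            {"match": {"funding_rounds.round_code": round_code}},
                            {"match": {"funding_rounds.founded_year": year}},
                        ]
                    }
                },
            }
        }
    }


body = nested_round_query("A", 2012)
print(json.dumps(body, indent=2))
```

Because both clauses are evaluated against a single nested object, a company with round A in 2012 and round B in 2013 would not be returned for "A and 2013".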
9. Index-time join
● Pros:
– Fast (join precomputed, data locality)
– No loss of precision
● Cons:
– Index-time data materialisation
– Data duplication (child)
– Not optimal for updates
– High memory usage for complex nested model
Document Block
name : Elastic
country_code : A
...
round_code : A
founded_year : 2012
...
Name : Data Collective
Type : Org
Name : Benchmark
Type : Org
round_code : B
founded_year : 2013
...
Name : Index Ventures
Type : Org
Name : Benchmark
Type : Org
10. Index-time join
● SIREn Plugin
– Plugin to Lucene, Solr, Elasticsearch
– Adds a native index for nested data types
– http://siren.solutions/siren/overview/
● Model
– One document per “tree”
– Joins computed at index time
– Rich data model (JSON)
● Nested objects, nested arrays, multi-valued attributes, datatypes
{
  "company": {
    "properties" : {
      "_siren_source" : {
        "analyzer" : "concise",
        "postings_format" : "Siren10AFor",
        "store" : "no",
        "type" : "string"
      }
    }
  }
}
11. Index-time join
name : Elastic
country_code : A
...
round_code : A
founded_year : 2012
...
round_code : B
founded_year : 2013
...
Name : Data Collective
Type : Org
Name : Benchmark
Type : Org
Name : Index Ventures
Type : Org
Name : Benchmark
Type : Org
● Pros:
– Fast (join precomputed, data locality)
– No loss of precision
– Low memory usage, even for complex nested model
● Cons:
– Index-time data materialisation
– Data duplication (child)
– Not optimal for updates
[Figure: tree node labels 1; 1.1, 1.2; 1.1.1, 1.1.2, 1.2.1, 1.2.2]
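The labels 1, 1.1, 1.2, 1.1.1 and so on read like hierarchical node identifiers for the document tree. The following Python sketch assumes that interpretation and shows one way such labels could be assigned and used. It is an illustration only, not SIREn's actual encoding, and `label_tree` and `is_ancestor` are hypothetical names.

```python
from collections import deque


def label_tree(root):
    """Assign dotted labels (1, 1.1, 1.2, ...) to tree nodes in breadth-first order."""
    labelled = []
    queue = deque([("1", root)])
    while queue:
        label, node = queue.popleft()
        labelled.append((label, node["value"]))
        for position, child in enumerate(node.get("children", []), start=1):
            queue.append((f"{label}.{position}", child))
    return labelled


def is_ancestor(ancestor, descendant):
    """A node is an ancestor of another if its label is a proper dotted prefix."""
    return descendant.startswith(ancestor + ".")


tree = {
    "value": "name : Elastic",
    "children": [
        {"value": "round_code : A",
         "children": [{"value": "Data Collective"}, {"value": "Benchmark"}]},
        {"value": "round_code : B",
         "children": [{"value": "Index Ventures"}, {"value": "Benchmark"}]},
    ],
}

labelled = label_tree(tree)
labels = [label for label, _ in labelled]
print(labels)
```

With labels of this kind, checking that two values belong to the same funding round reduces to comparing label prefixes, with no separate document per nested object.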
13. Query-time join
● Elasticsearch's Parent-Child
– Query-time join for nested data
● Model
– One (flat) document per record
– At index time, child documents should specify their parent ID with the _parent field
– Joins computed at query time
{
  "company": {},
  "investment" : {
    "_parent" : {
      "type" : "company"
    }
  },
  "investor" : {
    "_parent" : {
      "type" : "investment"
    }
  }
}
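As a sketch of how this mapping might be used, the Python below builds (without sending) an index request for a child document and a `has_child` search body. The index name `crunchbase`, the document IDs and the `round_code` field on the `investment` type are assumptions made for the example. The `parent` request parameter and the `has_child` query follow the Elasticsearch parent-child documentation of that era.

```python
import json
from urllib.parse import urlencode

INDEX = "crunchbase"  # hypothetical index name


def child_index_request(doc_type, doc_id, parent_id, source):
    """Return (method, path, body) for indexing a child under the given parent ID."""
    query = urlencode({"parent": parent_id})
    return "PUT", f"/{INDEX}/{doc_type}/{doc_id}?{query}", json.dumps(source)


def companies_with_round(round_code):
    """Search body: parent companies having at least one matching child."""
    return {
        "query": {
            "has_child": {
                "type": "investment",  # child type from the mapping above
                "query": {"match": {"round_code": round_code}},
            }
        }
    }


method, path, payload = child_index_request(
    "investment", "a1", "elastic", {"round_code": "A", "founded_year": 2012}
)
search_body = companies_with_round("A")
print(method, path)
print(json.dumps(search_body))
```

Since the child carries only a reference to its parent, either side can be updated without reindexing the other, which is what makes this model update friendly.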
14. Query-time join
● Pros:
– Update friendly
– No loss of precision
– Data locality: parent and child on same shard
● Cons:
– Slower than index-time solutions
– Higher memory usage than nested
– Data duplication (child)
● A child cannot have more than one parent
– Index-time data materialisation
name : Elastic
country_code : A
...
round_code : A
founded_year : 2012
...
Name : Data Collective
Type : Org
Name : Benchmark
Type : Org
round_code : B
founded_year : 2013
...
Name : Index Ventures
Type : Org
Name : Benchmark
Type : Org
15. Query-time join
● FilterJoin Plugin
– Query-time join for relational data
● Inspired by #3278
● Model
– One (flat) document per record
– At index time, documents should specify the IDs of their related documents in a given field
– At query time, lookup ID values from a given field to filter documents from another index
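The two-step model above can be simulated in plain Python. This is a sketch of the general lookup-then-filter idea using in-memory lists as stand-ins for two indices. It does not reproduce the plugin's API, and the `filter_join` helper, the `investor_ids` field and the document IDs are all invented for the example.

```python
def filter_join(source_docs, source_filter, id_field, target_docs):
    """Filter target_docs by the IDs referenced from matching source_docs."""
    # Step 1: look up the ID values stored in id_field of the matching source documents.
    ids = set()
    for doc in source_docs:
        if source_filter(doc):
            ids.update(doc.get(id_field, []))
    # Step 2: keep only the target documents whose ID was collected.
    return [doc for doc in target_docs if doc["id"] in ids]


# "Index" of funding rounds: each round stores the IDs of its investors.
rounds = [
    {"id": "r1", "round_code": "A", "investor_ids": ["i1", "i2"]},
    {"id": "r2", "round_code": "B", "investor_ids": ["i1", "i3"]},
]

# Separate "index" of investors: each investor is stored exactly once.
investors = [
    {"id": "i1", "name": "Benchmark"},
    {"id": "i2", "name": "Data Collective"},
    {"id": "i3", "name": "Index Ventures"},
]

round_a_investors = filter_join(
    rounds, lambda d: d["round_code"] == "A", "investor_ids", investors
)
names = [d["name"] for d in round_a_investors]
print(names)
```

Benchmark appears once in the investor list even though it took part in both rounds. In a distributed setting the collected IDs have to travel between nodes, which is the cost of avoiding duplication.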
16. Query-time join
● Pros:
– Update friendly
– No loss of precision
– No data duplication
– No index-time data materialisation
● Cons:
– Slower than parent-child
– No data locality principle: network transfer
name : Elastic
country_code : A
...
round_code : A
founded_year : 2012
...
Name : Data Collective
Type : Org
round_code : B
founded_year : 2013
...
Name : Index Ventures
Type : Org
Name : Benchmark
Type : Org
17. ● Each solution has its own advantages and disadvantages
– Trade-off between performance, scalability and flexibility
             BlockJoin            SIREn                 Parent-Child         FilterJoin
Performance  ++                   ++                    +                    -
Scalability  +                    ++                    +                    +
Flexibility  -                    -                     +                    ++
Best for     Simple nested model  Complex nested model  Simple nested model  Relational model
             Fixed data           Fixed data            Dynamic data         Dynamic data
Summary