Elasticsearch Sharding Strategy at Tubular Labs
How we arrived at a sharding strategy
Our Elasticsearch Infrastructure?
• 3 clusters for search/aggregations
• 1 small autocomplete cluster
• 1 medium-sized cluster for internal use
• 1 Elastic Stack cluster
Our Elasticsearch Clusters
• 2.5 billion documents
• 4TB not including replicas
• Constant indexing load with periodic spikes
• Queries range from simple search requests to heavy terms aggregations
• Not many concurrent queries, but queries can be demanding
• Cluster is very CPU heavy
• Recently migrated from Elasticsearch 1.7 to 2.3
Our Largest Cluster
• We have to reindex anyway
• Our dataset has grown substantially
• Performance wasn’t great
• We don’t want to have to reindex in the near future
Migrating to 2.x is a good time to reconsider sharding
Sharding Strategy
• How many shards should I have per index?
• How large should my shards be?
• How many shards should I have per node?
• What hardware/instance type should I use?
Sharding Questions...
• How large is your dataset?
• How fast will your dataset grow?
• What kinds of queries are you running?
• How fast will usage grow?
• When do you want to reindex next?
• I’m sure there are more...
It Depends...
How do we get answers?
Repeatable Elasticsearch Experiments
What We Want
• Repeatable
• Others can easily run the same tests and should get about the same results
• Easily modified
• Easy to define and understand
• Easy to run
• Understandable results
Repeatable Elasticsearch Experiments:
• Benchmarking framework for Elasticsearch
• Easily define a set of repeatable tests
• Tests are defined in JSON
• Compare different configurations
• Sets up a single-node cluster for tests, or targets existing (external) clusters
• Targeting external clusters is not fully supported, and you'll get warnings telling you as much
What is Rally?
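To make this concrete, here is a hedged sketch of invoking Rally in both modes; the track, challenge, and host names below are placeholders, and exact flag names depend on the Rally version:

# Let Rally provision its own single-node cluster and run a track/challenge/car combination
esrally --track=videos --challenge=shard-size --car=defaults

# Or target an existing (external) cluster; only partially supported, Rally will warn you
esrally --pipeline=benchmark-only --target-hosts=es-test-1:9200 --track=videos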
Terms
• Track - a benchmarking scenario
• Car - the system (Elasticsearch) configuration for a benchmark
• Challenge - which benchmarks are run and their configuration
• Race - an actual run of the benchmark
• Tournament - a way to analyze the impact of changes
What is Rally?
Example track config
https://gist.github.com/mdelaney/b710fb3d25fabf7818f471bd4abe70a5
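(The gist holds the real config.) As a rough sketch only, not the contents of the gist, a track definition pairs index definitions with the operations and challenges that schedule them; all names below are invented and the exact JSON schema varies between Rally versions:

# Illustrative track.json shape; schema differs across Rally versions
cat > track.json <<'EOF'
{
  "indices": [
    { "name": "videos", "body": "index.json" }
  ],
  "operations": [
    { "name": "bulk-index", "operation-type": "bulk", "bulk-size": 5000 },
    { "name": "terms-agg", "operation-type": "search",
      "body": { "size": 0, "query": { "match_all": {} } } }
  ],
  "challenges": [
    { "name": "shard-size",
      "schedule": [ { "operation": "bulk-index" }, { "operation": "terms-agg" } ] }
  ]
}
EOF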
How does Rally work?
Our Experiments and Results
NOTE: The following experiments are described as we would run them next time. Due to time constraints we actually did some of this in parallel; I'll point out where we deviated from what's shown in the next few slides.
• We're still fairly new to benchmarking Elasticsearch, so we're still learning the best way to do this.
• Running these tests answered a lot of questions (and raised brand new ones)
How we used this at Tubular Labs
How big should my shards be?
Determining a good shard size
The experiment
1. Obtain a realistic data set
2. Write the Rally config to:
• Index your data (single shard)
• Run a set of common queries
3. Run benchmark with different document counts
4. Graph the results
Determining a good shard size
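For step 2, a minimal sketch of creating the single-shard index before loading data; the index name is a placeholder and the syntax is Elasticsearch 2.x-era:

# Hypothetical: one primary shard and no replicas, so each run measures a single shard in isolation
curl -XPUT 'http://localhost:9200/videos_shard_test' -d '{
  "settings": { "number_of_shards": 1, "number_of_replicas": 0 }
}'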
The queries we used
• Queries A and B:
• Very similar, but aggregate on slightly different sets of terms
• Hit about 10% of our dataset
• Queries C and D:
• Same aggregations as queries A and B
• Hit the full dataset
Determining a good shard size
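The actual queries aren't shown here, but their general shape is a filter selecting a slice of the data plus a terms aggregation over it. A hedged sketch with invented field names (Elasticsearch 2.x DSL); dropping the filter gives the full-dataset variant:

# Illustrative only; field names and values are made up
curl -XPOST 'http://localhost:9200/videos_shard_test/_search' -d '{
  "size": 0,
  "query": { "bool": { "filter": { "term": { "platform": "youtube" } } } },
  "aggs": {
    "top_creators": { "terms": { "field": "creator_id", "size": 100 } }
  }
}'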
Our results
Determining a good shard size
We need to consider
• How fast do you need each query to be?
• How much do you expect your data set to grow before you want to look at reindexing again?
• Your use case will likely have other concerns as well
Determining a good shard size
How many shards per node?
Determining how many shards per node
The experiment (almost the same as before)
1. Obtain a realistic dataset
2. Write the Rally config to:
• Index your data
• Run a set of common queries
3. Run benchmark with different shard counts
4. Graph the results
Determining how many shards per node
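For step 3 the variable this time is the shard count rather than the document count; one hedged way to set up the test indices (index names invented, shards-per-node values from the next slide, 2 data nodes assumed):

# Hypothetical setup: with 2 data nodes, total primaries = shards-per-node * 2
for per_node in 10 16 32; do
  total=$((per_node * 2))
  curl -XPUT "http://localhost:9200/videos_${per_node}_per_node" -d '
  { "settings": { "number_of_shards": '"$total"', "number_of_replicas": 0 } }'
done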
What we did differently this time (time constraints)
• Used the Apache HTTP server benchmarking tool (ab) with a script to run the queries
• Our production cluster had 26 data nodes with about 200 million documents each
• Wanted to avoid expanding the cluster further if at all possible (c3.8xlarge is pricey!)
• 10 total shards per node (about 20 million docs/shard)
• 16 total shards per node (about 12.5 million docs/shard)
• 32 total shards per node (about 6.25 million docs/shard)
• Tested on 3-node clusters (2 data nodes, 1 client/master)
Determining how many shards per node
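As a hedged sketch of that approach, each query body can be saved to a file and replayed with ab; the request count, concurrency, file name, and URL below are placeholders, not our actual values:

# -n = total requests, -c = concurrent requests, -p = file containing the POST body
ab -n 200 -c 2 -p query_a.json -T 'application/json' \
  'http://test-data-node-1:9200/videos/_search'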
Our Results - Testing Number of Shards per node
[Chart: query response by shard count (C 1)]
[Chart: query response by shard count (C 3)]
Our Results - Testing Number of Shards per node
[Chart: query response, production vs. test (C 1)]
[Chart: query response, production vs. test (C 3)]
Production - 26 data nodes
Test Cluster - 2 data nodes
• Significant performance drop at each level of testing. Why?
• A single shard on a single node performed much better than our multiple-shards-per-node tests
• The fully loaded 3-node cluster performed much better than our full cluster in production
• Impact of moving to a machine with more memory
• Will the extra file system cache make a large difference?
New Questions Raised
Query load isn’t evenly distributed
Current path of performance investigation
[Diagram: layout of shards 1-15 across the cluster's nodes, with a subset of shards starred, illustrating that query load is not spread evenly across nodes]
Problems We Encountered
Rally related
• With nested documents, the document count in track.json != the document count Rally checks at the end of indexing
• Multi-node support not yet available
Problems We Encountered?
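For the nested-document count mismatch, one simple cross-check (a sketch, index name invented) is to ask Elasticsearch for the top-level document count directly, since nested objects are stored as additional hidden Lucene documents and can make lower-level counts look inflated:

# Counts only top-level documents, regardless of how many nested Lucene docs they expand into
curl -XGET 'http://localhost:9200/videos_shard_test/_count?pretty'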
Non-Rally related
• Performance in reality wasn't as good as our testing suggested it should be
• We haven't found the reason for this yet
• We've noticed a correlation between the number of shards a query hits per node and the time taken to run the query on the shard, but we have not yet identified the bottleneck
• We were able to mitigate this by adding additional data nodes
Problems We Encountered?
Thank You!
Questions??