Scaling SolrCloud to a large
number of Collections
Shalin Shekhar Mangar, Lucidworks Inc.
shalin@apache.org
twitter.com/shalinmangar
Apache Solr has a huge install base and tremendous momentum.
Solr: most widely used search
solution on the planet.
8M+
total downloads
Solr is both established & growing
250,000+
monthly downloads
Solr has tens of thousands
of applications in production.
You use Solr every day.
Largest community of developers.
2500+ open Solr jobs.
Solr scalability is unmatched.
• box.com (Dropbox for business)
• 10TB+ Index Size
• 10 Billion+ Documents
• 100 Million+ Daily Requests
The traditional search use-case
• One large index distributed across multiple nodes
• A large number of users sharing the data
• Searches across the entire cluster
Example: Product Catalog
Must search across all products
What is SolrCloud?
Subset of optional features in Solr to enable
and simplify horizontally scaling a search index
using sharding and replication.
Goals
scalability, performance, high-availability,
simplicity, and elasticity
Terminology
• ZooKeeper: Distributed coordination service that provides centralised configuration,
cluster state management, and leader election
• Node: JVM process bound to a specific port on a machine
• Collection: Search index distributed across multiple nodes with same configuration
• Shard: Logical slice of a collection; each shard has a name, hash range, leader and
replication factor. Documents are assigned to one and only one shard per collection
using a hash-based document routing strategy
• Replica: A copy of a shard in a collection
• Overseer: A special node that executes cluster administration commands and writes
updated state to ZooKeeper. Automatic failover and leader election.
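A minimal Python sketch of the hash-based routing described in the Shard definition above: the hash space is cut into one contiguous range per shard, and a document goes to the one shard whose range contains the hash of its id. The helper names, the unsigned 32-bit hash space, and the use of zlib.crc32 as the hash are assumptions for illustration, not Solr's actual router.

```python
import zlib


def shard_ranges(num_shards, bits=32):
    """Split the hash space [0, 2**bits) into contiguous, near-equal ranges."""
    space = 2 ** bits
    step = space // num_shards
    ranges = []
    for i in range(num_shards):
        lo = i * step
        # The last shard absorbs any remainder so the whole space is covered.
        hi = space - 1 if i == num_shards - 1 else (i + 1) * step - 1
        ranges.append((f"shard{i + 1}", lo, hi))
    return ranges


def route(doc_id, ranges):
    """Return the name of the single shard whose range contains the id's hash."""
    h = zlib.crc32(doc_id.encode("utf-8"))  # stand-in hash (assumption), unsigned 32-bit
    for name, lo, hi in ranges:
        if lo <= h <= hi:
            return name
    raise ValueError("hash outside all ranges")


ranges = shard_ranges(2)
target = route("doc-1", ranges)  # always the same shard for the same id
```

Because the ranges do not overlap and together cover the whole space, every id lands on exactly one shard, and routing the same id again gives the same answer.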
Collection with 2 shards across 4 nodes with replication factor 2
[Diagram: collection "logstash4solr" on four Jetty nodes, each running the Solr webapp in its own Java VM. Node 1 (port 8983) hosts the shard1 leader, node 2 (port 8984) a shard1 replica, node 3 (port 8985) the shard2 leader, node 4 (port 8986) a shard2 replica. Sharding splits the collection across leaders; replication copies each shard to its replica. A ZooKeeper ensemble (Zookeeper 1, 2, 3) provides leader election and centralized configuration management. Clients use HTTP APIs (XML/JSON/CSV/PDF) from Java/Ruby/Python/PHP. Millions of documents, millions of users.]
“The limits of the possible can only be
defined by going beyond them into the
impossible” — Arthur C. Clarke
The curious case of multi-tenant platforms
• Multi-tenant platform for storage and search
• Thousands of tenant applications
• Each tenant application has millions of users
One SolrCloud collection per tenant
• Searches are specialised to a user’s data or the
tenant application’s dataset
• Some tenants create a lot of data, others very little
• Some use CPU intensive geo-spatial queries, some
just perform simple full text searches and sorting
• Some are write-heavy, others read-heavy
• Some have text in a different natural language
Measure and optimise
• Analyze and find missing features
• Setup a performance testing environment on AWS
• Devise tests for stability and performance
• Find bugs and bottlenecks and fix ’em
Problem #1: Cluster state and updates
• The SolrCloud cluster state has information about the
collections, their shards and replicas
• All nodes and (Java) clients watch the cluster state
• Every state change is notified to all nodes
• Cluster state is limited to (slightly less than) 1MB by
default, the ZooKeeper znode size limit
• Bouncing 1 node triggers a few hundred watcher fires and
pulls from ZK on a 100-node cluster (three states:
down, recovering, active)
Solution - Split cluster state and scale
• Each collection gets its own state node in ZK
• Nodes selectively watch only those states which
they are a member of
• Clients cache state and use smart cache updates
instead of watching nodes
• http://issues.apache.org/jira/browse/SOLR-5473
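The effect of splitting the cluster state can be sketched with a toy model in Python. It only counts notifications; it does not talk to ZooKeeper. The function names and the layout (100 nodes, 25 tenant collections, 4 nodes per collection) are assumptions chosen to match the 100-node example above.

```python
def notifications_shared(num_nodes, state_changes):
    """Single shared cluster state: every node is notified of every change."""
    return num_nodes * state_changes


def notifications_split(hosting, changed_collection, state_changes):
    """Per-collection state: only nodes hosting that collection are notified."""
    watchers = [node for node, colls in hosting.items() if changed_collection in colls]
    return len(watchers) * state_changes


# Hypothetical layout: node i hosts collection tenant(i % 25), so 4 nodes per tenant.
hosting = {f"node{i}": {f"tenant{i % 25}"} for i in range(100)}

# One node bounce walks a replica through three states: down, recovering, active.
shared = notifications_shared(100, 3)               # 300
split = notifications_split(hosting, "tenant7", 3)  # 12
```

In this model a bounce costs 300 notifications with one shared state and 12 with per-collection state, which is why the fan-out stops growing with the total number of nodes.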
Problem #2: Overseer performance
• Thousands of collections create a lot of state
updates
• Overseer falls behind and replicas can’t recover or
can’t elect a leader
• Under high indexing/search load, GC pauses can
cause overseer queue to back up
Solution - Improve the overseer
• Harden the overseer code against ZooKeeper
connection loss (SOLR-5325)
• Optimise polling for new items in overseer queue
(SOLR-5436)
• Dedicated overseer nodes (SOLR-5476)
• New Overseer Status API (SOLR-5749)
• Asynchronous execution of collection commands
(SOLR-5477, SOLR-5681)
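A hedged Python sketch of how these overseer features surface in the Collections API: it only builds request URLs and sends nothing. The host, collection name, node name, and request id are hypothetical; the action and parameter names (async, REQUESTSTATUS with requestid, OVERSEERSTATUS, ADDROLE with role and node) are the ones I understand the linked issues to have introduced, so check them against the reference guide for your Solr version.

```python
from urllib.parse import urlencode

BASE = "http://localhost:8983/solr/admin/collections"  # hypothetical host


def collections_url(action, params=None, base=BASE):
    """Build a Collections API URL for the given action and parameters."""
    query = {"action": action}
    query.update(params or {})
    return f"{base}?{urlencode(query)}"


# Submit a long-running command asynchronously, then poll for its status.
submit = collections_url(
    "SPLITSHARD", {"collection": "tenant42", "shard": "shard1", "async": "1001"}
)
poll = collections_url("REQUESTSTATUS", {"requestid": "1001"})

# Inspect the overseer and nominate a node as a preferred overseer.
status = collections_url("OVERSEERSTATUS")
role = collections_url("ADDROLE", {"role": "overseer", "node": "host1:8983_solr"})
```

Submitting with an id and polling separately keeps the client from holding an HTTP connection open while the overseer works through its queue.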
Problem #3: Moving data around
• Not all users are born equal - A tenant may have a
few very large users
• We wanted to be able to scale an individual user’s
data, maybe even as its own collection
• SolrCloud can split shards with no downtime but it
only splits in half
• No way to ‘extract’ user’s data to another collection
or shard
Solution: Improved data management
• Shard can be split on arbitrary hash ranges
(SOLR-5300)
• Shard can be split by a given key (SOLR-5338,
SOLR-5353)
• A new ‘migrate’ API to move a user’s data to
another (new) collection without downtime
(SOLR-5308)
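A small Python sketch of what splitting on arbitrary hash ranges involves: cut a parent range into uneven contiguous pieces and format them the way the SPLITSHARD ranges parameter expects (comma-separated hexadecimal ranges, as I understand it). The helper names, the example range, and the collection names are hypothetical; the parameter names ranges, split.key, and target.collection should be confirmed against the Collections API documentation.

```python
def split_range(lo, hi, weights):
    """Cut the inclusive range [lo, hi] into contiguous sub-ranges sized by weights."""
    total = sum(weights)
    size = hi - lo + 1
    ranges, start, acc = [], lo, 0
    for i, w in enumerate(weights):
        acc += w
        # The last piece always ends at hi so nothing is left uncovered.
        end = hi if i == len(weights) - 1 else lo + size * acc // total - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


def ranges_param(ranges):
    """Format sub-ranges as comma-separated hexadecimal pairs."""
    return ",".join(f"{a:x}-{b:x}" for a, b in ranges)


def split_by_ranges_params(collection, shard, ranges):
    """Request parameters for splitting a shard on explicit hash ranges."""
    return {"action": "SPLITSHARD", "collection": collection,
            "shard": shard, "ranges": ranges_param(ranges)}


def migrate_params(source, target, split_key):
    """Request parameters for moving one routing key's documents elsewhere."""
    return {"action": "MIGRATE", "collection": source,
            "target.collection": target, "split.key": split_key}


thirds = split_range(0, 1499, [1, 1, 1])  # [(0, 499), (500, 999), (1000, 1499)]
params = split_by_ranges_params("tenant42", "shard1", thirds)
move = migrate_params("tenant42", "bigUser", "user17!")
```

Uneven weights such as [1, 3] give a small and a large sub-shard, which is the case an even split cannot express.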
Problem #4: Exporting data
• Lucene/Solr are designed for finding top-N search
results
• Trying to export full result set brings down the
system due to high memory requirements as you
go deeper
Solution - Distributed deep paging
New ‘cursorMark’ feature for deep paging (SOLR-5463)
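The cursor loop can be sketched in Python as follows. The client starts with cursorMark=*, passes the returned nextCursorMark on each following request, and stops when the mark no longer changes. The fetch_page callable and the in-memory fake below are hypothetical stand-ins for a real Solr request (which also needs a sort that includes the unique key field); only the loop shape is the point.

```python
def export_all(fetch_page, rows=100):
    """Yield every document by following cursor marks until they stop changing."""
    cursor = "*"
    while True:
        page = fetch_page(cursor, rows)
        yield from page["docs"]
        next_cursor = page["nextCursorMark"]
        if next_cursor == cursor:  # same mark returned: nothing left to read
            break
        cursor = next_cursor


def make_fake_fetch(ids):
    """In-memory stand-in for a search request; the cursor is just an offset."""
    ordered = sorted(ids)

    def fetch_page(cursor, rows):
        start = 0 if cursor == "*" else int(cursor)
        docs = ordered[start:start + rows]
        next_cursor = str(start + len(docs)) if docs else cursor
        return {"docs": docs, "nextCursorMark": next_cursor}

    return fetch_page


exported = list(export_all(make_fake_fetch(range(250)), rows=100))
```

Each request only carries the mark of the last document seen, so memory use per request stays flat however deep the export goes.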
JVM Bugs!
“The JVM is completely irresponsible and can
only be killed with ‘kill -9’”
–twitter.com/UweSays
“Testing scale” at scale
• Performance goals: 6 billion documents, 4000 queries/sec,
400 updates/sec, 2 seconds NRT sustained
performance
• 5% large collections (50 shards), 15% medium (10
shards), 85% small (1 shard) with replication factor of 3
• Target hardware: 24 CPUs, 126G RAM, 7 SSDs (460G)
+ 1 HDD (200G)
• 80% traffic served by 20% of the tenants
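One way to reproduce the 80/20 traffic skew in a load test, sketched in Python: pick a tenant from the "hot" 20% with probability 0.8, otherwise from the rest. The function name, the tenant naming, and the uniform choice within each group are assumptions; the talk does not say how its load generator distributed requests.

```python
import random


def pick_tenant(rng, tenants, hot_fraction=0.2, hot_traffic=0.8):
    """Pick a tenant so that the hot fraction receives the hot share of traffic."""
    hot_count = max(1, int(len(tenants) * hot_fraction))
    if rng.random() < hot_traffic or hot_count == len(tenants):
        return tenants[rng.randrange(hot_count)]
    return tenants[hot_count + rng.randrange(len(tenants) - hot_count)]


rng = random.Random(7)  # fixed seed so a run can be repeated
tenants = [f"tenant{i}" for i in range(1000)]
hot = set(tenants[:200])
picks = [pick_tenant(rng, tenants) for _ in range(10000)]
hot_share = sum(p in hot for p in picks) / len(picks)  # close to 0.8
```

Changing hot_fraction and hot_traffic to 0.05 and 0.95 models the 95-5 ratio the client turned out to have.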
How to manage large SolrCloud clusters
• Developed Solr Scale Toolkit
• Fabric based tool to setup and manage SolrCloud
clusters in AWS complete with collectd and SiLK
• Backup/Restore from S3. Parallel clone commands.
• Open source!
• https://github.com/LucidWorks/solr-scale-tk
Gathering metrics and analysing logs
• LucidWorks SiLK (Solr + Logstash + Kibana)
• collectd daemons on each host
• RabbitMQ to queue messages before delivering to Logstash
• Initially started with Kafka but discarded it, thinking it was
overkill
• Not happy with RabbitMQ: crashes/unstable
• Might try Kafka again soon
• http://www.lucidworks.com/lucidworks-silk
Generating data and load
• Custom randomized data generator (reproducible
using a seed)
• JMeter for generating load
• Embedded CloudSolrServer (Solr Java client)
using JMeter Java Action Sampler
• JMeter distributed mode was itself a bottleneck!
• Not open source (yet) but we’re working on it!
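Since the generator itself was not published, here is only a sketch of the idea of seed-reproducible test data in Python: a private random.Random instance seeded once yields the same documents on every run. The field names, word list, and id format are hypothetical and not taken from the actual tool.

```python
import random

WORDS = ["solr", "cloud", "shard", "replica", "leader", "index", "query", "tenant"]


def generate_docs(seed, count, tenant="tenant1"):
    """Yield `count` pseudo-random documents; the same seed gives the same docs."""
    rng = random.Random(seed)  # private generator, unaffected by global state
    for i in range(count):
        yield {
            "id": f"{tenant}!doc{i}",
            "title_s": " ".join(rng.choice(WORDS) for _ in range(3)),
            "price_f": round(rng.uniform(1, 100), 2),
        }


first_run = list(generate_docs(42, 5))
second_run = list(generate_docs(42, 5))  # identical to first_run
```

Recording the seed with each test run is enough to regenerate the exact dataset when a failure needs to be replayed.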
Numb3rs
• 30 hosts, 120 nodes, 1000 collections, 8B+ docs,
15000 queries/second, 2000 writes/second, 2 second
NRT sustained over 24-hours
• More than 3x the numbers our client needed
• Unfortunately, we had to stop testing at that point :(
• Turned out they had a 95-5 traffic ratio rather than an 80-20
ratio so actual performance is even better :)
• Our biggest cluster cost us just $120/hour :)
Not over yet
• We continue to test performance at scale
• Published indexing performance benchmark,
working on others
• 15 nodes, 30 shards, 1 replica, 157195 docs/sec
• 15 nodes, 30 shards, 2 replicas, 61062 docs/sec
• http://searchhub.org/introducing-the-solr-scale-toolkit/
Our users are also pushing the limits
https://twitter.com/bretthoerner/status/476830302430437376
Up, up and away!
https://twitter.com/bretthoerner/status/476838275106091008
Not over yet
• SolrCloud continues to be improved
• SOLR-6220 - Replica placement strategy
• SOLR-6273 - Cross data center replication
• SOLR-5656 - Auto-add replicas
• SOLR-5986 - Don’t allow runaway queries to harm
the cluster
• Many, many more
Questions?
• Shalin Shekhar Mangar
• shalin@apache.org
• twitter.com/shalinmangar
• meetup.com/Bangalore-Apache-Solr-Lucene-Group/
• www.meetup.com/Bangalore-Baby-Apache-Solr-Group/

How To Use Server-Side Rendering with Nuxt.js
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 

Scaling SolrCloud to a Large Number of Collections - Fifth Elephant 2014

  • 1. Scaling SolrCloud to a large number of Collections Shalin Shekhar Mangar, Lucidworks Inc. shalin@apache.org twitter.com/shalinmangar
  • 2. Apache Solr has a huge install base and tremendous momentum. Solr: the most widely used search solution on the planet. 8M+ total downloads. Solr is both established & growing. 250,000+ monthly downloads. Solr has tens of thousands of applications in production. You use Solr every day. Largest community of developers. 2500+ open Solr jobs.
  • 3. Solr scalability is unmatched. • box.com (Dropbox for business) • 10TB+ Index Size • 10 Billion+ Documents • 100 Million+ Daily Requests
  • 4. Solr scalability is unmatched.
  • 5. The traditional search use-case • One large index distributed across multiple nodes • A large number of users sharing the data • Searches across the entire cluster
  • 6. Example: Product Catalog Must search across all products
  • 7. What is SolrCloud? A subset of optional features in Solr that enable and simplify horizontally scaling a search index using sharding and replication. Goals: scalability, performance, high availability, simplicity, and elasticity
  • 8. Terminology • ZooKeeper: Distributed coordination service that provides centralised configuration, cluster state management, and leader election • Node: JVM process bound to a specific port on a machine • Collection: Search index distributed across multiple nodes with same configuration • Shard: Logical slice of a collection; each shard has a name, hash range, leader and replication factor. Documents are assigned to one and only one shard per collection using a hash-based document routing strategy • Replica: A copy of a shard in a collection • Overseer: A special node that executes cluster administration commands and writes updated state to ZooKeeper. Automatic failover and leader election.
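The shard definition above (a hash range per shard, each document assigned to exactly one shard) can be illustrated with a minimal sketch. It assumes an unsigned 32-bit hash space split evenly between shards, and it uses `zlib.crc32` purely as a stand-in hash chosen for the sketch, not the hash Solr uses. The function names are hypothetical.

```python
import zlib

HASH_SPACE = 2 ** 32  # size of an unsigned 32-bit hash space (assumption)


def shard_ranges(num_shards):
    """Split the hash space into contiguous, non-overlapping ranges."""
    step = HASH_SPACE // num_shards
    ranges = []
    for i in range(num_shards):
        start = i * step
        # The last shard absorbs any remainder so the whole space is covered.
        end = HASH_SPACE - 1 if i == num_shards - 1 else (i + 1) * step - 1
        ranges.append((f"shard{i + 1}", start, end))
    return ranges


def route(doc_id, ranges):
    """Return the name of the single shard whose range contains hash(doc_id)."""
    h = zlib.crc32(doc_id.encode("utf-8"))  # stand-in hash, unsigned 32-bit
    for name, start, end in ranges:
        if start <= h <= end:
            return name
    raise ValueError("hash ranges do not cover the hash space")
```

Because the ranges are contiguous and do not overlap, every document id maps to one and only one shard, and the same id always maps to the same shard.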
  • 9. Collection with 2 shards across 4 nodes with replication factor 2. [Diagram: collection logstash4solr on four Jetty nodes, each running the Solr webapp in its own Java VM. Node 1 (port 8983) hosts the shard1 leader and node 2 (port 8984) the shard1 replica; node 3 (port 8985) hosts the shard2 leader and node 4 (port 8986) the shard2 replica. A ZooKeeper ensemble (Zookeeper 1, 2, 3) provides leader election and centralized configuration management. Clients use HTTP APIs (XML/JSON/CSV/PDF) from Java/Ruby/Python/PHP. Millions of documents, millions of users.]
  • 10. “The limits of the possible can only be defined by going beyond them into the impossible” — Arthur C. Clarke
  • 11. The curious case of multi-tenant platforms • Multi-tenant platform for storage and search • Thousands of tenant applications • Each tenant application has millions of users
  • 12. One SolrCloud collection per tenant • Searches are specialised to a user’s data or the tenant application’s dataset • Some tenants create a lot of data, others very little • Some use CPU intensive geo-spatial queries, some just perform simple full text searches and sorting • Some are write-heavy, others read-heavy • Some have text in a different natural language
  • 13. Measure and optimise • Analyze and find missing features • Set up a performance testing environment on AWS • Devise tests for stability and performance • Find bugs and bottlenecks and fix ’em
  • 14. Problem #1: Cluster state and updates • The SolrCloud cluster state has information about the collections, their shards and replicas • All nodes and (Java) clients watch the cluster state • Every state change is notified to all nodes • The cluster state is limited to (slightly less than) 1MB by default • In a 100-node cluster, bouncing 1 node triggers a few hundred watcher fires and pulls from ZK (three states: down, recovering, active)
  • 15. Solution - Split cluster state and scale • Each collection gets its own state node in ZK • Nodes selectively watch only the states of collections they are a member of • Clients cache state and use smart cache updates instead of watching nodes • http://issues.apache.org/jira/browse/SOLR-5473
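A back-of-the-envelope sketch of why splitting the cluster state helps. It is a simplified counting model, not Solr code: it assumes one watcher fire per watching node per state change, and the function names are hypothetical.

```python
def watcher_fires_shared(num_nodes, state_changes):
    """Single shared cluster state: every node is notified of every change."""
    return num_nodes * state_changes


def watcher_fires_split(hosting_nodes_per_collection, state_changes):
    """Per-collection state: only nodes hosting an affected collection are notified.

    hosting_nodes_per_collection maps each affected collection name to the
    set of nodes that host one of its replicas.
    """
    return sum(len(nodes) * state_changes
               for nodes in hosting_nodes_per_collection.values())


# One replica passing through down, recovering, active on a 100-node cluster.
shared = watcher_fires_shared(num_nodes=100, state_changes=3)

# Same bounce when the affected collection lives on only 4 nodes.
split = watcher_fires_split(
    {"logstash4solr": {"node1", "node2", "node3", "node4"}}, state_changes=3)
```

Under these assumptions the shared state produces 300 watcher fires for the bounce, while the per-collection state produces 12, because the cost now depends on how many nodes host the collection rather than on the size of the cluster.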
  • 16. Problem #2: Overseer performance • Thousands of collections create a lot of state updates • Overseer falls behind and replicas can’t recover or can’t elect a leader • Under high indexing/search load, GC pauses can cause overseer queue to back up
  • 17. Solution - Improve the overseer • Harden the overseer code against ZooKeeper connection loss (SOLR-5325) • Optimise polling for new items in overseer queue (SOLR-5436) • Dedicated overseer nodes (SOLR-5476) • New Overseer Status API (SOLR-5749) • Asynchronous execution of collection commands (SOLR-5477, SOLR-5681)
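A sketch of how three of these features surface through the Collections API: the `ADDROLE` action for dedicated overseer nodes, the `OVERSEERSTATUS` action, and the `async` parameter with `REQUESTSTATUS` polling. The code only builds URLs and sends nothing. The base URL, node name, collection name, and request id are hypothetical placeholders.

```python
from urllib.parse import urlencode

# Assumed base URL of any node in the cluster (hypothetical).
SOLR = "http://localhost:8983/solr"


def collections_api_url(action, params=None, base_url=SOLR):
    """Build a Collections API URL for the given action (nothing is sent)."""
    query = {"action": action}
    query.update(params or {})
    return f"{base_url}/admin/collections?{urlencode(query)}"


# Mark a node as a preferred overseer (dedicated overseer nodes).
add_role = collections_api_url(
    "ADDROLE", {"role": "overseer", "node": "10.0.0.5:8983_solr"})

# Ask the overseer for its queue and operation statistics.
overseer_status = collections_api_url("OVERSEERSTATUS")

# Submit a long-running command asynchronously, then poll for its status.
request_id = "split-tenant42-1"  # caller-chosen id (hypothetical)
split_async = collections_api_url(
    "SPLITSHARD",
    {"collection": "tenant42", "shard": "shard1", "async": request_id})
poll = collections_api_url("REQUESTSTATUS", {"requestid": request_id})
```

The asynchronous form returns immediately, so a slow operation such as a shard split does not hold an HTTP connection open while the overseer works through its queue.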
  • 18. Problem #3: Moving data around • Not all users are born equal: a tenant may have a few very large users • We wanted to be able to scale an individual user’s data, maybe even as its own collection • SolrCloud can split shards with no downtime but it only splits in half • No way to ‘extract’ a user’s data to another collection or shard
  • 19. Solution: Improved data management • Shard can be split on arbitrary hash ranges (SOLR-5300) • Shard can be split by a given key (SOLR-5338, SOLR-5353) • A new ‘migrate’ API to move a user’s data to another (new) collection without downtime (SOLR-5308)
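A sketch of splitting on arbitrary hash ranges. It cuts an inclusive hash range at caller-chosen points and formats the pieces as the comma-separated hexadecimal list that the `SPLITSHARD` action accepts in its `ranges` parameter. The helper names, collection name, and base URL are hypothetical, and the numbers are illustrative only.

```python
from urllib.parse import urlencode


def split_ranges(start, end, split_points):
    """Cut the inclusive hash range [start, end] at the given points.

    Each split point becomes the first hash value of a new sub-range.
    """
    bounds = [start] + sorted(split_points) + [end + 1]
    pieces = []
    for lo, hi in zip(bounds, bounds[1:]):
        if lo >= hi:
            raise ValueError("split points must fall strictly inside the range")
        pieces.append((lo, hi - 1))
    return pieces


def ranges_param(pieces):
    """Format sub-ranges as comma-separated hexadecimal pairs."""
    return ",".join(f"{lo:x}-{hi:x}" for lo, hi in pieces)


# Uneven three-way split of the range 0-7d0 (hex).
pieces = split_ranges(0x0, 0x7D0, [0x1F5, 0x3E9])
params = {"action": "SPLITSHARD", "collection": "tenant42",
          "shard": "shard1", "ranges": ranges_param(pieces)}
url = "http://localhost:8983/solr/admin/collections?" + urlencode(params)
```

Splitting by key uses the `split.key` parameter (for example `split.key=user1!`) in place of `ranges`, and the `MIGRATE` action takes `collection`, `target.collection`, and `split.key` to move the documents for that key into another collection.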
  • 20. Problem #4: Exporting data • Lucene/Solr are designed for finding top-N search results • Trying to export the full result set brings down the system because memory requirements grow as you page deeper
  • 21. Solution - Distributed deep paging New ‘cursorMark’ feature for deep paging (SOLR-5463)
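A minimal sketch of a cursor-based export loop. It relies on the documented cursor contract: start with `cursorMark=*`, sort on a clause that includes the uniqueKey field, pass each response's `nextCursorMark` into the next request, and stop when the cursor no longer changes. `fetch_page` is a hypothetical caller-supplied function that sends the parameters to a `/select` handler and returns the parsed JSON response; the uniqueKey field is assumed to be `id`.

```python
def export_all(fetch_page, query="*:*", unique_key="id", rows=500):
    """Yield every matching document by following cursor marks.

    fetch_page(params) is a caller-supplied function (hypothetical) that
    sends params to a /select handler and returns the parsed JSON response.
    """
    cursor = "*"  # "*" asks for the first page
    while True:
        response = fetch_page({
            "q": query,
            "rows": rows,
            # Cursors require a sort that includes the uniqueKey field.
            "sort": f"{unique_key} asc",
            "cursorMark": cursor,
        })
        yield from response["response"]["docs"]
        next_cursor = response["nextCursorMark"]
        if next_cursor == cursor:  # unchanged cursor means no more results
            return
        cursor = next_cursor
```

Unlike `start`/`rows` paging, each request here asks only for the next `rows` documents after the cursor position, so the cost of a page does not grow with how deep into the result set the export has gone.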
  • 22. JVM Bugs! “The JVM is completely irresponsible and can only be killed with ‘kill -9’” (twitter.com/UweSays)
  • 23. “Testing scale” at scale • Performance goals: 6 billion documents, 4000 queries/sec, 400 updates/sec, 2 seconds NRT sustained performance • 5% large collections (50 shards), 15% medium (10 shards), 85% small (1 shard) with replication factor of 3 • Target hardware: 24 CPUs, 126G RAM, 7 SSDs (460G) + 1 HDD (200G) • 80% traffic served by 20% of the tenants
  • 24.
  • 25.
  • 26.
  • 27. How to manage large SolrCloud clusters • Developed Solr Scale Toolkit • Fabric-based tool to set up and manage SolrCloud clusters in AWS, complete with collectd and SiLK • Backup/Restore from S3. Parallel clone commands. • Open source! • https://github.com/LucidWorks/solr-scale-tk
  • 28. Gathering metrics and analysing logs • LucidWorks SiLK (Solr + Logstash + Kibana) • collectd daemons on each host • RabbitMQ to queue messages before delivering to Logstash • Initially started with Kafka but discarded it, thinking it was overkill • Not happy with RabbitMQ: crashes/unstable • Might try Kafka again soon • http://www.lucidworks.com/lucidworks-silk
  • 29. Generating data and load • Custom randomized data generator (reproducible using a seed) • JMeter for generating load • Embedded CloudSolrServer (Solr Java client) using JMeter Java Action Sampler • JMeter distributed mode was itself a bottleneck! • Not open source (yet) but we’re working on it!
  • 30. Numb3rs • 30 hosts, 120 nodes, 1000 collections, 8B+ docs, 15000 queries/second, 2000 writes/second, 2 second NRT sustained over 24 hours • More than 3x the numbers our client needed • Unfortunately, we had to stop testing at that point :( • Turned out they had a 95-5 traffic ratio rather than an 80-20 ratio, so actual performance is even better :) • Our biggest cluster cost us just $120/hour :)
  • 31. Not over yet • We continue to test performance at scale • Published indexing performance benchmark, working on others • 15 nodes, 30 shards, 1 replica, 157195 docs/sec • 15 nodes, 30 shards, 2 replicas, 61062 docs/sec • http://searchhub.org/introducing-the-solr-scale-toolkit/
  • 32. Our users are also pushing the limits https://twitter.com/bretthoerner/status/476830302430437376
  • 33. Up, up and away! https://twitter.com/bretthoerner/status/476838275106091008
  • 34. Not over yet • SolrCloud continues to be improved • SOLR-6220 - Replica placement strategy • SOLR-6273 - Cross data center replication • SOLR-5656 - Auto-add replicas • SOLR-5986 - Don’t allow runaway queries to harm the cluster • Many, many more
  • 35. Questions? • Shalin Shekhar Mangar • shalin@apache.org • twitter.com/shalinmangar • meetup.com/Bangalore-Apache-Solr-Lucene-Group/ • www.meetup.com/Bangalore-Baby-Apache-Solr-Group/