Scaling to 30,000 Requests Per Second
and Beyond
with MongoDB
Mike Chesnut
Director of Operations Engineering
Crittercism
MongoDB World
June 23-25
world.mongodb.com
Code: 25GN for 25% off
What I’ll Talk About
● Crittercism - Overview
● Router (mongos) Architecture
● Sharding Considerations
● The Balancer and Me
● Q&A
How a Startup Gets Started
● Pick something and go with it
● Make mistakes along the way
● Correct the mistakes you can
● Work around the ones you can’t
Critter-What?
A Brief History...
Architecture
[Diagram, built up over several slides. Stage 1: API, Feedback. Stage 2 adds App Loads, Crashes, Handled Exceptions. Stage 3 adds DynamoDB and Metadata. Stage 4 adds a second API, Performance Data, and Geo Data.]
Critter-What?
… Which brings us to today.
Critter-What?
● feedback widget
● crash reporting
● live stats
● crash grouping
● app performance management
● geo data
● user analytics
● executive dashboard
Architecture
[Diagram: the complete architecture (DynamoDB, two APIs, Feedback, App Loads, Crashes, Handled Exceptions, Metadata, Performance Data, Geo Data), now labeled 40,000+ req/s.]
Growth
Router Architecture
[Diagram: Client Application(s) and MongoDB Cluster. Each application server runs a client process alongside its own mongos; the cluster consists of three replica sets, each with three mongod servers.]
Single mongos per client problems we encountered:
● thousands of connections to config servers
● config server CPU load
● configdb propagation delays
Router Architecture
[Diagram: Client Application(s), Router Tier, and MongoDB Cluster. The same application servers, mongos processes, and three replica sets of three mongod servers, with a Router Tier label added between the clients and the cluster.]
Router Architecture
Separate mongos tier advantages:
● greatly reduced number of connections to each mongod
● far fewer hosts talking to the config servers
● much faster configdb propagation
Disadvantages:
● additional network hop
● fewer points of failure
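To make the connection-count advantage concrete, here is a rough sketch. Every number in it (application server count, router tier size, pool size) is an assumption chosen for illustration, not a figure from the talk, and holding the pool size constant across both layouts is a simplification.

```javascript
// Hypothetical sketch: rough count of connections each mongod sees under the
// two router layouts. All numbers below are illustrative assumptions.
function connectionsPerMongod(mongosCount, poolSizePerMongos) {
  // Each mongos may keep up to poolSizePerMongos connections open to a mongod.
  return mongosCount * poolSizePerMongos;
}

const appServers = 500;     // assumed: one mongos on every application server
const routerTierSize = 20;  // assumed: dedicated mongos hosts in the router tier
const poolSize = 10;        // assumed: connections each mongos holds per mongod

const perClientLayout = connectionsPerMongod(appServers, poolSize);
const routerTierLayout = connectionsPerMongod(routerTierSize, poolSize);

console.log(perClientLayout);   // 5000
console.log(routerTierLayout);  // 200
```

The same multiplication applies to the config servers: fewer mongos processes means fewer hosts polling them for metadata.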
Sharding Considerations
Pick something you want to live with.
The Balancer and Me
Why wouldn’t you run the balancer in the first place?
● great question
● for us, it’s because we deleted a ton of data at one point, and left a bunch of holes
○ we turned it off while deleting this data
○ and then were unable to turn it back on
● but maybe you start without it
● or maybe you need to turn it off for maintenance and forget to turn it back on
Obviously, don’t do this. But if you do, here’s what happens...
Fresh, new, empty cluster… But no balancer running.
Now we’re pretty full, so let’s add another shard...
And keep inserting...
Suddenly we find ourselves with a very unbalanced cluster.
But if we enable the balancer, it will DoS the 5th shard!
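A back-of-the-envelope sketch of why this hurts: if the balancer aims for an even spread, nearly every migration it schedules lands on the one new shard. The shard names and chunk counts below are made up for illustration.

```javascript
// Hypothetical sketch: how many chunks an even spread would push onto a newly
// added, nearly empty shard. Shard names and chunk counts are made-up examples.
function chunksOwedToNewShard(chunkCounts, newShard) {
  const shards = Object.keys(chunkCounts);
  const total = shards.reduce((sum, s) => sum + chunkCounts[s], 0);
  // An even spread gives every shard roughly total / shardCount chunks.
  const evenShare = Math.floor(total / shards.length);
  return Math.max(0, evenShare - chunkCounts[newShard]);
}

const counts = { shard1: 1000, shard2: 1000, shard3: 1000, shard4: 1000, shard5: 0 };
console.log(chunksOwedToNewShard(counts, "shard5")); // 800
```

In this example all 800 incoming migrations write to shard5 while it is also taking its share of live traffic.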
The approximate effect looks something like this:
So what can we do?
1. add IOPS
2. make sure your config servers have plenty of CPU (and IOPS)
3. slowly move chunks manually
4. approach a balanced state
5. hold your breath
6. try re-enabling the balancer
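One way to decide when step 4 has been reached is to compare the chunk counts per shard before attempting step 6. The helper below is a sketch only; the function name and the maxSpread tolerance are assumptions for illustration, not MongoDB settings.

```javascript
// Hypothetical helper: decide whether chunk counts are close enough to even
// that re-enabling the balancer should trigger only a few migrations.
// maxSpread is an assumed tolerance, not a MongoDB setting.
function isRoughlyBalanced(chunkCounts, maxSpread) {
  const values = Object.values(chunkCounts);
  const spread = Math.max(...values) - Math.min(...values);
  return spread <= maxSpread;
}

console.log(isRoughlyBalanced({ shard1: 1000, shard2: 1000, shard5: 200 }, 8)); // false
console.log(isRoughlyBalanced({ shard1: 804, shard2: 800, shard5: 798 }, 8));   // true

// Once this returns true, the balancer can be switched back on from a
// mongos shell with: sh.setBalancerState(true)
```

The per-shard counts would come from the config database; sh.status() in a mongos shell prints a similar summary.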
How to manually balance:
1. determine a chunk on a hot shard
2. monitor effects on both the source and target shards
3. move the chunk
4. allow the system to settle
5. repeat
1. determine a chunk on a hot shard
mongos> use config
mongos> db.chunks.find({"shard": "<shard_name>", "ns": "<db_name>.<collection>"}).limit(1).pretty()
You’ll get a single chunk document with its min and max bounds; note the shard key value and ObjectId in min.
"min" : {
    "unsymbolized_hash" : "1572663b72e87[...]",
    "_id" : ObjectId("50b97db98238[...]")
},
2. monitor effects on both the source and target shards
iostat -xhm 1
mongostat
3. move the chunk
mongos> sh.moveChunk("<db_name>.<collection>", {
"unsymbolized_hash" : "1572663b72e87[...]",
"_id" : ObjectId("50b97db98238[...]") },
"<target_shard>")
4. allow the system to settle
5. repeat
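Steps 1 and 3 can be tied together with a small helper that picks a target shard and turns a chunk document into the three arguments sh.moveChunk() takes. This is a sketch under assumptions: the function names are hypothetical, the chunk values are made up, and plain strings stand in for ObjectId so the code runs outside the mongo shell.

```javascript
// Hypothetical sketch: turn a chunk document (shaped like the config.chunks
// output from step 1) into the three arguments sh.moveChunk() takes.
function pickColdestShard(chunkCounts) {
  // The shard with the fewest chunks is the migration target.
  return Object.keys(chunkCounts).reduce((a, b) =>
    chunkCounts[b] < chunkCounts[a] ? b : a
  );
}

function buildMoveChunkArgs(chunkDoc, targetShard) {
  // A chunk's "min" bound is inclusive, so it lies inside the chunk and can
  // serve as the query document that identifies which chunk to move.
  return [chunkDoc.ns, chunkDoc.min, targetShard];
}

// Made-up example values; the real ones come from step 1.
const chunk = {
  ns: "mydb.mycollection",
  shard: "shard1",
  min: { unsymbolized_hash: "aaaa", _id: "id-1" },
  max: { unsymbolized_hash: "bbbb", _id: "id-2" },
};
const target = pickColdestShard({ shard1: 1000, shard2: 950, shard5: 200 });
console.log(target); // shard5
console.log(JSON.stringify(buildMoveChunkArgs(chunk, target)));
```

Moving one chunk per iteration, and watching iostat and mongostat between moves, keeps the procedure in line with "allow the system to settle".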
Conclusion here:
Run the balancer.
Summary
● Design ahead of time
○ “NoSQL” lets you play it by ear
○ but some of these decisions will bite you later
● Be willing to correct past mistakes
○ dedicate time and resources to adapting
○ learn how to live with the mistakes you can’t correct
References
● MongoDB Blog post: http://blog.mongodb.org/post/77278906988/crittercism-scaling-to-billions-of-requests-per-day-on
● MongoDB Documentation on mongos routers: http://docs.mongodb.org/master/core/sharded-cluster-query-routing/
● MongoDB Documentation on the balancer: http://docs.mongodb.org/manual/tutorial/manage-sharded-cluster-balancer/
● MongoDB Documentation on shard keys: http://docs.mongodb.org/manual/core/sharding-shard-key/
● Crittercism: http://www.crittercism.com/
Q&A
Thank You!

More Related Content

What's hot

Replacing Your Cache with ScyllaDB
Replacing Your Cache with ScyllaDBReplacing Your Cache with ScyllaDB
Replacing Your Cache with ScyllaDBScyllaDB
 
Apache Tez – Present and Future
Apache Tez – Present and FutureApache Tez – Present and Future
Apache Tez – Present and FutureDataWorks Summit
 
10 big data hadoop
10 big data hadoop10 big data hadoop
10 big data hadoopPatrick Bury
 
GCP Data Engineer cheatsheet
GCP Data Engineer cheatsheetGCP Data Engineer cheatsheet
GCP Data Engineer cheatsheetGuang Xu
 
Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowWes McKinney
 
OpenZFS novel algorithms: snapshots, space allocation, RAID-Z - Matt Ahrens
OpenZFS novel algorithms: snapshots, space allocation, RAID-Z - Matt AhrensOpenZFS novel algorithms: snapshots, space allocation, RAID-Z - Matt Ahrens
OpenZFS novel algorithms: snapshots, space allocation, RAID-Z - Matt AhrensMatthew Ahrens
 
Cassandra
Cassandra Cassandra
Cassandra Pooja GV
 
Maximizing EC2 and Elastic Block Store Disk Performance
Maximizing EC2 and Elastic Block Store Disk PerformanceMaximizing EC2 and Elastic Block Store Disk Performance
Maximizing EC2 and Elastic Block Store Disk PerformanceAmazon Web Services
 
Hadoop hive presentation
Hadoop hive presentationHadoop hive presentation
Hadoop hive presentationArvind Kumar
 
Quantitative Performance Evaluation of Cloud-Based MySQL (Relational) Vs. Mon...
Quantitative Performance Evaluation of Cloud-Based MySQL (Relational) Vs. Mon...Quantitative Performance Evaluation of Cloud-Based MySQL (Relational) Vs. Mon...
Quantitative Performance Evaluation of Cloud-Based MySQL (Relational) Vs. Mon...Darshan Gorasiya
 
Imply at Apache Druid Meetup in London 1-15-20
Imply at Apache Druid Meetup in London 1-15-20Imply at Apache Druid Meetup in London 1-15-20
Imply at Apache Druid Meetup in London 1-15-20Jelena Zanko
 
Getting Started with Amazon ElastiCache
Getting Started with Amazon ElastiCacheGetting Started with Amazon ElastiCache
Getting Started with Amazon ElastiCacheAmazon Web Services
 
All Your IOPS Are Belong To Us - A Pinteresting Case Study in MySQL Performan...
All Your IOPS Are Belong To Us - A Pinteresting Case Study in MySQL Performan...All Your IOPS Are Belong To Us - A Pinteresting Case Study in MySQL Performan...
All Your IOPS Are Belong To Us - A Pinteresting Case Study in MySQL Performan...Ernie Souhrada
 
Load Testing - How to Stress Your Odoo with Locust
Load Testing - How to Stress Your Odoo with LocustLoad Testing - How to Stress Your Odoo with Locust
Load Testing - How to Stress Your Odoo with LocustOdoo
 
Hive spark-s3acommitter-hbase-nfs
Hive spark-s3acommitter-hbase-nfsHive spark-s3acommitter-hbase-nfs
Hive spark-s3acommitter-hbase-nfsYifeng Jiang
 
A tour of Amazon Redshift
A tour of Amazon RedshiftA tour of Amazon Redshift
A tour of Amazon RedshiftKel Graham
 
Introduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingIntroduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingTill Rohrmann
 

What's hot (20)

Replacing Your Cache with ScyllaDB
Replacing Your Cache with ScyllaDBReplacing Your Cache with ScyllaDB
Replacing Your Cache with ScyllaDB
 
Log Structured Merge Tree
Log Structured Merge TreeLog Structured Merge Tree
Log Structured Merge Tree
 
Hadoop and Big Data
Hadoop and Big DataHadoop and Big Data
Hadoop and Big Data
 
Apache Tez – Present and Future
Apache Tez – Present and FutureApache Tez – Present and Future
Apache Tez – Present and Future
 
10 big data hadoop
10 big data hadoop10 big data hadoop
10 big data hadoop
 
GCP Data Engineer cheatsheet
GCP Data Engineer cheatsheetGCP Data Engineer cheatsheet
GCP Data Engineer cheatsheet
 
Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache Arrow
 
OpenZFS novel algorithms: snapshots, space allocation, RAID-Z - Matt Ahrens
OpenZFS novel algorithms: snapshots, space allocation, RAID-Z - Matt AhrensOpenZFS novel algorithms: snapshots, space allocation, RAID-Z - Matt Ahrens
OpenZFS novel algorithms: snapshots, space allocation, RAID-Z - Matt Ahrens
 
Cassandra
Cassandra Cassandra
Cassandra
 
Maximizing EC2 and Elastic Block Store Disk Performance
Maximizing EC2 and Elastic Block Store Disk PerformanceMaximizing EC2 and Elastic Block Store Disk Performance
Maximizing EC2 and Elastic Block Store Disk Performance
 
Hadoop hive presentation
Hadoop hive presentationHadoop hive presentation
Hadoop hive presentation
 
Quantitative Performance Evaluation of Cloud-Based MySQL (Relational) Vs. Mon...
Quantitative Performance Evaluation of Cloud-Based MySQL (Relational) Vs. Mon...Quantitative Performance Evaluation of Cloud-Based MySQL (Relational) Vs. Mon...
Quantitative Performance Evaluation of Cloud-Based MySQL (Relational) Vs. Mon...
 
Imply at Apache Druid Meetup in London 1-15-20
Imply at Apache Druid Meetup in London 1-15-20Imply at Apache Druid Meetup in London 1-15-20
Imply at Apache Druid Meetup in London 1-15-20
 
Apache Flume
Apache FlumeApache Flume
Apache Flume
 
Getting Started with Amazon ElastiCache
Getting Started with Amazon ElastiCacheGetting Started with Amazon ElastiCache
Getting Started with Amazon ElastiCache
 
All Your IOPS Are Belong To Us - A Pinteresting Case Study in MySQL Performan...
All Your IOPS Are Belong To Us - A Pinteresting Case Study in MySQL Performan...All Your IOPS Are Belong To Us - A Pinteresting Case Study in MySQL Performan...
All Your IOPS Are Belong To Us - A Pinteresting Case Study in MySQL Performan...
 
Load Testing - How to Stress Your Odoo with Locust
Load Testing - How to Stress Your Odoo with LocustLoad Testing - How to Stress Your Odoo with Locust
Load Testing - How to Stress Your Odoo with Locust
 
Hive spark-s3acommitter-hbase-nfs
Hive spark-s3acommitter-hbase-nfsHive spark-s3acommitter-hbase-nfs
Hive spark-s3acommitter-hbase-nfs
 
A tour of Amazon Redshift
A tour of Amazon RedshiftA tour of Amazon Redshift
A tour of Amazon Redshift
 
Introduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingIntroduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processing
 

Similar to Scaling to 30,000 Requests Per Second and Beyond with MongoDB

Occupational Health and Safety
Occupational Health and SafetyOccupational Health and Safety
Occupational Health and Safetyaeromarine
 
OHSNETbase Presentation
OHSNETbase PresentationOHSNETbase Presentation
OHSNETbase Presentationaeromarine
 
Expecto Performa! The Magic and Reality of Performance Tuning
Expecto Performa! The Magic and Reality of Performance TuningExpecto Performa! The Magic and Reality of Performance Tuning
Expecto Performa! The Magic and Reality of Performance TuningAtlassian
 
Data oriented design and c++
Data oriented design and c++Data oriented design and c++
Data oriented design and c++Mike Acton
 
Ruby3x3: How are we going to measure 3x
Ruby3x3: How are we going to measure 3xRuby3x3: How are we going to measure 3x
Ruby3x3: How are we going to measure 3xMatthew Gaudet
 
Moved to https://slidr.io/azzazzel/web-application-performance-tuning-beyond-xmx
Moved to https://slidr.io/azzazzel/web-application-performance-tuning-beyond-xmxMoved to https://slidr.io/azzazzel/web-application-performance-tuning-beyond-xmx
Moved to https://slidr.io/azzazzel/web-application-performance-tuning-beyond-xmxMilen Dyankov
 
OpenNebulaConf 2013 - Monitoring of OpenNebula installations by Florian Heigl
OpenNebulaConf 2013 - Monitoring of OpenNebula installations by Florian Heigl OpenNebulaConf 2013 - Monitoring of OpenNebula installations by Florian Heigl
OpenNebulaConf 2013 - Monitoring of OpenNebula installations by Florian Heigl OpenNebula Project
 
Monitoring of OpenNebula installations
Monitoring of OpenNebula installationsMonitoring of OpenNebula installations
Monitoring of OpenNebula installationsNETWAYS
 
Practical Code & Data Design
Practical Code & Data DesignPractical Code & Data Design
Practical Code & Data DesignHenryRose9
 
Monitoring your Python with Prometheus (Python Ireland April 2015)
Monitoring your Python with Prometheus (Python Ireland April 2015)Monitoring your Python with Prometheus (Python Ireland April 2015)
Monitoring your Python with Prometheus (Python Ireland April 2015)Brian Brazil
 
Monitoring Big Data Systems Done "The Simple Way" - Demi Ben-Ari - Codemotion...
Monitoring Big Data Systems Done "The Simple Way" - Demi Ben-Ari - Codemotion...Monitoring Big Data Systems Done "The Simple Way" - Demi Ben-Ari - Codemotion...
Monitoring Big Data Systems Done "The Simple Way" - Demi Ben-Ari - Codemotion...Codemotion
 
Monitoring Big Data Systems "Done the simple way" - Demi Ben-Ari - Codemotion...
Monitoring Big Data Systems "Done the simple way" - Demi Ben-Ari - Codemotion...Monitoring Big Data Systems "Done the simple way" - Demi Ben-Ari - Codemotion...
Monitoring Big Data Systems "Done the simple way" - Demi Ben-Ari - Codemotion...Demi Ben-Ari
 
THE RISE AND FALL OF SERVERLESS COSTS - TAMING THE (SERVERLESS) BEAST
THE RISE AND FALL OF SERVERLESS COSTS - TAMING THE (SERVERLESS) BEASTTHE RISE AND FALL OF SERVERLESS COSTS - TAMING THE (SERVERLESS) BEAST
THE RISE AND FALL OF SERVERLESS COSTS - TAMING THE (SERVERLESS) BEASTOpher Dubrovsky
 
MongoDB World 2018: Tutorial - MongoDB Meets Chaos Monkey
MongoDB World 2018: Tutorial - MongoDB Meets Chaos MonkeyMongoDB World 2018: Tutorial - MongoDB Meets Chaos Monkey
MongoDB World 2018: Tutorial - MongoDB Meets Chaos MonkeyMongoDB
 
Delta: Building Merge on Read
Delta: Building Merge on ReadDelta: Building Merge on Read
Delta: Building Merge on ReadDatabricks
 
Ekon24 from Delphi to AVX2
Ekon24 from Delphi to AVX2Ekon24 from Delphi to AVX2
Ekon24 from Delphi to AVX2Arnaud Bouchez
 
Scaling Crittercism to 30,000 Requests Per Second and Beyond with MongoDB
Scaling Crittercism to 30,000 Requests Per Second and Beyond with MongoDBScaling Crittercism to 30,000 Requests Per Second and Beyond with MongoDB
Scaling Crittercism to 30,000 Requests Per Second and Beyond with MongoDBMongoDB
 
Systems Monitoring with Prometheus (Devops Ireland April 2015)
Systems Monitoring with Prometheus (Devops Ireland April 2015)Systems Monitoring with Prometheus (Devops Ireland April 2015)
Systems Monitoring with Prometheus (Devops Ireland April 2015)Brian Brazil
 
MongoDB at Baidu
MongoDB at BaiduMongoDB at Baidu
MongoDB at BaiduMat Keep
 
Data Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFixData Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFixC4Media
 

Similar to Scaling to 30,000 Requests Per Second and Beyond with MongoDB (20)

Occupational Health and Safety
Occupational Health and SafetyOccupational Health and Safety
Occupational Health and Safety
 
OHSNETbase Presentation
OHSNETbase PresentationOHSNETbase Presentation
OHSNETbase Presentation
 
Expecto Performa! The Magic and Reality of Performance Tuning
Expecto Performa! The Magic and Reality of Performance TuningExpecto Performa! The Magic and Reality of Performance Tuning
Expecto Performa! The Magic and Reality of Performance Tuning
 
Data oriented design and c++
Data oriented design and c++Data oriented design and c++
Data oriented design and c++
 
Ruby3x3: How are we going to measure 3x
Ruby3x3: How are we going to measure 3xRuby3x3: How are we going to measure 3x
Ruby3x3: How are we going to measure 3x
 
Moved to https://slidr.io/azzazzel/web-application-performance-tuning-beyond-xmx
Moved to https://slidr.io/azzazzel/web-application-performance-tuning-beyond-xmxMoved to https://slidr.io/azzazzel/web-application-performance-tuning-beyond-xmx
Moved to https://slidr.io/azzazzel/web-application-performance-tuning-beyond-xmx
 
OpenNebulaConf 2013 - Monitoring of OpenNebula installations by Florian Heigl
OpenNebulaConf 2013 - Monitoring of OpenNebula installations by Florian Heigl OpenNebulaConf 2013 - Monitoring of OpenNebula installations by Florian Heigl
OpenNebulaConf 2013 - Monitoring of OpenNebula installations by Florian Heigl
 
Monitoring of OpenNebula installations
Monitoring of OpenNebula installationsMonitoring of OpenNebula installations
Monitoring of OpenNebula installations
 
Practical Code & Data Design
Practical Code & Data DesignPractical Code & Data Design
Practical Code & Data Design
 
Monitoring your Python with Prometheus (Python Ireland April 2015)
Monitoring your Python with Prometheus (Python Ireland April 2015)Monitoring your Python with Prometheus (Python Ireland April 2015)
Monitoring your Python with Prometheus (Python Ireland April 2015)
 
Monitoring Big Data Systems Done "The Simple Way" - Demi Ben-Ari - Codemotion...
Monitoring Big Data Systems Done "The Simple Way" - Demi Ben-Ari - Codemotion...Monitoring Big Data Systems Done "The Simple Way" - Demi Ben-Ari - Codemotion...
Monitoring Big Data Systems Done "The Simple Way" - Demi Ben-Ari - Codemotion...
 
Monitoring Big Data Systems "Done the simple way" - Demi Ben-Ari - Codemotion...
Monitoring Big Data Systems "Done the simple way" - Demi Ben-Ari - Codemotion...Monitoring Big Data Systems "Done the simple way" - Demi Ben-Ari - Codemotion...
Monitoring Big Data Systems "Done the simple way" - Demi Ben-Ari - Codemotion...
 
THE RISE AND FALL OF SERVERLESS COSTS - TAMING THE (SERVERLESS) BEAST
THE RISE AND FALL OF SERVERLESS COSTS - TAMING THE (SERVERLESS) BEASTTHE RISE AND FALL OF SERVERLESS COSTS - TAMING THE (SERVERLESS) BEAST
THE RISE AND FALL OF SERVERLESS COSTS - TAMING THE (SERVERLESS) BEAST
 
MongoDB World 2018: Tutorial - MongoDB Meets Chaos Monkey
MongoDB World 2018: Tutorial - MongoDB Meets Chaos MonkeyMongoDB World 2018: Tutorial - MongoDB Meets Chaos Monkey
MongoDB World 2018: Tutorial - MongoDB Meets Chaos Monkey
 
Delta: Building Merge on Read
Delta: Building Merge on ReadDelta: Building Merge on Read
Delta: Building Merge on Read
 
Ekon24 from Delphi to AVX2
Ekon24 from Delphi to AVX2Ekon24 from Delphi to AVX2
Ekon24 from Delphi to AVX2
 
Scaling Crittercism to 30,000 Requests Per Second and Beyond with MongoDB
Scaling Crittercism to 30,000 Requests Per Second and Beyond with MongoDBScaling Crittercism to 30,000 Requests Per Second and Beyond with MongoDB
Scaling Crittercism to 30,000 Requests Per Second and Beyond with MongoDB
 
Systems Monitoring with Prometheus (Devops Ireland April 2015)
Systems Monitoring with Prometheus (Devops Ireland April 2015)Systems Monitoring with Prometheus (Devops Ireland April 2015)
Systems Monitoring with Prometheus (Devops Ireland April 2015)
 
MongoDB at Baidu
MongoDB at BaiduMongoDB at Baidu
MongoDB at Baidu
 
Data Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFixData Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFix
 

Recently uploaded

Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 

Recently uploaded (20)

Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Editor's Notes

  1. I’m going to tell you the story of how we’ve scaled to handle over 30k req/s using a storage strategy based on MongoDB
  2. Between proposing this talk and now, we’ve actually grown some more, and now top 40-45k req/s on a daily basis. This is about 3.5B requests per day.
  3. This is a preview of a talk I’ll be giving at MongoDB World, June 23-25 in NYC. You can still register.
  4. and of course Crittercism will be there
  5. some advice from our experience about things to do and things not to do
  6. I’ll be sure to leave time for Q&A
  7. I’ll tell you how Crittercism got started, some of the lessons we’ve learned along the way, and some advice we can share based on those experiences
  8. September 2010 (from the Wayback Machine). Started as a “feedback widget”: enable mobile app developers to let their users provide “criticism” of their apps (outside of the app store), not just a star rating.
  9. this is pretty easy - set up a (mongo) db, put an api in front of it, collect user feedback from our SDK
  10. added more types of data we collect
  11. volume starts getting large, so let’s count app loads in a memory-based data store (redis), and persist it to mongo
  12. then we added user metadata as well, but that’s a different kind of data and a different volume and access pattern, so let’s add dynamodb into the mix
  13. our volume keeps going up, so let’s cache this app data to make our responses faster
  14. then we added APM, which introduced a lot of different data types and structures so we added another ingest API and postgres into the mix (but obviously we’re not going to talk about that part here…)
  15. today (2014) - what it’s evolved into: collecting tons of detailed analytics data (crash reports, groupings). Geo data launched in 2013 (just kidding, this is stored in postgres). iPad app launched in 2014, with more aggregations of performance data (more ways to view it).
  16. lots to deal with... So we started as a way for people to “criticize” your apps; then we helped you catch bugs, so we’re the ones doing the “criticism”.
  17. so how do we handle 40k/s on mongodb?
  18. we don’t, but that’s our ingest rate, and most of it ends up in mongodb. The takeaway here is to be willing to use whatever works.
  19. over a 2-year period we went from 700 req/s (60M/day) to 40-45k req/s (3.8B/day)
  20. one of the biggest things we did to help ourselves scale was to consolidate the mongos routers
  21. default, first-pass architecture (for a sharded cluster): one mongos per client machine. Each client process connects to a local mongos router; each mongos routes queries and returns results.
  22. could mean your application is reading stale data, or can’t find the data it needs when it needs it (and maybe it has to retry, which means it’s now slower)
  23. move the mongos routers to their own tier, and be smart about how you route to them (we use chef to keep it within the same AZ)
  24. be aware that this does introduce some disadvantages, too
  25. This is a fundamental design decision that will have huge implications for a long time, so think about it carefully.
  26. Hard (impossible) to change after the fact!
  27. Say you have 4 shards. Let’s say each of the NHL teams that made the playoffs this year has an app, and we shard by app_id.
  28. Say you have 4 shards. Let’s say each of the NHL teams that made the playoffs this year has an app, and we shard by app_id. Let’s distribute them evenly, as is likely to be the case (assuming a sufficiently randomly-generated app_id)
  29. this looks nice and even, right?
  30. So now it’s time for the Western Conference Finals, and the Blackhawks are playing the Kings
  31. So those 2 apps are going to get heavy use, but they’re on the same shard, so uh-oh...
  32. Now this shard isn’t happy: higher load and slower response times for queries to this shard (which are your most common queries due to these apps’ popularity).
  33. so let’s add another shard
  34. That might help if we have more teams’ apps to add
  35. Those new apps had somewhere to go, to keep our cluster balanced. But this hasn’t helped our uneven access pattern at all.
  36. Only option now is to vertically scale the problem shard
  37. And hopefully that cools it off, but now we have an uneven cluster to manage. And what happens next year, when it’s two different teams in the conference finals? Maybe we get lucky and they’re on different shards… but even then, maybe the access is uneven enough that those 2 shards still get hot. So maybe you just live with this and have heterogeneous shard servers. (This is probably a much lesser evil than trying to re-shard.) Lesson: you’re going to have to live with the shard key you choose, so choose wisely! Another option might’ve been to spread data for each app_id across all shards, but then your queries will likely be slower (due to having to read from many/all shards). It’s a trade-off.
  38. The balancer is a super-important part of a sharded mongo cluster… You should love it.
  39. Start with an empty cluster, and start filling it with data (we’ll denote “fullness” by going from green to red). This is an example of what can happen when the balancer is not running.
  40. Okay, so now we have a very unbalanced cluster. 3 of our replica sets are very full, one is pretty full, and the newest one is hardly in use. (remember that the balancer isn’t running in this scenario)
  41. The balancer will see the full shards and one near-empty one, and will want to move a ton of chunks all at once, causing severe I/O strain on the system. (no way to tell the balancer to chill)
  42. remember that all of these chunk moves cause updates to your configdb, which places load on your config servers and has to propagate to all mongos routers, too
  43. you’re going to be adding a lot of I/O to the system when you move chunks, and it still has to be able to perform its normal functions, so over-provision. We’re in AWS so we just go for PIOPS… but if you’re on physical hardware, consider RAIDing wider, or upgrading your SAN, or...
  44. updating the configdb (when you move chunks) puts load on your config servers, so make sure they’re ready to handle it
  45. this is tedious and will take a LONG time (more detail in a minute)
  46. gradually you’ll get to a happier place
  47. take a deep breath before you...
  48. be ready to turn it off and return to step 3 if needed, then try again
  49. (this was step 3)
  50. here’s an example from our “rawcrashlog” collection (hash and _id truncated)
  51. start both commands running on both the source and target
  52. don’t need to specify source shard, since your shard key (unsymbolized_hash in our case) and _id are sufficient for mongo to know where it’s coming from
  53. watch your monitoring (iostat/mongostat) -- look for spikes in page faults, queued reads/writes, database lock percentages. Obviously look at your application monitoring too, to ensure no adverse effects. Use MMS as well (e.g., lock %, page faults). If everything looks good, keep going. If not, you need to start over with more IOPS, more config server capacity, etc.
  54. seems obvious, but not always the case. and if you’re not running it, you can embark on this tedious journey to get it running again.
  55. best-case scenario is to make all of the right choices up front… but you’re probably not going to do that. (though hopefully you can learn a bit from our experience and minimize the wrong choices you make). the good news is MongoDB is still working for us, despite the headaches we’ve had to deal with.
  56. reminder that MongoDB World is right around the corner. Along with all of these great presenters, I’ll be giving a version of this talk there, and would love to meet you.
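
A quick check of the throughput figures in notes 2 and 19: the sketch below converts a sustained request rate into requests per day. It assumes the rate holds constant for a full 24 hours, which real traffic does not, so treat the results as rough upper bounds. The function name is made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def requests_per_day(requests_per_second: float) -> int:
    """Convert a sustained request rate (req/s) into requests per day."""
    return int(requests_per_second * SECONDS_PER_DAY)


if __name__ == "__main__":
    # Rates quoted in the notes: 700 req/s two years ago, 40-45k req/s today.
    for rate in (700, 40_000, 45_000):
        print(f"{rate:>6,} req/s -> {requests_per_day(rate):,} req/day")
```

Under that assumption, 700 req/s is about 60M requests per day, and 40-45k req/s spans roughly 3.5B to 3.9B per day, which brackets both the 3.5B and 3.8B figures quoted in the notes.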
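
The hot-shard scenario in notes 27-37 can be sketched with a toy model. Everything here is hypothetical: the app names, the request rates, and the round-robin placement that stands in for how a randomly generated app_id would spread across shards. It is a simplification of the idea, not a model of how MongoDB places chunks.

```python
from typing import Dict, List


def load_per_shard(rates: Dict[str, int], placement: Dict[str, int],
                   num_shards: int) -> List[int]:
    """Sum the request rate (req/s) of every app placed on each shard."""
    totals = [0] * num_shards
    for app_id, rate in rates.items():
        totals[placement[app_id]] += rate
    return totals


# Hypothetical data: 16 apps at 100 req/s each, spread evenly over 4 shards.
app_ids = [f"app_{i:02d}" for i in range(16)]
rates = {app_id: 100 for app_id in app_ids}
placement = {app_id: i % 4 for i, app_id in enumerate(app_ids)}
baseline = load_per_shard(rates, placement, 4)

# Two apps that happen to share shard 1 get heavy use at the same time.
rates["app_01"] = 2000
rates["app_05"] = 2000
playoffs = load_per_shard(rates, placement, 4)

# Add a fifth shard; only new apps land there, existing apps stay put.
for i in range(16, 20):
    rates[f"app_{i:02d}"] = 100
    placement[f"app_{i:02d}"] = 4
after_new_shard = load_per_shard(rates, placement, 5)

print(baseline)         # [400, 400, 400, 400]
print(playoffs)         # [400, 4200, 400, 400]
print(after_new_shard)  # [400, 4200, 400, 400, 400]
```

Every shard holds the same number of apps throughout, yet shard 1 carries about ten times the load of the others, and the new shard does nothing to relieve it. That is the gap between balanced data and balanced access that the notes warn about.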
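
Notes 45-53 describe moving chunks by hand, slowly, while watching monitoring. A minimal sketch of that loop follows, assuming the `moveChunk` admin command with its `find` and `to` fields. The database name, shard name, key values, and both helper functions are hypothetical placeholders; the health check is whatever your own monitoring provides. The command runner is passed in, so the sketch itself does not need a live cluster.

```python
import time
from typing import Any, Callable, Dict, Iterable


def build_move_chunk(namespace: str, shard_key_value: Dict[str, Any],
                     to_shard: str) -> Dict[str, Any]:
    """Build a moveChunk command document.

    No source shard is given: the shard key value in `find` identifies the
    chunk, and the cluster already knows where that chunk lives.
    """
    # The command name must be the first key; dicts keep insertion order.
    return {"moveChunk": namespace, "find": shard_key_value, "to": to_shard}


def move_chunks_gradually(commands: Iterable[Dict[str, Any]],
                          run_command: Callable[[Dict[str, Any]], Any],
                          is_healthy: Callable[[], bool],
                          pause_seconds: float = 0.0) -> int:
    """Run one move at a time and stop as soon as monitoring looks bad."""
    moved = 0
    for command in commands:
        if not is_healthy():  # e.g. page faults, queued ops, lock % spiking
            break
        run_command(command)
        moved += 1
        time.sleep(pause_seconds)  # give the cluster room to recover
    return moved


if __name__ == "__main__":
    # Placeholder values only; the real hash and _id are truncated in the talk.
    example = build_move_chunk(
        "mydb.rawcrashlog",
        {"unsymbolized_hash": "<hash>", "_id": "<id>"},
        "shard_new",
    )
    # Dry run: print instead of sending the command to a mongos.
    move_chunks_gradually([example], print, lambda: True)
```

With pymongo, `run_command` could be `MongoClient(...).admin.command` pointed at a mongos; that wiring is an assumption and is left out so the example runs without a database.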