Tim Vaillancourt
Sr. Technical Operations Architect
Monitoring MongoDB’s Engines in the Wild
About Me
• Joined Percona in January 2016
• Sr Technical Operations Architect for MongoDB
• Previous:
• EA DICE (MySQL DBA)
• EA SPORTS (Sys/NoSQL DBA Ops)
• Amazon/AbeBooks Inc (Sys/MySQL+NoSQL DBA Ops)
• Main techs: MySQL, MongoDB, Cassandra, Solr, Redis, queues, etc
• 10+ years tuning Linux for database workloads (off and on)
• Monitoring techs
• Nagios
• MRTG
• Munin
• Zabbix
• Cacti
• Graphite
• Prometheus
Storage Engines
• MMAPv1
• Mostly done by Linux kernel
• WiredTiger
• Default as of 3.2
• Percona In-Memory
• Same metrics as WiredTiger
• RocksDB
• PerconaFT / TokuMX
• Deprecated
• Fractal-tree based storage engine
Storage Engines?! The New SE API
• Introduced in MongoDB 3.0
• Abstraction layer for storage-level interaction
• Allowed integration of WiredTiger and other features
Storage Engines: MMAPv1
• Default storage engine < 3.2 (now WiredTiger)
• Collection-level locking (common performance bottleneck)
• Monitored via Lock Ratio/Percent metrics
• In-place datafile updating (when possible)
• OS-level operations
• Uses OS-level mmap() to map BSON files on disk <=> memory
• Uses OS-level filesystem cache as block cache
• Much lower monitoring visibility
• Database metrics must be gathered at the OS level
• OS-level metrics are more vague
Storage Engines: MMAPv1
• Document read path
• Try to load from cache
• If not in cache, load from BSON file on disk
• Document update/write path
• Try to update document in-place
• If too big, “move” document on disk until free space is found
Storage Engines: WiredTiger
• New default engine as of 3.2
• Standalone LSM engine acquired by MongoDB Inc
• B-tree based under MongoDB
• Integrated using the Storage Engine API
• Document-level locking
• Built-in compression
• Index prefix compression
• MVCC and concurrency limits
• High parallelism / CPU utilisation
Storage Engines: WiredTiger
• Document Write Path
• Update, delete or write is written to the WT log
• Changes to data files are performed by checkpointing later
• Document Read Path
• Looks for data in the in-heap cache
• Looks for data in the WT log
• Goes to data files for the data
• Kernel will look in the filesystem cache and uncompress the result if it exists
• If not in the FS cache, read from disk and uncompress the result
• Switch compression algorithms if CPU usage is too high
Storage Engines: RocksDB / MongoRocks
• MongoRocks developed by
• Tiered level compaction strategy
• First layer is called the MemTable
• N number of on-disk levels
• Compaction is triggered when any level is full
• In-heap Block Cache (default 30% RAM)
• Holds uncompressed data
• BlockCache reduces compression CPU hit
• Kernel-level Page Cache for compressed data
• Space amplification of LSM is about +10%
• Optional ‘counters’: storage.rocksdb.counters
Storage Engines: RocksDB / MongoRocks
• Document Write path
• Updates, Deletes and Writes go to the MemTable and complete
• Compaction resolves multiple versions of data in the background
• Document Read path
• Looks for data in the MemTable
• Levels 0 to N are asked for the data
• Data is read from the filesystem cache, if present, then uncompressed
• Or, a bloom filter is used to find the data file, then data is read and uncompressed
Storage Engines: RocksDB / MongoRocks
• Watch for
• Pending compactions
• Stalls
• Indicates the compaction system is overwhelmed, possibly due to I/O
• Level Read Latencies
• If high, disk throughput may be too low
• Rate of compaction in bytes vs any noticeable slowdown
• Rate of deletes vs read latency
• Deletes add expense to reads and compaction
Metric Sources: operationProfiling
• Logs slow database operations to a MongoDB collection for analysis
• Capped collection: “system.profile” in each database, default 1MB
• The collection is capped, ie: profile data doesn’t last forever
• Support for operationProfiling data in Percona Monitoring and Management is among current/future goals
• Enable operationProfiling in “slowOp” mode
• Start with a very high threshold and decrease it in steps
• Usually 50-100ms is a good threshold
• Enable in mongod.conf (or at runtime, as sketched below)
operationProfiling:
  slowOpThresholdMs: 100
  mode: slowOp
Or the command-line way…
mongod <other-flags> --profile 1 --slowms 100
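Profiling can also be toggled at runtime from the mongo shell, without touching the config file; a minimal sketch (the database name and the 100ms threshold are only examples):
// Runtime alternative to the config-file settings above
use mydb
db.setProfilingLevel(1, 100)   // 1 = profile slow operations only, 100 = slowms threshold
db.getProfilingStatus()        // confirm, e.g. { "was" : 1, "slowms" : 100 }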
Metric Sources: operationProfiling
• op/ns/query: type, namespace and query of a profile
• keysExamined: # of index keys examined
• docsExamined: # of docs examined to achieve result
• writeConflicts: # of WCE encountered during update
• numYields: # of times operation yielded for others
• locks: detailed lock statistics
Metric Sources: operationProfiling
• nreturned: # of documents returned by the operation
• nmoved: # of documents moved on disk by the operation
• ndeleted/ninserted/nMatched/nModified: self-explanatory
• responseLength: the byte-length of the server response
• millis: execution time in milliseconds
• execStats: detailed statistics explaining the query’s execution steps
• SHARDING_FILTER = mongos sharded query
• COLLSCAN = no index, 35k docs examined(!)
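The profile collection itself can be queried like any other; a small shell sketch of checks built on the fields above (the thresholds are illustrative):
// Five slowest profiled queries in the current database
db.system.profile.find({ op: "query" }).sort({ millis: -1 }).limit(5).pretty()
// Recent operations that examined an unusually large number of documents
db.system.profile.find({ docsExamined: { $gt: 10000 } }).sort({ ts: -1 }).limit(5)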
Metric Sources: db.serverStatus()
• A function that dumps status info about the current state of MongoDB
• Think “SHOW FULL STATUS” + “SHOW ENGINE INNODB STATUS”
• Sections
• Asserts
• backgroundFlushing
• connections
• dur (durability)
• extra_info
• globalLock + locks
• network
• opcounters
• opcountersRepl
• repl (replication)
• storageEngine
• mem (memory)
• metrics
• (Optional) wiredTiger
• (Optional) rocksdb
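Individual sections can be pulled out directly in the shell, which is handy for ad-hoc checks or for feeding a custom collector; a brief sketch:
db.serverStatus().opcounters        // insert/query/update/delete/getmore/command counters
db.serverStatus().globalLock        // currently queued/active readers and writers
db.serverStatus().mem               // resident/virtual (and, for MMAPv1, mapped) memory
db.serverStatus().wiredTiger.cache  // present only when running WiredTiger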
Metric Sources: db.serverStatus().rocksdb
Metric Sources: db.serverStatus().wiredTiger
• ‘block-manager’: disk reads/writes
• ‘cache’: in-heap page cache
• Watch eviction of modified vs unmodified pages
• ‘cursor’: WiredTiger cursor ops/calls
• ‘log’: WiredTiger log stats
• ‘transaction’: checkpoint and transaction info
• Watch max/min/avg checkpoint times
• ‘concurrentTransactions’: concurrency ticket info (!)
• Increased with an engine parameter
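The ticket counts can be checked, and if necessary raised, from the shell; a hedged sketch (256 is just an example value; the defaults are 128 each):
// "available" dropping toward 0 means operations are queueing on tickets
db.serverStatus().wiredTiger.concurrentTransactions
db.adminCommand({ setParameter: 1, wiredTigerConcurrentReadTransactions: 256 })
db.adminCommand({ setParameter: 1, wiredTigerConcurrentWriteTransactions: 256 })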
Metric Sources: rs.status()
• A function that dumps replication status
• Think “SHOW MASTER STATUS” or “SHOW SLAVE STATUS”
• Contains
• Replica set name and term
• Member status
• State
• Optime state
• Election state
• Heartbeat state
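Lag can be derived from rs.status() by comparing each member’s optime against the PRIMARY’s; a rough shell sketch:
var s = rs.status();
var primary = s.members.filter(function (m) { return m.stateStr === "PRIMARY"; })[0];
s.members.forEach(function (m) {
  // optimeDate is a Date, so the difference is in milliseconds
  print(m.name + " lag: " + (primary.optimeDate - m.optimeDate) / 1000 + "s");
});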
Metric Sources: Cluster Metadata
• The “config” database on Cluster Config servers
• Contains
• actionlog (3.0+)
• changelog
• databases
• collections
• shards
• chunks
• settings
• mongos
• locks
• lockpings
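These collections can be queried through a mongos (or on the config servers) for cluster-level monitoring; a small sketch, e.g. chunk distribution per shard:
// A heavily skewed chunk count per shard suggests balancer trouble
db.getSiblingDB("config").chunks.aggregate([
  { $group: { _id: "$shard", chunks: { $sum: 1 } } }
])
// Recent split/move/balancer activity
db.getSiblingDB("config").changelog.find().sort({ time: -1 }).limit(5)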
Metric Sources: db.currentOp()
• A function that dumps status info about running operations and various lock/execution details
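currentOp() accepts a filter document, which makes it useful for spotting long-running or lock-bound operations; an illustrative sketch (the 5-second threshold is arbitrary):
// Active operations running for more than 5 seconds
db.currentOp({ active: true, secs_running: { $gt: 5 } })
// Operations currently waiting on a lock
db.currentOp({ waitingForLock: true })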
Metric Sources: Log Files
• Interesting details are logged to the mongod/mongos log files
• Slow queries
• Storage engine details (sometimes)
• Index operations
• Chunk moves
• Connections
Monitoring: Percona PMM
• Open-source monitoring from Percona!
• Based on open-source technology
• Simple deployment
• Examples in this demo are from PMM
• 800+ metrics per ping
Monitoring: Prometheus + Grafana
• Percona-Lab GitHub
• grafana_mongodb_dashboards for Grafana
• prometheus_mongodb_exporter for Prometheus
• Sources
• db.serverStatus()
• rs.status()
• sh.status()
• Config-server metadata
• Others and more soon..
• Supports MMAPv1, WT and RocksDB
• node_exporter for Prometheus
• OS-level (mostly Linux) exporter
Usual Performance Suspects
• Locking
• Collection-level locks
• Document-level locks
• Software mutex/semaphore
• Limits
• Max connections
• Operation rate limits
• Resource limits
• Resources
• Lack of IOPS, RAM, CPU, network, etc
MongoDB Resources and Consumers
• CPU
• System CPU
• FS cache
• Networking
• Disk I/O
• Threading
• User CPU (MongoDB)
• Compression (WiredTiger and RocksDB)
• Session Management
• BSON (de)serialisation
• Filtering / scanning / sorting
• Optimiser
• Disk
• Data file read/writes
• Journaling
• Error logging
• Network
• Query request/response
• Replication
High-Level OS Resources
• CPU
• CPU Load Averages
• thread-per-connection
• User vs System CPU
• System is kernel-level
• User is usually Mongo
• IOWAIT
• Can also include network waits
• IO Time Spent
• “The canary in the coal mine”
High-Level OS Resources
• Process Count
• 1 connection = 1 fork()
• Context Switches
• High switches can == too few CPUs
• Memory
• True used % without caches/buffers
• Cached / Buffers
• Needed for block-caching
• Disk
• Free space percent(!)
• LSM trees use more disk
MMAPv1: Page Faults
• Linux/Operating System
• Data pages in RAM are swapped to disk due to no free memory
• MongoDB MMAPv1
• Data is read/written to data file blocks that are not in RAM
• Some page faults are expected but a high rate is suspicious
• A high rate often indicates:
• A working set too large for RAM (or cache size)
• Inefficient patterns (eg: missing index)
• Too many indices vs updates
• A cold-focused access pattern
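mongod keeps its own page-fault counter in serverStatus (on Linux), which can be trended against the OS numbers; a quick sketch:
// Cumulative page faults as counted by mongod
db.serverStatus().extra_info.page_faults
// Approximate fault rate over a 10-second window
var a = db.serverStatus().extra_info.page_faults;
sleep(10000);
print("page faults/sec: " + (db.serverStatus().extra_info.page_faults - a) / 10);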
MMAPv1: Lock Ratio / Percent
• MMAPv1
• Lock Ratio/Percent indicates the rate of collection-level locking
• ‘db.serverStatus().globalLock.ratio’ in older versions
• ‘db.serverStatus().locks’ in newer versions
• RocksDB and WiredTiger
• Global, DB and Collection locks are “intent” locks/non-blocking
MMAPv1: Fragmentation
• Can cause serious slowdowns on scans, range queries, etc
• db.<collection>.stats()
• Shows various storage info for a collection
• Fragmentation can be computed by dividing ‘storageSize’ by ‘size’, as sketched below
• Any value > 1 indicates fragmentation
• Compact when you near a value of 2 by rebuilding secondaries or using the ‘compact’ command
• WiredTiger and RocksDB have little/no fragmentation
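A minimal shell sketch of that calculation (the collection name is a placeholder):
// storageSize/size near 1 = compact; values approaching 2 warrant a rebuild/compact
var s = db.getCollection("mycoll").stats();
print("fragmentation ratio: " + (s.storageSize / s.size).toFixed(2));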
MMAPv1: Background Flushing
• Stats on the count/time taken to flush in the background
• If ‘average_ms’ grows continuously, writes will eventually go direct to disk based on:
• Linux sysctl ‘vm.dirty_ratio’
• Writes go to disk if dirty page ratio exceeds this number
• Linux sysctl ‘vm.dirty_background_ratio’
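The flush stats themselves live under serverStatus; a quick way to watch them from the shell (MMAPv1 only):
// 'average_ms' trending upward is the warning sign described above
db.serverStatus().backgroundFlushing
// e.g. { flushes: ..., total_ms: ..., average_ms: ..., last_ms: ..., last_finished: ... }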
Rollbacks
• BSON files are written to a ‘rollback’ dir on disk when a PRIMARY crashes while ahead of its SECONDARYs
• Monitor for these files existing
WiredTiger + RocksDB: Checkpoints/Compactions
• Moves changes to real data files
• Causes a massive spike in disk I/O
• Monitor in combination with
• CPU IOWAIT %
• Disk IO Time Spent
Replication Lag and Oplog Time Range
• Replication in MongoDB is lightweight BUT it is single threaded
• Shard for more replication throughput
• Replication Lag/Delay
• Subtract PRIMARY and SECONDARY ‘optime’
• Oplog Time Range (see the sketch below)
• Length of oplog from start -> finish
• Equal to the amount of time to rebuild a node without needing a full re-sync!
• More oplog changes == shorter time range
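Both numbers are available from shell helpers when run against a replica set member; a brief sketch:
// Oplog window in seconds: time between the first and last oplog entries
db.getReplicationInfo().timeDiff
db.printReplicationInfo()        // oplog size and time range on this node
db.printSlaveReplicationInfo()   // per-secondary lag behind the primary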
Scanned and Moved
• Indicates random read or write I/O
• Scanned
• Number of documents/objects scanned
• A high rate indicates inefficient query patterns, lack of indices, etc
• Moved
• Usually happens in MMAPv1 only
• Document is too big to be written in-place and is moved elsewhere
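Both counters are exposed under serverStatus().metrics; a short sketch of where to look:
// Cumulative scan counters from the query system
db.serverStatus().metrics.queryExecutor.scanned         // index entries scanned
db.serverStatus().metrics.queryExecutor.scannedObjects  // documents scanned
// Documents moved on disk (MMAPv1)
db.serverStatus().metrics.record.moves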
Network
• Max connections
• Ensure the maximum available connections is not exceeded
• 1 connection = roughly 1MB of RAM!
• Consider connection pools if too many connections are needed
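Connection headroom is reported by serverStatus; a one-line sketch:
// current + available = the configured connection limit; totalCreated is cumulative
db.serverStatus().connections
// e.g. { "current" : 52, "available" : 51148, "totalCreated" : NumberLong(1282) }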
Low-level OS Resources
• Linux Virtual Memory
• vm.swappiness vs swapping rate
• vm.dirty_ratio vs op latency
• Consider lowering to match RAID controller
• Filesystem cached vs Block-device read-ahead
• Linux Network Stack
• Throughput vs total capacity
• SYN Backlogs for TCP
• TIME_WAIT connections
• Network errors/retransmit
• Disk
• Average wait time
• Percent utilisation
High-level Monitoring Tips
• Polling Frequency
• A lot can happen in 1-10 seconds!
• History
• Have another app/launch to compare with
• Annotate maintenances, launches, DDoS, important events
• What to Monitor
• Fetch more than you graph, there’s no time machine
• (IMHO) monitor until it hurts, then just a bit less than that
Questions?
DATABASE PERFORMANCE MATTERS