Troubleshooting Cassandra 2.1: A Guided Tour of nodetool and system.log. From Cassandra Summit 2015. Download and check out the presenter notes for tips!
I’ll give a general lay of the land for troubleshooting Cassandra. Then I’ll take you on a deep dive through nodetool and system.log and give you a guided tour of the useful information they provide for troubleshooting. I’ll devote special attention to monitoring the various processes that Cassandra uses to do its work and how to effectively search for information about specific error messages online.
I’m a lead support engineer at DataStax
I’ve been at DataStax just over 3 years and this is my fourth summit
Overall troubleshooting process from a support engineer’s perspective
I’ll focus on the tools Cassandra gives you to do steps 1 to 4
Steps 5 and 6 are the hard parts; but luckily that’s what DataStax support, or the mailing list, or StackOverflow can help you with
Doing the legwork on the first part will make the second part happen much faster
It’s also helpful to keep in mind the various system resources that Cassandra consumes
CPU
Consider both single core and multi-core utilization
Some processes are single-threaded and bottlenecked by a single core
Memory
Heap space, typically limited to a subset of your total physical memory
Cassandra stores many objects off heap to avoid Garbage Collection issues
The OS will use whatever memory is left over for page cache; make sure you leave enough free memory for this
Disk
Available disk space
I/O bandwidth utilization
Network
Primarily concerned with bandwidth
Keep in mind firewalls; the path needs to be open
OS Resources/Limits
File handles, processes, etc.
Make sure you set high enough limits in limits.conf or via ulimit
Linux Monitoring Commands: http://www.tecmint.com/command-line-tools-to-monitor-linux-performance/
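For quick reference, a few of the standard commands from that article (a rough sketch; exact flags and package names vary by distribution):
    top                # per-process CPU and memory; press 1 to show per-core utilization
    iostat -x 5        # disk I/O utilization and wait times (from the sysstat package)
    free -m            # memory in use vs. what is left over for page cache
    df -h              # available disk space per mount point
    ulimit -a          # current limits on file handles, processes, etc.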
Overview of some of the most useful nodetool commands and what they do
Nodetool documentation: http://docs.datastax.com/en/cassandra/2.1/cassandra/tools/toolsNodetool_r.html
Nodetool status gives an overview of the entire cluster’s status
The cluster is divided into datacenters
The first column shows whether a node is up or down
The second column shows the node’s state, which is one of:
normal
leaving the cluster
joining the cluster
moving its token
The IP address of each node.
This will be the broadcast_address if specified; otherwise, the listen_address
The amount of data on each node
It is normal for this to differ slightly
Large discrepancies (in percentage terms) could indicate a problem
Wide rows/partitions
Uneven racks
Compaction issues
Number of tokens
Normally 256 for vnodes, 1 for non-vnodes
Percentage of ring each node owns
Note message at the top
Without a keyspace, assumes SimpleStrategy with RF=1
Can cause strange readout when using multiple DCs with a small offset between tokens (nothing to worry about)
With keyspace, shows ownership according to RF in that keyspace
With keyspace, should add up to RF times 100%
May differ slightly from node to node with vnodes due to random token distribution
UUID of the node. Used to uniquely identify the node when running removenode command.
The rack the node is on
Used to avoid single point of failure
Avoid if not using vnodes
If using vnodes, ensure the same number of nodes are in each rack
In this example, we have one node that is down, so that is where we’d focus our investigation
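For example (my_keyspace is a placeholder; substitute one of your own keyspaces):
    nodetool status                  # ownership computed as if RF=1 (note the warning at the top)
    nodetool status my_keyspace      # ownership computed using that keyspace's replication settings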
Nodetool ring is an old way of checking status
Shows the same information as nodetool status, except it also lists each token
When using vnodes, it will show every token on each node and becomes difficult to read
Nodetool info shows information specific to the node where it is run
Some of the information is also shown by nodetool ring/status
Same information shown by nodetool status and ring
Not going to rehash this
Whether gossip, thrift, and native transport are enabled
Can be individually enabled/disabled with nodetool commands
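For reference, the corresponding nodetool commands (each has a matching enable/disable pair):
    nodetool disablegossip   /  nodetool enablegossip
    nodetool disablethrift   /  nodetool enablethrift
    nodetool disablebinary   /  nodetool enablebinary    # native transport (CQL)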
Gossip generation; increases each time the node is restarted
How many seconds the node has been running
Heap and off heap memory usage
Heap memory shows amount currently in use and maximum
Amount currently in use may include garbage
The amount of garbage included varies depending on when GC was last run
Off heap memory is stored outside the heap but adds to the process’s overall resident size
If a process gets killed by the kernel OOM killer, make sure it isn’t using too much off heap memory
Number of exceptions that occurred since last restart
Not every error involves an exception, but it’s still a good indicator
Key cache caches the location of partition keys in memory
Avoids using bloom filters to determine which sstables to read
Capacity defaults to 100MB or 5% of the heap, whichever is less
Row cache stores entire rows in memory (entire partitions prior to 2.1)
Disabled by default
Only enable in these circumstances:
Small, very hot data set that will fit in memory
Data should be read much more than written, because writes invalidate rows in the cache
Prior to 2.1, only useful on small partitions (no clustering keys!)
OK to use with clustering keys in 2.1 because individual rows are cached
The counter cache holds hot counter values in memory
2.1 introduced more accurate counters
Performance cost due to read-before-write
Counter cache mitigates performance cost
Capacity defaults to 50MB or 2.5% of heap, whichever is less
Entries, size and capacity
Entries is number of keys/rows/counters in the cache
Size is amount actually in use (bytes)
Capacity is amount specified by key_cache_size_in_mb in cassandra.yaml
If size is consistently much less than capacity, you have this set too high
Cache hits, total requests, and hit rate
If hit rate is low, try increasing cache capacity slowly until you see diminishing returns
Caches are periodically saved to disk and reloaded when a node is restarted
This is to avoid a cold cache after restarts
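To double-check the cache settings on a node, something like this works (the cassandra.yaml path is an assumption; it varies by install):
    grep -E 'key_cache_size_in_mb|row_cache_size_in_mb|counter_cache_size_in_mb' /etc/cassandra/cassandra.yaml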
tpstats shows thread pool statistics
Cassandra has various thread pools that handle important foreground and background tasks
This information is also logged to system.log by StatusLogger.java periodically or when a message is dropped
Name of the thread pool
Number of tasks being actively serviced by a thread
For several stages, this can be configured in cassandra.yaml
For others, it is equal to the number of CPU cores
For a few others, it is a hardcoded limit, usually 1
Number of tasks waiting to be serviced
Unless limited via yaml setting, limit is ~2 billion
You’ll run out of memory long before you ever hit this limit
High number of pending tasks indicates the stage is overloaded
Number of tasks completed since last restart
Number of tasks currently blocked for I/O
Should almost always be 0
Total number of tasks blocked since last restart.
Usually zero except for FlushWriter
At the bottom is a list of message types and the number of messages dropped
Load shedding drops requests that have been pending beyond timeout specified in cassandra.yaml
Dropped messages usually indicate an overloaded cluster; check for other causes, then add more nodes
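A couple of ways to keep an eye on this (the log path is an assumption; /var/log/cassandra/system.log is typical for package installs):
    nodetool tpstats
    watch -d -n 5 nodetool tpstats                    # highlight what changes every 5 seconds
    grep -i dropped /var/log/cassandra/system.log     # dropped-message warnings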
Handles local reads for which this node is a replica
Number of threads is controlled by concurrent_reads in cassandra.yaml
Various reads, all handled by ReadStage
READ - normal read on a single partition
RANGE_SLICE - A sequential or secondary index scan over multiple partitions
PAGED_RANGE - Used for automatic paging when result size exceeds row limit
Handles local writes for which this node is a replica
Number of threads is controlled by concurrent_writes in cassandra.yaml
Various writes, handled by MutationStage
MUTATION - normal write
COUNTER_MUTATION - incrementing a counter
Coordinator uses this to process responses from other nodes
Roughly indicates how often this node has been a coordinator
Roughly because unless using CL ONE, need to handle responses from multiple nodes
Request completed but timed out before coordinator could respond to it
Timeout controlled by request_timeout_in_ms in cassandra.yaml
May indicate that coordinator is overloaded
Ensure that client load balancing is set up correctly
Make sure use of batches is appropriate
Logged batches only when atomicity is required
Unlogged batches only when updating multiple rows with the same partition key
Otherwise use asynchronous execution to pipeline requests without overloading a single coordinator
Writes memtables to disk
Number of threads controlled by memtable_flush_writers, should equal number of data drives
Maximum pending tasks controlled by memtable_flush_queue_size in cassandra.yaml
Once the queue is full, writes are blocked until a flush writer becomes available
Large number of all-time-blocked tasks indicates a disk bottleneck; add more/faster disks or more nodes
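To see how memtable_flush_writers and memtable_flush_queue_size are currently set, something like this works (yaml path is an assumption):
    grep -E 'memtable_flush_writers|memtable_flush_queue_size' /etc/cassandra/cassandra.yaml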
Handles compactions
Number of threads controlled by concurrent_compactors in cassandra.yaml
Constantly pending compactions means compactions can’t keep up with writes; mitigation strategies:
Switch from LeveledCompactionStrategy to SizeTieredCompactionStrategy for write-heavy tables
Get faster disks (SSD) if needed
Increase compaction throughput (see the command sketch after this list)
Get faster CPU cores
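A sketch of raising the compaction throughput cap at runtime (the yaml equivalent is compaction_throughput_mb_per_sec; 16 MB/s is the default):
    nodetool setcompactionthroughput 32    # MB/s; 0 removes the throttle entirely
This only changes the running node; update cassandra.yaml as well if you want the change to survive a restart.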
Asynchronous read repairs
Occur for a certain percentage of reads, configurable per table
Pending tasks indicate that you may have read repair chance set too high
READ_REPAIR - dropped write due to a read repair
Can sometimes show up as a TimedOutException on the read that triggered it
Handles hint delivery from the coordinator to a node that’s recently come back up
Large number of hints usually means an unhealthy cluster
Nodes gossip with each other once a second
Completed GossipStage tasks is an easy way to tell approximately how long a node has been up
Prior to 2.0, when using vnodes with a large cluster, gossip could become CPU-bound and get behind
Migrates schema changes to other nodes
If you see pending tasks here, you’re making too many schema changes too quickly
Repair related stages
AntiEntropyStage coordinates repairs
AntiEntropySessions are active repairs in progress
ValidationExecutor builds merkle trees for repair
cfstats shows statistics for individual tables
Tables are grouped by keyspace
Values apply only to the node where cfstats is run, not cluster-wide
Keyspace and table names
Multiple tables grouped under each keyspace
Read and write counts
Per table and at keyspace level
Read and write latency
Per table and averaged at keyspace level
Number of flushes pending against a table
Per table and summed at keyspace level
Space used by the table on this node
Must sum across different nodes to get total space
May include deleted or updated data that hasn’t been compacted yet
Live and total values are usually equal
If some sstables have been discarded but not deleted yet, total may be larger
Space used by snapshots for the table; if you’re running out of space, look here
Space consumed by off-heap data structures
Total off heap memory
Broken down by data structure: bloom filter, index summary, compression metadata
Number of sstables comprising a table
Broken down by level when using LeveledCompactionStrategy
Bloom filter statistics
False positives
If the false positive rate is too high, performance will suffer
Reduce false positive chance for table
Space used
Lower false positive chance requires more space
Bloom filter grows linearly with number of partitions
Increase the false positive chance if the bloom filter is too large
Percent of original size once data is compressed (lower is better)
Compression trades higher CPU usage for lower I/O usage
Usually this is a good tradeoff
But if compression ratio is high, you may want to turn it off for this table
Number of partition keys in the table on this node
Estimated to the nearest index_interval (128 by default)
Rows spread across multiple sstables will inflate this number
Memtable information
Number of entries in the memtable
Bytes of data in the memtable
Bytes of data in the memtable stored off heap
The number of times the memtable has been switched (flushed to disk)
Statistics on partition size, calculated during compaction
Maximum, minimum, and average size
Helps identify tables containing large partitions
Tombstone statistics
Number of live cells versus tombstones encountered when scanning a partition
Rolling average for the last 5 minutes
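For example (names are placeholders):
    nodetool cfstats my_keyspace               # every table in the keyspace
    nodetool cfstats my_keyspace.my_table      # just one table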
cfhistograms provides deeper insight into a specific table
Must specify keyspace and table name when calling nodetool cfhistograms
Information shown by cfhistograms is local to the node where it was run
Keyspace and table for which histograms were requested
Percentiles (percent of X less than this value)
Minimum and maximum values
50% aka median
The number of sstables that had to be scanned for each read query
Scanning more sstables is more expensive and will lead to higher read latencies
Reports counts for reads within the last 5 minutes (approximately)
Write and Read latency in microseconds
Remember to divide by 1000 to get ms
Reports latencies for the last 5 minutes (approximately)
This is the latency for local reads so it doesn’t include round trip time for the coordinator
The size of each partition on the node in bytes
In this example:
the largest partition is 2.23MB
the smallest is 1.29KB
the median partition size is 28.8KB
99% of partitions are 444KB or smaller
Number of cells in a partition
Partition key is stored separately (not counted here)
Cells are name/value pairs used to store column data
There will be one cell per non-key column in each row, plus one sentinel cell per row
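For example (names are placeholders; note the space between keyspace and table here, unlike cfstats):
    nodetool cfhistograms my_keyspace my_table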
Shows network activity
Mode, same as on status: NORMAL, JOINING, LEAVING, MOVING
Active streams
Read repair statistics
Commands sent and responses received while acting as coordinator
Repair session IDs
Nodes exchanging data
Data is streaming over listen address instead of broadcast address
Useful on EC2 because Amazon doesn’t charge for internal traffic
Total number of files and bytes of data to be received from specified node
Total number of files and bytes of data to be sent to specified node
Specific files currently being transferred
Progress for each file
Number of read repairs attempted
Number of mismatches resolved
foreground
background
Number of pending and completed commands sent to other nodes
Number of pending and completed responses received from other nodes
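For example:
    nodetool netstats
    watch -d -n 10 nodetool netstats    # useful for following streaming progress during repair or bootstrap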
Status of current and pending compactions
Number of pending compactions
Compactions in progress
Compaction Type
Compaction - normal compaction
Validation - building merkle trees for repair
Keyspace and table
Bytes complete and total bytes for each compaction
Percent done
Estimated time remaining for the active tasks
Not useful, because:
estimates can be wrong
doesn’t account for pending tasks
History of recent compactions
Shows how much space a compaction reclaimed
Same information available in system.log
Unique ID
Keyspace and column family
Unix timestamp when compaction occurred
Seconds since Jan 1, 1970
Total bytes before compaction
Total bytes after compaction
Row merge counts
Actually the last column on each row
Moved to make output fit on slide
Count of sstables
Number of rows spread across that many sstables prior to compaction
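The relevant commands, for reference:
    nodetool compactionstats      # current and pending compactions
    nodetool compactionhistory    # recently completed compactions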
Used to check for schema disagreements
This is not the information we’re interested in
We’re looking to see how many schema versions there are in the cluster.
No disagreement; only one version of the schema shared by all nodes
Schema disagreement; one node has a different version from all the others
Schema disagreements must be manually resolved
If only one node disagrees, run nodetool resetlocalschema on that node
If multiple nodes disagree
shut down nodes in the minority and delete system/schema_* sstables
start nodes back up one by one
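The commands involved, roughly:
    nodetool describecluster      # look at the Schema versions section
    nodetool resetlocalschema     # run only on the single node that disagrees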
system.log is the most important tool for troubleshooting Cassandra
Basic format
Level: INFO, WARN, ERROR by default; DEBUG only if configured
Thread: use the ID to correlate messages from the same thread
Date & Time: use to correlate messages across multiple nodes, time duration of events
Source File/Line No: code that logged the message, not necessarily where an error occurred; we’ll talk about stack traces later
Exception - What kind of error occurred?
Stack trace – where did the error happen?
Most local at top to most global at bottom
Wall of text — we’ll dissect it in the next slides
Organization names – whose fault is it?
Sub-packages usually group major application subsystems
Class Name – specific object
Will usually, but not always, match the filename
Nested Classes - $ separates outer class from inner class(es)
Method belongs to the inner class
Method name – what was the class doing?
<init> indicates that the error occurred in a static initialization block or constructor
File name – where the source code is, should you want to look at it
Also pay attention to the package name so you can find the file within the nested directory structure
When source-diving, start with the most local method and work your way out
Line number – where to look in the code (available on github.com/apache/cassandra)
Be careful! Line numbers change between versions, so make sure you select the right version in github
Pay attention to nested exceptions
Each exception has its own stack trace which may be completely different
The outer exception may be too general because it’s been rethrown from unrelated code
The innermost exception will be the actual root cause of the error
Best to use a combination of outer and inner exception as search terms
Exception will usually have an error message
Provides additional information about the circumstances of the exception
Usually good to search for the exception and message together
Look out for embedded numbers or strings; these may change from one message to the next, and including them will undesirably narrow your search
Some additional examples of organizations and subsystems
Use exception and several package+class+method names
Exception alone often isn’t sufficient because the same error can occur many different places
You’ll find the same exception in unrelated software if it’s a standard java exception.
Add several package/class/method combinations to narrow down the exception
Use at least the topmost method and the first org.apache.cassandra method
Use quotes around individual elements (especially if they contain spaces)
Line numbers shouldn’t be part of your search criteria because you may not find the same error in a different version
Likewise, exclude specific numbers and strings like names and counts from your search
Use Google’s site: feature to narrow search terms to the Apache JIRA, the Cassandra mailing list, or Stack Overflow
Add or remove additional methods as needed to narrow or broaden search
These might be a good set of search terms for this exception
Include both exceptions and methods from both stack traces
Include exception and error message grouped together inside quotes
Know how to recognize a restart
Check versions of major components and JVM
Confirm settings are what you think they are
Make sure you have JNA installed
Know when node is ready to serve requests
Cassandra writes updates to the commit log on disk and the memtables in memory
The memtables are flushed to sstables on disk as required
First the flush is enqueued
There are a limited number of FlushWriter threads
Flushes wait in the queue until a FlushWriter is available
When a FlushWriter becomes available, the flush begins
Eventually the flush completes.
This is not an instantaneous process because disks are slow.
This is the name of the table
You can use this to link the enqueuing and writing of the flush
Make note of the MemtableFlushWriter thread doing the flush
You can use this to link the beginning and end of the flush
The thread that enqueues the memtable provides clues about why it was flushed:
When the memory fills up, the flush is enqueued by SlabPoolCleaner (for on-heap) or NativePoolCleaner (for off-heap)
If the commit log space fills up, the flush is enqueued by OptionalTasks
Other threads may enqueue flushes for other reasons
If you have lots of tiny sstables, it may be from flushing too frequently so try to figure out why
This shows the number of bytes stored in the memtable both on-heap and off-heap
Also shows the percentage of total space devoted to memtables that this one consumed
serialized bytes, larger than in-memory bytes, due to serialization overhead
This shows the name of the sstable that the memtable was written to on disk
Note the times on the messages
Time between first and second messages is how long the flush waited in the queue
Time between second and third messages is how long the flush took to complete
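One way to pull the whole flush lifecycle out of the log (message wording differs slightly between versions, so treat these patterns as a starting point; log path varies by install):
    grep -E 'Enqueuing flush|Writing Memtable|Completed flushing' /var/log/cassandra/system.log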
Sstables are immutable; updates go into new sstables
Reads scan over multiple sstables to stitch together a row
SSTables must eventually be compacted together to keep reads fast
Size-tiered compactions occur when a sufficient number of similarly sized sstables exist
In leveled compaction, sstables are written to Level 0 and moved to higher levels as they are compacted
Leveled compactions occur continuously as long as sstables exist in Level 0
Note the CompactionExecutor thread doing the compaction
The thread ID can be used to link together the messages
The compaction is beginning
The compaction is complete
The sstables that are going to be compacted
How many sstables were compacted
The name of the new sstable created by the compaction
The number of bytes in the original files
The number of bytes in the new file
The percentage of the original size after:
Updates were merged
Tombstones and expired TTLs were removed
The time the compaction took and the rate in MB/sec
The sum of the number of rows in each sstable
The number of unique rows across all compacted sstables
X:Y where Y rows were split across X sstables
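To find these compaction messages in the log (patterns are approximate; log path varies by install):
    grep -E 'Compacting \[|Compacted ' /var/log/cassandra/system.log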
During compaction, you may see one or more messages about partitions being compacted incrementally
Logical CQL 3 rows sharing the same partition key form a physical row when stored in the cluster
Newer versions of Cassandra say partition instead of row
Large partitions can cause a number of problems for Cassandra
Uneven distribution of data between nodes
Large memory usage when large partitions are read all at once
Generating lots of garbage when compacting a wide row
Slower compactions because they’re done incrementally on disk
These messages can help identify large rows
The keyspace, table, and partition key
The size of the partition
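A quick way to hunt for these messages (the 2.1 wording mentions compacting a large row incrementally, so that word makes a handy pattern):
    grep -i incrementally /var/log/cassandra/system.log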
Garbage collections are a necessary evil
Some garbage collections run concurrently with Cassandra, but others stop the world
Cassandra logs any stop-the-world collections that last longer than 200ms
Stop the world collections cause nodes to stop responding to gossip and client requests
This shows the number of ms elapsed for the collection
Long collections increase latency of read and write requests
Very long collections will prevent the node from gossiping and other nodes will think it’s down
Note the time on each GCInspector message.
Even if the individual collections are fast, too many collections within a short timespan can hurt throughput
This shows the amount of data in each generation before and after the collection
Java divides the heap into different generations and uses different GC approaches on each
This is based on the observation that objects either tend to be short-lived or long-lived and different approaches work better in each scenario
The size of different spaces can be tweaked as well as the rules for moving objects between them
Java allocates new objects into the eden space
Eden objects that survive a single collection are promoted to the survivor space
Survivor objects that survive a specified number of collections get further promoted to the tenured generation
This shows the type of collection that occurred
ParNew collections occur when the young gen is collected. These are stop-the-world.
The young gen is usually small so ParNew is usually fast
If there’s not enough contiguous space in the old gen to promote an object, the old gen must be compacted, which takes a long time
ConcurrentMarkSweep normally runs concurrently with the application and does not stop the world
If the concurrent collection can’t keep up with the rate at which garbage is generated, a stop-the-world collection occurs
Stop-the-world CMS can take a very long time because the old gen is usually big
G1 is a newer garbage collector that can optionally be used with Cassandra
G1 divides the heap into multiple young and tenured regions and collects different regions independently
Recent tests have shown that G1 provides lower latency, with comparable throughput and less tuning than the older GC options
G1 will be the default garbage collector starting in Cassandra 3.0.
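To gauge how often these long pauses are happening on a node, a simple check (log path varies by install):
    grep GCInspector /var/log/cassandra/system.log        # each line is a logged pause
    grep -c GCInspector /var/log/cassandra/system.log     # count since the log last rolled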
Flapping is often caused by garbage collections
The nodes go up-down-up-down, repeatedly
That’s why it’s called flapping
Notice the nodes that are flapping
If a single node is flapping, check the logs on that node during the same timeframe and see if GC is occurring
If multiple nodes are reported up and down, the local node may be the problem
If a node is doing GC it won’t be able to receive gossip messages from another node and may think they’re down
Check for GC messages in the local log around the time that flapping occurs
Note the time that flapping occurs
If it happens infrequently, it may not be a problem
If it happens multiple times a minute, it is a problem
Check the other nodes’ logs for GC events that occurred at the same time
If you don’t see anything in the other nodes’ logs, it may be a network issue
You can reduce the failure detector’s sensitivity by increasing phi_convict_threshold in cassandra.yaml
Default value is 8; maximum recommended value is 12 (useful on high latency networks such as AWS)
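Two quick checks when investigating flapping (paths are assumptions and vary by install):
    grep -E 'is now DOWN|is now UP' /var/log/cassandra/system.log   # when and how often nodes flap
    grep phi_convict_threshold /etc/cassandra/cassandra.yaml        # current failure-detector setting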
If a write comes in for a down node, the coordinator will store hints for it
When the node comes back up, the nodes that have hints for it will send them
Hints are no longer stored after the node has been down longer than the period specified by max_hint_window_in_ms in cassandra.yaml.
This is to prevent a node that comes back up from being inundated with more hints than it can handle
Any node that has been down longer than this period of time needs to run nodetool repair
Flapping can cause excessive hint buildup, which adds extra burden for both the coordinator and the node that is flapping
This can lead to cascading failures
Repairs are initiated by running nodetool repair
They do a full comparison of all the data for a particular token range with the other replicas for that range, then exchange any data that is out of sync
system.log shows the process from start to finish
Note the UUID for the repair session. This is your key to correlating the various messages.
When a repair session begins, you will see a “new session” message
It will report the nodes it’s going to sync with
The token ranges it’s going to sync
And the keyspace and column families it’s going to sync
The first step is for the repair leader to request merkle trees from all the other replicas
A message is logged reporting the receipt of each requested merkle tree
Make sure you see a message that the merkle tree was received from each node that it was requested from
After comparing the merkle trees, if the nodes are in sync, you’ll see a message like this
If not, you’ll see a message like this, reporting how many ranges are out of sync
The node will then begin a streaming repair with the out-of-sync replica
Another message reports when the streaming task has succeeded
A message will report when each table has been fully synced. This means either it was in sync to begin with, or all the streaming tasks necessary to sync it completed.
Once all tables are synced, a message will report that the overall repair session completed successfully.
If you see a “new session” message for a particular ID but not a “session completed successfully”, the repair is still running.
If a repair doesn’t complete successfully after some time, you should look more closely at the other messages for that session to see where it might be stuck.
Sometimes network issues can disrupt the streaming of data or a merkle tree, causing repair to hang
Other times, there is simply a lot of data, and building merkle trees can take a long time, as can streaming data
Increasing compaction throughput and streaming throughput will help speed the process, at the cost of using extra I/O and network bandwidth.
Check the other nodes involved in the repair for messages using the same session ID
Check for any errors that would have disrupted the repair
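A sketch of following one repair session across the cluster; run the same grep on each node involved (the placeholder is the UUID from the new session message, and the log path varies by install):
    grep '<session-uuid>' /var/log/cassandra/system.log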
Before we end, I just want to go back to the troubleshooting process I discussed at the beginning
Next time you have a problem, think about the tools at your disposal and how they can help you with these steps