Cluster Management Troubleshooting¶
This guide covers diagnosing and resolving common issues that arise during cluster topology operations such as adding, removing, and replacing nodes.
Diagnostic Commands¶
Essential Commands¶
# Cluster state
nodetool status
# Streaming status
nodetool netstats
# Schema agreement
nodetool describecluster
# Gossip state
nodetool gossipinfo
# Active operations
nodetool compactionstats
Log Analysis¶
# Recent errors
grep -i "error\|exception\|failed" /var/log/cassandra/system.log | tail -50
# Streaming issues
grep -i stream /var/log/cassandra/system.log | tail -50
# Bootstrap/decommission
grep -i "bootstrap\|decommission\|leaving\|joining" /var/log/cassandra/system.log | tail -50
Bootstrap Issues¶
Node Won't Join Cluster¶
Symptoms: New node starts but doesn't appear in nodetool status
Diagnostic steps:
# Check if Cassandra is running
sudo systemctl status cassandra
# Check for startup errors
tail -100 /var/log/cassandra/system.log | grep -i error
# Verify network connectivity to seeds
nc -zv <seed_ip> 7000
nc -zv <seed_ip> 9042
Common causes and solutions:
| Cause | Diagnosis | Solution |
|---|---|---|
| Cluster name mismatch | grep cluster_name cassandra.yaml | Fix name, clear data, restart |
| Seeds unreachable | nc -zv seed 7000 fails | Check firewall, network |
| Wrong listen_address | Check logs for binding errors | Fix listen_address in yaml |
| Data directory not empty | Check /var/lib/cassandra/data | Clear data directories |
| Schema disagreement | nodetool describecluster shows multiple versions | Wait or restart seeds |
# Clear data for fresh start
sudo systemctl stop cassandra
sudo rm -rf /var/lib/cassandra/data/*
sudo rm -rf /var/lib/cassandra/commitlog/*
sudo rm -rf /var/lib/cassandra/saved_caches/*
sudo systemctl start cassandra
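After the restart, confirm the node actually joins; a small sketch that re-checks the ring every 30 seconds (replace <new_node_ip>), expecting UJ while joining and UN once complete:
# Watch the new node move from UJ (joining) to UN (normal)
watch -n 30 "nodetool status | grep <new_node_ip>"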
Bootstrap Stalled¶
Symptoms: Node shows UJ (Joining) for an extended period and streaming shows no progress
Diagnostic steps:
# Check streaming progress
nodetool netstats
# Look for streaming errors
grep -i "stream.*error\|stream.*failed" /var/log/cassandra/system.log
# Check source nodes
nodetool status
Common causes and solutions:
| Cause | Diagnosis | Solution |
|---|---|---|
| Source node overloaded | High CPU/IO on source | Wait or reduce stream throughput |
| Network issues | Packet loss, timeouts in logs | Fix network |
| Large partitions | Timeout errors in logs | Increase streaming_socket_timeout_in_ms |
| Disk full | Check df -h | Free space |
# cassandra.yaml - increase timeouts for large partitions
streaming_socket_timeout_in_ms: 86400000 # 24 hours
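If the stall is due to an overloaded source node rather than large partitions, a hedged alternative is to lower the outbound streaming throttle on that source and keep watching progress from the joining node (throttle values are megabits per second; restore the previous value afterwards):
# On the overloaded source node: reduce the outbound streaming throttle
nodetool setstreamthroughput 100
# On the joining node: re-check streaming progress every 60 seconds
watch -n 60 nodetool netstats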
Bootstrap Failed Mid-Way¶
Symptoms: Node crashed or was stopped during bootstrap
Recovery:
# Clear partial data and retry
sudo systemctl stop cassandra
sudo rm -rf /var/lib/cassandra/data/*
sudo rm -rf /var/lib/cassandra/commitlog/*
sudo rm -rf /var/lib/cassandra/saved_caches/*
sudo systemctl start cassandra
Decommission Issues¶
Decommission Stuck¶
Symptoms: Node remains in UL (Leaving) state for an extended period
Diagnostic steps:
# Check streaming progress
nodetool netstats
# Look for target node issues
nodetool status
# Check logs
grep -i "decommission\|stream" /var/log/cassandra/system.log | tail -50
Common causes and solutions:
| Cause | Diagnosis | Solution |
|---|---|---|
| Target nodes unavailable | nodetool status shows DN | Fix target nodes |
| Network issues | Streaming errors in logs | Fix network |
| Target disk full | Check target df -h | Free space on targets |
| Streaming throttled | Low throughput in netstats | Increase stream throughput |
**Cannot Cancel Decommission**
Once decommission starts, it must complete. If truly stuck:
- Try waiting longer (decommission can take many hours)
- Fix underlying issues (network, disk, node health)
- As a last resort: stop the node and use nodetool removenode from another node
Decommission Interrupted¶
Symptoms: Decommissioning node was stopped or crashed
Recovery options:
| Scenario | Solution |
|---|---|
| Node can be restarted | Restart; if decommission does not resume, re-run nodetool decommission |
| Node cannot be restarted | Use nodetool removenode from other node |
| Data partially streamed | Run repair after recovery |
Removenode Issues¶
Removenode Not Progressing¶
Symptoms: nodetool removenode status shows no progress
Diagnostic steps:
# Check removenode status
nodetool removenode status
# Check streaming
nodetool netstats
# Check logs on executing node
grep -i "remove\|stream" /var/log/cassandra/system.log | tail -50
Solutions:
# If stuck for > 1 hour with no progress
nodetool removenode force
# After force removal, run repair
nodetool repair -full
Wrong Node Removed¶
Symptoms: Accidentally removed wrong node
Recovery:
- If node still has data: restart it (will try to rejoin)
- If data cleared: add as new node (bootstrap)
- Run repair to ensure consistency
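For the repair step, one common pattern (a sketch, not the only option) is a full primary-range repair run on each node in turn, so every token range is repaired exactly once:
# Run on every node, one node at a time, after the topology has settled
nodetool repair -full -pr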
Replacement Issues¶
Replacement Won't Start¶
Symptoms: Node with replace_address_first_boot won't start
Diagnostic steps:
# Check for startup errors
tail -100 /var/log/cassandra/system.log | grep -i error
# Verify dead node is recognized
nodetool status # Should show DN for dead node
Common causes and solutions:
| Cause | Diagnosis | Solution |
|---|---|---|
| Dead node not recognized | Not in nodetool status as DN | Wait for gossip timeout |
| Wrong IP in replace option | IP mismatch | Correct the JVM option |
| Data directory not empty | Has old data | Clear data directories |
| Version mismatch | Check versions | Install matching version |
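For reference, the replace option is passed as a JVM system property before the node's first boot; a sketch assuming a package install that sources cassandra-env.sh (10.0.1.4 stands in for the dead node's address):
# Appended to cassandra-env.sh; remove the flag once the replacement completes
JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address_first_boot=10.0.1.4"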
Replacement Streaming Stalled¶
Symptoms: Replacement node stuck in UJ state
Apply the same solutions as for a stalled bootstrap (see Bootstrap Stalled above).
Replacement Node Has Wrong Tokens¶
Symptoms: After replacement, node has different token count
Cause: num_tokens doesn't match dead node
Solution:
# Must restart with correct configuration
sudo systemctl stop cassandra
sudo rm -rf /var/lib/cassandra/data/*
# Fix cassandra.yaml
num_tokens: <match_dead_node>
# Restart
sudo systemctl start cassandra
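One way to confirm the dead node's token count before restarting is to query system.peers on a surviving node and count the tokens listed for its address (10.0.1.4 is a placeholder):
# Each element of the tokens set corresponds to one vnode (num_tokens)
cqlsh -e "SELECT tokens FROM system.peers WHERE peer = '10.0.1.4';"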
Streaming Issues¶
Streaming Timeouts¶
Symptoms: Repeated timeout errors during any topology operation
# Typical error in logs
ERROR [Stream...] stream/StreamResultFuture.java:...
Stream failed: java.net.SocketTimeoutException: Read timed out
Solutions:
# cassandra.yaml - increase timeouts
streaming_socket_timeout_in_ms: 86400000 # 24 hours (default: 1 hour)
# For Cassandra 4.0+
stream_entire_sstables: true # Faster for large files
Streaming Too Slow¶
Symptoms: Topology operations taking excessive time
Diagnosis:
# Check current throughput
nodetool getstreamthroughput
# Check network utilization
iftop -i eth0 # or appropriate interface
Solutions:
# Increase streaming throughput (megabits per second)
nodetool setstreamthroughput 400 # Default is 200
# Cassandra 4.0+ in cassandra.yaml
stream_entire_sstables: true
Streaming Failures¶
Symptoms: Repeated streaming failures
Diagnostic steps:
# Check for failures
nodetool netstats | grep -i failed
# Check specific errors
grep -i "stream.*failed\|stream.*error" /var/log/cassandra/system.log
Common causes:
| Cause | Solution |
|---|---|
| Network instability | Fix network issues |
| Disk I/O bottleneck | Reduce concurrent streaming |
| Memory pressure | Increase heap or reduce streaming |
| Firewall issues | Open port 7000 between all nodes |
Gossip Issues¶
Node Stuck in Gossip¶
Symptoms: Node appears in nodetool status but shouldn't (already removed)
Cause: Gossip state not properly propagated
Solutions:
# Option 1: Assassinate the stuck node
nodetool assassinate <stuck_node_ip>
# Option 2: If that fails, rolling restart of cluster
# Start with seeds, then other nodes
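Before assassinating, it is worth confirming what gossip still believes about the endpoint; a sketch (gossipinfo keys each block by endpoint address, and field layout varies by version):
# Inspect the stuck endpoint's gossip state; check its STATUS line
nodetool gossipinfo | grep -A 15 "<stuck_node_ip>"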
Schema Disagreement¶
Symptoms: nodetool describecluster shows multiple schema versions
Diagnostic:
nodetool describecluster
# Example problematic output:
# Schema versions:
# abc-123: [10.0.1.1, 10.0.1.2]
# def-456: [10.0.1.3] <-- Different!
Solutions:
- Wait (schema should converge within minutes)
- If persists, restart the disagreeing node
- If still persists, restart seeds
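While waiting, convergence can be watched from any node; a small sketch:
# Re-check every 10 seconds until only one schema version is listed
watch -n 10 "nodetool describecluster | grep -A 5 'Schema versions'"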
Network Issues¶
Port Connectivity Problems¶
Verification:
# From each node, verify connectivity to all others
for ip in 10.0.1.1 10.0.1.2 10.0.1.3; do
nc -zv $ip 7000 && echo "$ip:7000 OK" || echo "$ip:7000 FAILED"
nc -zv $ip 9042 && echo "$ip:9042 OK" || echo "$ip:9042 FAILED"
done
Required ports:
| Port | Purpose | Required Between |
|---|---|---|
| 7000 | Internode | All nodes |
| 7001 | Internode SSL | All nodes (if SSL) |
| 9042 | Native transport | Clients and nodes |
| 7199 | JMX | Admin hosts |
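If a required port is blocked on the host itself, open it between nodes; a sketch assuming firewalld (adapt for iptables, nftables, or cloud security groups):
# Allow internode traffic on port 7000, then reload firewall rules
sudo firewall-cmd --permanent --add-port=7000/tcp
sudo firewall-cmd --reload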
Cross-DC Connectivity¶
For multi-DC setups:
# Verify cross-DC latency
ping <other_dc_node>
# Should be < 100ms for reasonable performance
# Higher latency impacts streaming and consistency
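Cross-DC streaming is throttled separately from local streaming; if multi-DC topology operations are slow, the inter-DC limit can be checked and raised (values are megabits per second on pre-4.1 versions):
# View and raise the inter-datacenter streaming throttle
nodetool getinterdcstreamthroughput
nodetool setinterdcstreamthroughput 400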
Recovery Procedures¶
Node Completely Unrecoverable¶
If a node cannot be recovered and removenode fails:
# 1. Try removenode
nodetool removenode <host_id>
# 2. If stuck, force
nodetool removenode force
# 3. If still stuck, assassinate
nodetool assassinate <node_ip>
# 4. Run full repair
nodetool repair -full
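Afterwards, confirm the node is really gone from both the ring and gossip; empty output from both commands means it is (replace <removed_node_ip>):
# The removed node should no longer appear in status or gossip state
nodetool status | grep <removed_node_ip>
nodetool gossipinfo | grep <removed_node_ip>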
Cluster Partition (Split Brain)¶
Symptoms: Nodes in different groups can't see each other
Immediate actions:
- Stop writes if possible
- Identify the partition cause (network, firewall)
- Restore connectivity
- Run full repair
# After connectivity restored
nodetool repair -full
Related Documentation¶
- Cluster Management Overview - Operation selection
- Adding Nodes - Bootstrap procedures
- Removing Nodes - Removal procedures
- Replacing Nodes - Replacement procedures
- Repair Operations - Post-issue repair