Cluster Management Operations¶
This section covers operational procedures for managing Cassandra cluster topology: adding capacity, removing nodes, replacing failed hardware, and managing multi-datacenter deployments.
Architecture Reference
For conceptual details on how Cassandra manages cluster membership, see Cluster Management Architecture.
Operations Reference¶
| Operation | Use Case | Command | Impact |
|---|---|---|---|
| Add Node | Expand capacity | Start node with `auto_bootstrap: true` | Streaming to new node |
| Decommission | Graceful removal (node up) | `nodetool decommission` | Streaming from departing node |
| Remove Node | Forced removal (node down) | `nodetool removenode` | Streaming among remaining nodes |
| Replace Node | Hardware replacement | `replace_address_first_boot` | Streaming to replacement |
| Scale Up | Add multiple nodes | Sequential bootstrap | Multiple streaming operations |
| Scale Down | Reduce capacity | Sequential decommission | Multiple streaming operations |
| Add Datacenter | Geographic expansion | Rebuild from existing DC | Cross-DC streaming |
| Remove Datacenter | Consolidation | Decommission all DC nodes | Data redistribution |
| Cleanup | Reclaim space after adding | `nodetool cleanup` | I/O intensive |
| Assassinate | Last resort removal | `nodetool assassinate` | Immediate, no streaming |
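Unlike the nodetool-driven operations above, `replace_address_first_boot` is supplied as a JVM system property when the replacement node first starts. A minimal sketch, assuming the setting is managed through cassandra-env.sh and using a hypothetical address for the dead node:

```bash
# cassandra-env.sh on the replacement node; 10.0.0.12 is a hypothetical example.
# The property is honored only on first boot; remove it once the node has joined.
JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address_first_boot=10.0.0.12"
```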
Safety Requirements¶
The following requirements must be observed for all topology operations:
Pre-Operation Requirements¶
| Requirement | Rationale |
|---|---|
| All nodes must show UN status | Topology changes with degraded nodes risk data loss |
| No repairs may be running | Concurrent operations cause unpredictable behavior |
| No other topology change may be in progress | Only one topology change may occur at a time |
| Sufficient disk headroom must exist | Streaming requires temporary additional space |
| Schema agreement must be confirmed | Schema disagreement causes bootstrap failures |
# Pre-flight verification
nodetool status # All nodes UN
nodetool describecluster # Single schema version
nodetool compactionstats # No heavy compaction
nodetool netstats # No active streaming
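The same checks can be scripted so an operation is refused automatically when a precondition fails. A minimal sketch, assuming standard nodetool output wording and formatting, which can differ between versions:

```bash
#!/usr/bin/env bash
# Pre-flight gate: refuse to proceed unless the cluster looks quiet and healthy.
set -euo pipefail

# 1. Every node must report UN (Up/Normal).
if nodetool status | grep -E '^[UD][NLJM] ' | grep -vq '^UN'; then
    echo "Aborting: at least one node is not in UN state" >&2
    exit 1
fi

# 2. All nodes must agree on a single schema version (one UUID listed).
schema_versions=$(nodetool describecluster | grep -cE '[0-9a-f]{8}-[0-9a-f-]{27}:' || true)
if [ "${schema_versions}" -ne 1 ]; then
    echo "Aborting: ${schema_versions} schema versions reported" >&2
    exit 1
fi

# 3. No streaming already in progress on this node.
if ! nodetool netstats | grep -q 'Not sending any streams'; then
    echo "Aborting: streaming appears to be in progress (check nodetool netstats)" >&2
    exit 1
fi

echo "Pre-flight checks passed"
```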
Operational Constraints¶
Critical Constraints
- One operation at a time: do not run multiple topology changes concurrently. Let the cluster complete one operation before starting another.
- Never bootstrap multiple nodes simultaneously unless using manual token assignment; concurrent bootstraps with vnodes can cause token collisions.
- Run cleanup after adding nodes. Existing nodes retain data they no longer own until cleanup executes (see the sketch after this list).
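Cleanup runs on each node that existed before the expansion, not on the new node. A minimal sketch, assuming a hypothetical inventory file of pre-existing nodes and SSH access to each:

```bash
# Run cleanup serially on every node that existed before the new node joined.
# existing_nodes.txt is a hypothetical inventory file, one hostname per line.
# Cleanup is I/O intensive, so avoid running it on many nodes at once.
while read -r host; do
    echo "Running cleanup on ${host}"
    ssh "${host}" nodetool cleanup
done < existing_nodes.txt
```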
Operation Selection Guide¶
Node is Healthy and Accessible¶
| Goal | Operation |
|---|---|
| Remove node permanently | Decommission |
| Move node to different hardware | Decommission → Add Node |
| Replace with same IP | Replace Node (faster) |
Node is Down or Unresponsive¶
| Scenario | Operation |
|---|---|
| Node recoverable (disk/network issue) | Fix issue, node rejoins automatically |
| Node unrecoverable, data on other replicas | Remove Node |
| Node unrecoverable, need same token range | Replace Node |
| Remove stuck in gossip | Assassinate (last resort) |
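For the forced-removal paths above, `nodetool removenode` takes the dead node's host ID (shown in the Host ID column of `nodetool status`), while `nodetool assassinate` takes its IP address. A minimal sketch with hypothetical identifiers:

```bash
# Host ID comes from the Host ID column of nodetool status for the DN node.
nodetool removenode 0a1b2c3d-4e5f-6789-abcd-ef0123456789   # hypothetical host ID

# Check redistribution progress; force only if streaming is stuck.
nodetool removenode status
# nodetool removenode force

# Absolute last resort when removenode cannot complete:
nodetool assassinate 10.0.0.12                              # hypothetical IP address
```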
Capacity Planning¶
| Goal | Operation |
|---|---|
| Increase capacity | Scale Up (add nodes) |
| Decrease capacity | Scale Down (decommission nodes) |
| Add geographic redundancy | Add Datacenter |
| Consolidate datacenters | Remove Datacenter |
Streaming Behavior¶
All topology changes except assassinate involve data streaming between nodes.
Streaming Direction by Operation¶
| Operation | Data Flows From | Data Flows To |
|---|---|---|
| Add node | Existing nodes | New node |
| Decommission | Departing node | Remaining nodes |
| Remove node | Remaining replicas | Other replicas |
| Replace node | Remaining replicas | Replacement node |
| Rebuild | Source datacenter | Target datacenter |
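For the rebuild row, cross-DC streaming is triggered explicitly on each node in the target datacenter by naming the source datacenter. A minimal sketch with a hypothetical datacenter name:

```bash
# Run on every node in the new datacenter; dc_existing is a hypothetical source DC name.
nodetool rebuild -- dc_existing
```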
Estimated Duration¶
Streaming duration depends on data volume and network bandwidth:
| Data Volume | 1 Gbps Network | 10 Gbps Network |
|---|---|---|
| 100 GB | 15-30 min | 5-10 min |
| 500 GB | 1-2 hours | 15-30 min |
| 1 TB | 2-4 hours | 30-60 min |
| 5 TB | 12-24 hours | 2-4 hours |
Streaming Throughput
Default streaming throughput is 200 Mbps. This may be increased for faster operations:
nodetool setstreamthroughput 400 # Mbps (megabits per second)
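The durations above roughly follow from data volume divided by effective throughput, plus overhead for validation and compaction. A rough estimator sketch; the script name and interface are illustrative, and it treats the lower of link speed and the streaming throttle as the bottleneck:

```bash
#!/usr/bin/env bash
# Rough streaming-time estimate (illustrative helper, not part of Cassandra).
# Usage: stream_estimate.sh <volume_gb> <link_mbps> <throttle_mbps>
volume_gb=${1:?"usage: stream_estimate.sh <volume_gb> <link_mbps> <throttle_mbps>"}
link_mbps=${2:?"missing link_mbps"}
throttle_mbps=${3:?"missing throttle_mbps"}

# The effective rate is capped by whichever is lower: the network or the throttle.
rate_mbps=$(( link_mbps < throttle_mbps ? link_mbps : throttle_mbps ))

# 1 GB is roughly 8000 megabits (decimal); report raw transfer time in hours.
awk -v gb="${volume_gb}" -v mbps="${rate_mbps}" \
    'BEGIN { printf "~%.1f hours before overhead\n", gb * 8000 / mbps / 3600 }'
```

For example, 1 TB over a 1 Gbps link with the throttle raised to 1000 Mbps works out to roughly 2.2 hours of raw transfer, consistent with the 2-4 hour range above once validation and compaction overhead are added. If a single sender's default 200 Mbps throttle is the bottleneck instead, the same transfer stretches to roughly 11 hours, which is why the throttle is often raised for large operations.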
Monitoring Topology Operations¶
Key Commands¶
# Cluster membership status
nodetool status
# Streaming progress
nodetool netstats
# Streaming progress with human-readable sizes
nodetool netstats -H
# Node state transitions
nodetool gossipinfo | grep STATUS
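During long operations it helps to poll these continuously rather than running them by hand. A minimal sketch, assuming watch is available and the default "Not sending" wording in nodetool netstats output:

```bash
# Poll streaming progress every 30 seconds, hiding the idle "Not sending" lines.
watch -n 30 "nodetool netstats -H | grep -v 'Not sending'"
```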
Node States During Operations¶
| State | Code | Meaning |
|---|---|---|
| Normal | UN | Fully operational |
| Joining | UJ | Bootstrap in progress |
| Leaving | UL | Decommission in progress |
| Moving | UM | Token move in progress |
| Down | DN | Node unreachable |
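A quick way to summarize the cluster by state during an operation, assuming the two-letter state code is the first column of each node line in nodetool status output:

```bash
# Count nodes per state code (UN, UJ, UL, UM, DN, ...).
nodetool status | awk '/^[UD][NLJM] / { count[$1]++ } END { for (s in count) print s, count[s] }'
```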
JMX Metrics¶
# Streaming progress
org.apache.cassandra.metrics:type=Streaming,scope=*,name=*
# Compaction (impacts streaming)
org.apache.cassandra.metrics:type=Compaction,name=PendingTasks
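These beans can be read with any JMX client. A minimal sketch using jmxterm against the default JMX port 7199; the jar name/version and the Value attribute are assumptions to verify in your environment:

```bash
# Read pending compactions over JMX with jmxterm in non-interactive mode.
echo "get -b org.apache.cassandra.metrics:type=Compaction,name=PendingTasks Value" \
  | java -jar jmxterm-1.0.4-uber.jar -l localhost:7199 -n
```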
Failure Handling¶
Operation Interrupted¶
| Operation | If Interrupted | Recovery |
|---|---|---|
| Bootstrap | Node partially populated | Clear data, restart bootstrap |
| Decommission | Node partially drained | Cannot resume; complete manually or restore |
| Remove | Partial redistribution | Re-run removenode |
| Replace | Replacement partially populated | Clear data, restart replacement |
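"Clear data" in the table above means wiping the node's local state before retrying, so the next start bootstraps from scratch. A minimal sketch, assuming a package install with default paths; confirm them against data_file_directories, commitlog_directory, and saved_caches_directory in cassandra.yaml:

```bash
# Stop Cassandra before touching any files.
sudo systemctl stop cassandra

# Wipe local state so the next start performs a clean bootstrap or replacement.
# Paths below are package-install defaults; confirm against cassandra.yaml first.
sudo rm -rf /var/lib/cassandra/data/* \
            /var/lib/cassandra/commitlog/* \
            /var/lib/cassandra/saved_caches/* \
            /var/lib/cassandra/hints/*

sudo systemctl start cassandra
```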
Streaming Failures¶
If streaming fails during an operation:
- Check network connectivity between nodes
- Verify disk space on source and target
- Review `system.log` for specific errors
- Increase `streaming_socket_timeout_in_ms` if timeouts occur
# cassandra.yaml - increase for large partitions
streaming_socket_timeout_in_ms: 86400000 # 24 hours
Procedures¶
- Adding Nodes - Bootstrap new nodes into the cluster
- Removing Nodes - Decommission, removenode, and assassinate
- Replacing Nodes - Replace failed nodes with new hardware
- Scaling Operations - Scale cluster capacity up or down
- Multi-Datacenter Operations - Add and remove datacenters
- Cleanup Operations - Post-topology-change maintenance
- Troubleshooting - Diagnose and resolve topology issues
Related Documentation¶
- Gossip Protocol - How nodes discover cluster state
- Node Lifecycle - Node states and transitions
- Repair Operations - Post-topology repair procedures
- Backup & Restore - Data protection during changes