Cluster Management Operations¶
This section covers operational procedures for managing Cassandra cluster topology: adding capacity, removing nodes, replacing failed hardware, and managing multi-datacenter deployments.
Architecture Reference
For conceptual details on how Cassandra manages cluster membership, see Cluster Management Architecture.
Operations Reference¶
| Operation | Use Case | Command | Impact |
|---|---|---|---|
| Add Node | Expand capacity | Start node with `auto_bootstrap: true` | Streaming to new node |
| Decommission | Graceful removal (node up) | `nodetool decommission` | Streaming from departing node |
| Remove Node | Forced removal (node down) | `nodetool removenode` | Streaming among remaining nodes |
| Replace Node | Hardware replacement | `replace_address_first_boot` | Streaming to replacement |
| Scale Up | Add multiple nodes | Sequential bootstrap | Multiple streaming operations |
| Scale Down | Reduce capacity | Sequential decommission | Multiple streaming operations |
| Add Datacenter | Geographic expansion | Rebuild from existing DC | Cross-DC streaming |
| Remove Datacenter | Consolidation | Decommission all DC nodes | Data redistribution |
| Cleanup | Reclaim space after adding | `nodetool cleanup` | I/O intensive |
| Assassinate | Last resort removal | `nodetool assassinate` | Immediate, no streaming |
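Unlike the nodetool-driven operations above, `replace_address_first_boot` is supplied as a JVM system property when the replacement node first starts. A minimal sketch, assuming the setting is managed through cassandra-env.sh and using a hypothetical address for the dead node:

```bash
# cassandra-env.sh on the replacement node; 10.0.0.12 is a hypothetical example.
# The property is honored only on first boot; remove it once the node has joined.
JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address_first_boot=10.0.0.12"
```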
Safety Requirements¶
The following requirements must be observed for all topology operations:
Pre-Operation Requirements¶
| Requirement | Rationale |
|---|---|
| All nodes must show UN status | Topology changes with degraded nodes risk data loss |
| No repairs may be running | Concurrent operations cause unpredictable behavior |
| No other topology change may be in progress | Only one topology change may occur at a time |
| Sufficient disk headroom must exist | Streaming requires temporary additional space |
| Schema agreement must be confirmed | Schema disagreement causes bootstrap failures |
# Pre-flight verification
nodetool status # All nodes UN
nodetool describecluster # Single schema version
nodetool compactionstats # No heavy compaction
nodetool netstats # No active streaming
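The same checks can be scripted so an operation is refused automatically when a precondition fails. A minimal sketch, assuming standard nodetool output wording and formatting, which can differ between versions:

```bash
#!/usr/bin/env bash
# Pre-flight gate: refuse to proceed unless the cluster looks quiet and healthy.
set -euo pipefail

# 1. Every node must report UN (Up/Normal).
if nodetool status | grep -E '^[UD][NLJM] ' | grep -vq '^UN'; then
    echo "Aborting: at least one node is not in UN state" >&2
    exit 1
fi

# 2. All nodes must agree on a single schema version (one UUID listed).
schema_versions=$(nodetool describecluster | grep -cE '[0-9a-f]{8}-[0-9a-f-]{27}:' || true)
if [ "${schema_versions}" -ne 1 ]; then
    echo "Aborting: ${schema_versions} schema versions reported" >&2
    exit 1
fi

# 3. No streaming already in progress on this node.
if ! nodetool netstats | grep -q 'Not sending any streams'; then
    echo "Aborting: streaming appears to be in progress (check nodetool netstats)" >&2
    exit 1
fi

echo "Pre-flight checks passed"
```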
Operational Constraints¶
Critical Constraints
- One operation at a time: do not run multiple topology changes concurrently. Let the cluster complete one operation before starting another.
- Never bootstrap multiple nodes simultaneously unless using manual token assignment; concurrent bootstraps with vnodes can cause token collisions.
- Run cleanup after adding nodes. Existing nodes retain data they no longer own until cleanup executes (see the sketch after this list).
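Cleanup runs on each node that existed before the expansion, not on the new node. A minimal sketch, assuming a hypothetical inventory file of pre-existing nodes and SSH access to each:

```bash
# Run cleanup serially on every node that existed before the new node joined.
# existing_nodes.txt is a hypothetical inventory file, one hostname per line.
# Cleanup is I/O intensive, so avoid running it on many nodes at once.
while read -r host; do
    echo "Running cleanup on ${host}"
    ssh "${host}" nodetool cleanup
done < existing_nodes.txt
```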
Operation Selection Guide¶
Node is Healthy and Accessible¶
| Goal | Operation |
|---|---|
| Remove node permanently | Decommission |
| Move node to different hardware | Decommission → Add Node |
| Replace with same IP | Replace Node (faster) |
Node is Down or Unresponsive¶
| Scenario | Operation |
|---|---|
| Node recoverable (disk/network issue) | Fix issue, node rejoins automatically |
| Node unrecoverable, data on other replicas | Remove Node |
| Node unrecoverable, need same token range | Replace Node |
| Remove stuck in gossip | Assassinate (last resort) |
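For the forced-removal paths above, `nodetool removenode` takes the dead node's host ID (shown in the Host ID column of `nodetool status`), while `nodetool assassinate` takes its IP address. A minimal sketch with hypothetical identifiers:

```bash
# Host ID comes from the Host ID column of nodetool status for the DN node.
nodetool removenode 0a1b2c3d-4e5f-6789-abcd-ef0123456789   # hypothetical host ID

# Check redistribution progress; force only if streaming is stuck.
nodetool removenode status
# nodetool removenode force

# Absolute last resort when removenode cannot complete:
nodetool assassinate 10.0.0.12                              # hypothetical IP address
```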
Capacity Planning¶
| Goal | Operation |
|---|---|
| Increase capacity | Scale Up (add nodes) |
| Decrease capacity | Scale Down (decommission nodes) |
| Add geographic redundancy | Add Datacenter |
| Consolidate datacenters | Remove Datacenter |
Streaming Behavior¶
All topology changes except assassinate involve data streaming between nodes.
Streaming Direction by Operation¶
| Operation | Data Flows From | Data Flows To |
|---|---|---|
| Add node | Existing nodes | New node |
| Decommission | Departing node | Remaining nodes |
| Remove node | Remaining replicas | Other replicas |
| Replace node | Remaining replicas | Replacement node |
| Rebuild | Source datacenter | Target datacenter |
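For the rebuild row, cross-DC streaming is triggered explicitly on each node in the target datacenter by naming the source datacenter. A minimal sketch with a hypothetical datacenter name:

```bash
# Run on every node in the new datacenter; dc_existing is a hypothetical source DC name.
nodetool rebuild -- dc_existing
```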
Estimated Duration¶
Streaming duration depends on data volume and network bandwidth:
| Data Volume | 1 Gbps Network | 10 Gbps Network |
|---|---|---|
| 100 GB | 15-30 min | 5-10 min |
| 500 GB | 1-2 hours | 15-30 min |
| 1 TB | 2-4 hours | 30-60 min |
| 5 TB | 12-24 hours | 2-4 hours |
Streaming Throughput
Default streaming throughput is 200 Mbps. This may be increased for faster operations:
nodetool setstreamthroughput 400 # Mbps (megabits per second)
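The durations above roughly follow from data volume divided by effective throughput, plus overhead for validation and compaction. A rough estimator sketch; the script name and interface are illustrative, and it treats the lower of link speed and the streaming throttle as the bottleneck:

```bash
#!/usr/bin/env bash
# Rough streaming-time estimate (illustrative helper, not part of Cassandra).
# Usage: stream_estimate.sh <volume_gb> <link_mbps> <throttle_mbps>
volume_gb=${1:?"usage: stream_estimate.sh <volume_gb> <link_mbps> <throttle_mbps>"}
link_mbps=${2:?"missing link_mbps"}
throttle_mbps=${3:?"missing throttle_mbps"}

# The effective rate is capped by whichever is lower: the network or the throttle.
rate_mbps=$(( link_mbps < throttle_mbps ? link_mbps : throttle_mbps ))

# 1 GB is roughly 8000 megabits (decimal); report raw transfer time in hours.
awk -v gb="${volume_gb}" -v mbps="${rate_mbps}" \
    'BEGIN { printf "~%.1f hours before overhead\n", gb * 8000 / mbps / 3600 }'
```

For example, 1 TB over a 1 Gbps link with the throttle raised to 1000 Mbps works out to roughly 2.2 hours of raw transfer, consistent with the 2-4 hour range above once validation and compaction overhead are added. If a single sender's default 200 Mbps throttle is the bottleneck instead, the same transfer stretches to roughly 11 hours, which is why the throttle is often raised for large operations.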
Monitoring Topology Operations¶
Key Commands¶
# Cluster membership status
nodetool status
# Streaming progress
nodetool netstats
# Streaming progress with human-readable sizes
nodetool netstats -H
# Node state transitions
nodetool gossipinfo | grep STATUS
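During long operations it helps to poll these continuously rather than running them by hand. A minimal sketch, assuming watch is available and the default "Not sending" wording in nodetool netstats output:

```bash
# Poll streaming progress every 30 seconds, hiding the idle "Not sending" lines.
watch -n 30 "nodetool netstats -H | grep -v 'Not sending'"
```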
Node States During Operations¶
| State | Code | Meaning |
|---|---|---|
| Normal | UN | Fully operational |
| Joining | UJ | Bootstrap in progress |
| Leaving | UL | Decommission in progress |
| Moving | UM | Token move in progress |
| Down | DN | Node unreachable |
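A quick way to summarize the cluster by state during an operation, assuming the two-letter state code is the first column of each node line in nodetool status output:

```bash
# Count nodes per state code (UN, UJ, UL, UM, DN, ...).
nodetool status | awk '/^[UD][NLJM] / { count[$1]++ } END { for (s in count) print s, count[s] }'
```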
JMX Metrics¶
# Streaming progress
org.apache.cassandra.metrics:type=Streaming,scope=*,name=*
# Compaction (impacts streaming)
org.apache.cassandra.metrics:type=Compaction,name=PendingTasks
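These beans can be read with any JMX client. A minimal sketch using jmxterm against the default JMX port 7199; the jar name/version and the Value attribute are assumptions to verify in your environment:

```bash
# Read pending compactions over JMX with jmxterm in non-interactive mode.
echo "get -b org.apache.cassandra.metrics:type=Compaction,name=PendingTasks Value" \
  | java -jar jmxterm-1.0.4-uber.jar -l localhost:7199 -n
```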
Failure Handling¶
Operation Interrupted¶
| Operation | If Interrupted | Recovery |
|---|---|---|
| Bootstrap | Node partially populated | Clear data, restart bootstrap |
| Decommission | Node partially drained | Cannot resume; complete manually or restore |
| Remove | Partial redistribution | Re-run removenode |
| Replace | Replacement partially populated | Clear data, restart replacement |
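"Clear data" in the table above means wiping the node's local state before retrying, so the next start bootstraps from scratch. A minimal sketch, assuming a package install with default paths; confirm them against data_file_directories, commitlog_directory, and saved_caches_directory in cassandra.yaml:

```bash
# Stop Cassandra before touching any files.
sudo systemctl stop cassandra

# Wipe local state so the next start performs a clean bootstrap or replacement.
# Paths below are package-install defaults; confirm against cassandra.yaml first.
sudo rm -rf /var/lib/cassandra/data/* \
            /var/lib/cassandra/commitlog/* \
            /var/lib/cassandra/saved_caches/* \
            /var/lib/cassandra/hints/*

sudo systemctl start cassandra
```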
Streaming Failures¶
If streaming fails during an operation:
- Check network connectivity between nodes
- Verify disk space on source and target
- Review `system.log` for specific errors
- Increase `streaming_socket_timeout_in_ms` if timeouts occur
# cassandra.yaml - increase for large partitions
streaming_socket_timeout_in_ms: 86400000 # 24 hours
Procedures¶
- Adding Nodes - Bootstrap new nodes into the cluster
- Removing Nodes - Decommission, removenode, and assassinate
- Replacing Nodes - Replace failed nodes with new hardware
- Scaling Operations - Scale cluster capacity up or down
- Multi-Datacenter Operations - Add and remove datacenters
- Cleanup Operations - Post-topology-change maintenance
- Troubleshooting - Diagnose and resolve topology issues
Related Documentation¶
- Gossip Protocol - How nodes discover cluster state
- Node Lifecycle - Node states and transitions
- Repair Operations - Post-topology repair procedures
- Backup & Restore - Data protection during changes