Maintained by AxonOps — production-grade documentation from engineers who operate distributed databases at scale

Scheduling nodetool Commands

nodetool operates on a single node. In production Cassandra deployments spanning multiple nodes, datacenters, and racks, most operational tasks require executing nodetool commands across every node in a coordinated sequence. Manual orchestration introduces significant operational risk.


The Orchestration Challenge

Single-Node Limitation

nodetool connects to one Cassandra node at a time via JMX. A command like nodetool repair -pr only affects the node where it is executed. Achieving cluster-wide coverage requires repeating the command on every node — in the correct order, with appropriate health verification between steps.
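The manual version of this is a per-node loop. The sketch below uses hypothetical host addresses and `ssh` as the transport; note that it has none of the safeguards discussed in the next section — no health checks between nodes, no topology awareness, no resume on interruption.

```python
# Sketch of the manual orchestration burden: to get cluster-wide coverage,
# the same nodetool command must be issued on every node, one at a time.
import subprocess

NODES = ["10.0.1.11", "10.0.1.12", "10.0.2.11"]  # hypothetical addresses

def repair_commands(nodes):
    """Build one command per node: 'nodetool repair -pr' only affects
    the node it runs on, so every node needs its own invocation."""
    return [["ssh", host, "nodetool", "repair", "-pr"] for host in nodes]

def run_sequentially(nodes, execute=subprocess.run):
    # A real script would verify node health between steps;
    # nothing here does that, which is exactly the risk described below.
    for cmd in repair_commands(nodes):
        execute(cmd, check=True)
```

The loop is deliberately naive: it captures only the repetition, not the ordering, health gating, or error handling that a safe rollout requires.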

What Can Go Wrong

| Risk | Consequence |
| --- | --- |
| Executing on too many nodes simultaneously | Loss of quorum availability; potential data unavailability |
| Skipping a node | Incomplete repair; risk of data resurrection from missed tombstone propagation |
| No health verification between steps | Cascading failures if a node does not recover before proceeding |
| Ignoring rack/datacenter topology | Multiple replicas for the same data taken offline simultaneously |
| No error handling | Silent failures leave the cluster in an inconsistent maintenance state |
| No audit trail | Impossible to verify what ran, when, and whether it succeeded |
| Script interruption | No way to resume from where the operation stopped |

Coordination Requirements Vary by Operation

Different nodetool commands have different safety constraints:

| Operation | Constraint |
| --- | --- |
| repair -pr | Must run on every node within gc_grace_seconds; may parallelize on non-overlapping ranges |
| cleanup | Must run on every existing node after topology change; sequential recommended |
| compact | Avoid concurrent execution on nodes sharing replicas |
| upgradesstables | Must run on every node after version upgrade; sequential |
| drain + restart | Rolling: one node at a time with health gates between steps |
| decommission | Single node; verify cluster health before and after |
| flush | Safe to run concurrently |
| setcompactionthroughput | Apply to all nodes; safe to run concurrently |
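One way to make these constraints enforceable in a script is to capture them as data. This is an illustrative sketch only — the policy names are invented for the example and are not a nodetool or AxonOps API:

```python
# Hypothetical policy labels describing how each operation may be scheduled.
SEQUENTIAL, CONCURRENT, RACK_AWARE = "sequential", "concurrent", "rack-aware"

SAFETY = {
    "repair -pr": RACK_AWARE,       # may parallelize on non-overlapping ranges
    "cleanup": SEQUENTIAL,
    "compact": RACK_AWARE,          # avoid nodes sharing replicas
    "upgradesstables": SEQUENTIAL,
    "drain + restart": SEQUENTIAL,  # rolling, one node at a time
    "decommission": SEQUENTIAL,     # single node, with health checks around it
    "flush": CONCURRENT,
    "setcompactionthroughput": CONCURRENT,
}

def may_run_everywhere_at_once(op):
    """Only operations explicitly marked concurrent may hit all nodes."""
    return SAFETY.get(op) == CONCURRENT
```

Driving the orchestrator from a table like this keeps the safety decision out of each individual runbook.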

Multi-Datacenter Complexity

Clusters spanning multiple datacenters introduce additional constraints:

  • Some operations should complete in the local datacenter before proceeding to remote datacenters
  • Rack awareness is required to avoid taking down multiple replicas for the same token range
  • Different datacenters may have different maintenance windows
  • Coordination state must persist across long-running operations that span hours or days
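Datacenter ordering and rack awareness can be sketched as a batching plan. The topology records below are hypothetical; a real tool would derive them from `nodetool status` or the system tables:

```python
from collections import defaultdict

def plan(nodes, dc_order):
    """Order execution: finish one datacenter before the next, and within
    a datacenter take at most one node per rack per batch, so multiple
    replicas of the same range are never down together."""
    batches = []
    for dc in dc_order:
        racks = defaultdict(list)
        for n in nodes:
            if n["dc"] == dc:
                racks[n["rack"]].append(n["host"])
        while any(racks.values()):
            # One node from each rack that still has pending hosts.
            batch = [hosts.pop(0) for hosts in racks.values() if hosts]
            batches.append((dc, batch))
    return batches
```

Each batch contains at most one node per rack, and no batch for a later datacenter is emitted until the earlier one is exhausted.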

Orchestration with AxonOps

AxonOps Operations provides purpose-built orchestration for nodetool commands across Cassandra clusters, with native understanding of Cassandra topology, health, and operational constraints.

Topology-Aware Execution

AxonOps understands the cluster topology — nodes, racks, datacenters, and token ownership — and uses this to coordinate command execution safely:

  • Rack-aware rolling: Only one node per rack is affected at a time, maintaining quorum availability
  • Datacenter ordering: Operations complete in one datacenter before proceeding to the next
  • Token-range awareness: For operations like repair, coordination is based on token ownership to avoid redundant work

Health-Gated Execution

Each step in a rolling operation is gated by cluster health checks:

  • Verify all expected nodes are in UN (Up/Normal) state before proceeding
  • Check that pending compaction backlog is below threshold
  • Monitor streaming activity from previous steps
  • Configurable stabilization wait time between nodes
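A health gate of this kind can be sketched as a polling loop. Here `status_fn` is a caller-supplied probe (for example, a parser over `nodetool status` output), and the timeout and poll interval are illustrative defaults, not AxonOps settings:

```python
import time

def wait_until_healthy(status_fn, expected, timeout=600, poll=10,
                       sleep=time.sleep):
    """Block until every expected node reports UN (Up/Normal), or give up
    when the timeout elapses. Returns True on success, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        states = status_fn()  # e.g. {"10.0.1.11": "UN", ...}
        if all(states.get(node) == "UN" for node in expected):
            return True
        sleep(poll)  # stabilization wait between probes
    return False
```

A rolling operation would call this between every node, aborting the run instead of proceeding when it returns False.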

Scheduling and Automation

Operations can be scheduled to run automatically:

  • Recurring schedules: Repair cycles, routine maintenance
  • Maintenance windows: Restrict execution to off-peak hours
  • Dependency chains: Flush before snapshot, repair after topology change
  • Adaptive timing: Adjust execution speed based on cluster load
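A maintenance-window check is small enough to sketch directly. The window bounds below are hypothetical defaults, and the wrap-around branch handles windows that cross midnight:

```python
from datetime import time as t

def in_window(now, start=t(1, 0), end=t(5, 0)):
    """True if 'now' (a datetime.time) falls inside the off-peak window."""
    if start <= end:
        return start <= now <= end
    return now >= start or now <= end  # window wraps past midnight
```

A scheduler would pause between nodes whenever this returns False and resume in the next window.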

Progress Tracking and Audit

  • Real-time progress visibility across all nodes
  • Persistent state — if interrupted, operations resume from where they stopped
  • Complete audit log of every command executed, on which node, with output and exit status
  • Alerting on failures with configurable retry policies
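Persistent, resumable state can be sketched with a simple JSON checkpoint file. AxonOps keeps this state server-side; the file-based version below only illustrates the idea, and the state path is caller-supplied:

```python
import json
import os

def run_with_checkpoint(nodes, action, state_path):
    """Run 'action' on each node, persisting progress after every step so
    an interrupted run resumes where it stopped instead of starting over."""
    done = []
    if os.path.exists(state_path):
        with open(state_path) as f:
            done = json.load(f)
    for host in nodes:
        if host in done:
            continue  # completed in an earlier run; skip on resume
        action(host)
        done.append(host)
        with open(state_path, "w") as f:
            json.dump(done, f)  # durable record of what ran, in order
    return done
```

The same file doubles as a minimal audit trail: it records which nodes completed and in what order, even across interruptions.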

Supported Operations

| Operation | AxonOps Orchestration |
| --- | --- |
| Repair | Adaptive scheduling with gc_grace_seconds compliance |
| Rolling restart | Drain, restart, health-gate per node with rack awareness |
| Cleanup | Triggered automatically after topology changes |
| Compaction tuning | Apply throughput changes across all nodes |
| Schema changes | Verify schema agreement after each change |
| Upgrades | Coordinated rolling upgrade with version verification |