What is Repair¶
Repair is the process by which Apache Cassandra detects and resolves data inconsistencies between replicas. In a distributed system where writes can arrive at different replicas at different times — or not at all, due to network partitions or node failures — replicas can diverge. Repair systematically compares data across replicas and streams any differences to restore convergence.
Repair is also known as anti-entropy repair in distributed systems literature. The term "anti-entropy" borrows from thermodynamics and information theory, where entropy measures disorder; anti-entropy processes reduce that disorder by driving all replicas toward identical data.
For operational procedures on running and scheduling repair, see the Repair Operations Guide.
Why Repair Is Necessary¶
Cassandra provides two opportunistic synchronization mechanisms — hinted handoff and read reconciliation — that handle common cases of replica divergence. However, these mechanisms are insufficient in several scenarios:
| Scenario | Why Opportunistic Mechanisms Fail |
|---|---|
| Node down longer than hint window | Hints expire after max_hint_window (default 3 hours); writes during the remainder of the outage are never delivered to that replica |
| Data never read | Read reconciliation only affects data that is queried; cold data remains divergent indefinitely |
| Deleted data (tombstones) | Tombstones must propagate to all replicas before gc_grace_seconds expires, or deleted data can resurrect |
| Schema changes during outage | Hints do not handle DDL changes |
| Coordinator failure after hint storage | If the coordinator storing a hint fails permanently before delivery, the hint is lost |
Repair addresses all of these by performing a systematic comparison of data across replicas, independent of read or write activity.
Simplifying Repair Operations
Managing repair scheduling, concurrency tuning, and gc_grace_seconds compliance across large clusters is operationally demanding. AxonOps Adaptive Repair automates repair scheduling with load-aware throttling, automatic failure retry, and compliance monitoring — eliminating the manual coordination described in this page.
Repair Coordination Protocol¶
Repair is a coordinated, multi-phase process involving a coordinator node and all replica nodes that own the token ranges being repaired.
Phase 1: Session Setup¶
The coordinator node (where nodetool repair is executed) initiates a repair session:
- Determines which token ranges to repair based on the keyspace's replication strategy and the coordinator's token ownership
- Identifies all replica nodes for those token ranges
- Opens a repair session with each participating replica
- Assigns a unique session ID for tracking
Phase 2: Validation (Merkle Tree Construction)¶
Each participating replica independently builds a Merkle tree for the requested token ranges:
- A validation compaction reads all relevant SSTables for the token range
- Data is hashed at the leaf level — each leaf corresponds to a partition or range of partitions
- Parent hashes are computed by combining child hashes up to the root
- The completed Merkle tree is sent to the coordinator
Validation compaction appears in nodetool compactionstats as a "Validation" type compaction.
Phase 3: Comparison¶
The coordinator receives Merkle trees from all replicas and compares them pairwise:
- Root hashes are compared first — if they match, the replicas are consistent for that range and no further work is needed
- On mismatch, the coordinator recursively descends into child nodes to identify exactly which segments differ
- This tree-walking process requires O(log n) comparisons per differing segment, where n is the number of leaf segments, rather than comparing every record
Phase 4: Streaming¶
For each pair of replicas with identified differences:
- The coordinator instructs the replica with newer data to stream the differing segments to the replica with stale data
- Streaming uses Cassandra's built-in streaming protocol
- The receiving replica incorporates the streamed data
Phase 5: Cleanup¶
After streaming completes:
- In incremental repair, an anti-compaction step separates repaired and unrepaired data (see Anti-Compaction)
- The repair session is marked as complete
- Session status is recorded in `system_distributed.repair_history` (Cassandra 4.0+)
The Merkle Tree¶
Merkle trees, introduced by Ralph Merkle (Merkle, R., 1987, "A Digital Signature Based on a Conventional Encryption Function"), enable efficient comparison of large datasets by hierarchically hashing data segments.
Structure¶
A Merkle tree is a binary tree where:
- Leaf nodes contain hashes of individual data segments (partitions or partition ranges)
- Internal nodes contain hashes computed from their children
- The root hash represents the entire dataset
Efficiency¶
When two replicas have identical data, only the root hashes are compared — a single comparison confirms consistency. When differences exist, the tree structure narrows the search:
| Dataset size | Full comparison | Merkle tree comparison |
|---|---|---|
| 1 million partitions | 1,000,000 comparisons | ~20 comparisons (log₂) |
| 1 billion partitions | 1,000,000,000 comparisons | ~30 comparisons (log₂) |
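The pruning behavior behind these numbers can be sketched in a few lines. This is an illustrative toy, not Cassandra's implementation: it hashes each replica's partitions into a binary Merkle tree (assuming, for simplicity, a power-of-two number of segments), then descends from the roots, skipping any subtree whose hashes match.

```python
import hashlib

def merkle_tree(segments):
    """Build a Merkle tree bottom-up; returns a list of levels, leaves first."""
    level = [hashlib.sha256(s.encode()).digest() for s in segments]
    levels = [level]
    while len(level) > 1:
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
        levels.append(level)
    return levels  # levels[-1][0] is the root hash

def diff(a, b):
    """Descend from the roots; return indices of differing leaf segments."""
    differing, stack = [], [(len(a) - 1, 0)]  # (level index, node index)
    while stack:
        lvl, i = stack.pop()
        if a[lvl][i] == b[lvl][i]:
            continue             # subtrees identical: prune the whole branch
        if lvl == 0:
            differing.append(i)  # differing leaf segment: must be streamed
        else:
            stack += [(lvl - 1, 2 * i), (lvl - 1, 2 * i + 1)]
    return sorted(differing)

replica_a = [f"partition-{i}" for i in range(8)]
replica_b = list(replica_a)
replica_b[5] = "partition-5-stale"   # one divergent segment

print(diff(merkle_tree(replica_a), merkle_tree(replica_b)))  # [5]
```

With identical replicas the descent stops at the roots after a single comparison; with one divergent segment it visits only the path down to that leaf.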
Tree Depth and Memory¶
The Merkle tree depth determines both precision and memory consumption:
| Parameter | Description |
|---|---|
| `cassandra.repair_session_max_tree_depth` | Maximum tree depth (JVM property) |
Deeper trees provide finer-grained difference detection (streaming smaller segments), but consume more memory. Shallower trees use less memory but may stream more data than strictly necessary when differences are found.
In Cassandra 4.0+, the default tree depth is dynamically calculated based on the data size of the token range being repaired, with a configurable maximum.
Full Repair vs Incremental Repair¶
Full Repair¶
Full repair compares all data on each replica for the requested token ranges, regardless of whether it has been previously repaired.
- Builds Merkle trees over the entire dataset
- Streams all differing data
- Does not modify SSTable metadata
- Required after topology changes, node replacements, or when incremental repair state is inconsistent
Incremental Repair¶
Incremental repair, introduced in Cassandra 2.1 and significantly improved in Cassandra 4.0, compares only data that has been written since the last successful repair.
How It Works¶
Cassandra tracks whether each SSTable has been repaired:
- Unrepaired SSTables contain data that has not yet been validated across replicas
- When incremental repair runs, only unrepaired SSTables participate in Merkle tree construction
- After successful repair, participating SSTables are marked as repaired via anti-compaction
- Subsequent incremental repairs skip already-repaired SSTables
This approach reduces repair overhead because only new data needs validation.
Repaired vs Unrepaired SSTables¶
| State | Meaning | Participates in Incremental Repair |
|---|---|---|
| Unrepaired | Data not yet validated across replicas | Yes |
| Repaired | Data confirmed consistent across replicas | No |
| Pending | Data currently being repaired (4.0+) | No |
The pending state was introduced in Cassandra 4.0 to address a race condition in earlier versions where data written during repair could be incorrectly marked as repaired.
```shell
# Check percent repaired for a table
nodetool tablestats my_keyspace.my_table | grep "Percent repaired"
```
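The state tracking described above can be summarized with a small sketch. This is not Cassandra code: it models the unrepaired/pending/repaired lifecycle as a toy state machine, showing that an incremental repair session selects only unrepaired SSTables and promotes them only on success.

```python
from enum import Enum

class RepairState(Enum):
    UNREPAIRED = "unrepaired"
    PENDING = "pending"      # introduced in 4.0: in-flight repair session
    REPAIRED = "repaired"

class SSTable:
    def __init__(self, name):
        self.name, self.state = name, RepairState.UNREPAIRED

def start_incremental_repair(sstables):
    """Select only unrepaired SSTables and mark them pending for the session."""
    session = [s for s in sstables if s.state is RepairState.UNREPAIRED]
    for s in session:
        s.state = RepairState.PENDING
    return session

def finish_repair(session, success):
    """Promote to repaired on success; revert to unrepaired on failure."""
    for s in session:
        s.state = RepairState.REPAIRED if success else RepairState.UNREPAIRED

tables = [SSTable("a"), SSTable("b")]
tables[0].state = RepairState.REPAIRED       # validated by a previous run
session = start_incremental_repair(tables)
print([s.name for s in session])             # ['b'] -- repaired SSTable skipped
finish_repair(session, success=True)
```

The pending state is what closes the pre-4.0 race: data written during the session lands in new unrepaired SSTables rather than being swept into the repaired set.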
Incremental Repair Improvements in 4.0¶
Prior to Cassandra 4.0, incremental repair had several known issues:
- Race condition: Data written during repair could be incorrectly marked as repaired without validation
- Anti-compaction overhead: Every repair triggered anti-compaction on all SSTables
- Inconsistent state: Failed repairs could leave SSTables in an inconsistent repaired/unrepaired state
Cassandra 4.0 addressed these with:
- Pending state: SSTables are marked as "pending" during repair and only promoted to "repaired" after successful completion
- Transient replication awareness: Repair correctly handles transient replicas (if configured)
- Preview repair: Allows checking for inconsistencies without actually repairing
In Cassandra 4.0+, incremental repair is the default mode. Full repair requires the explicit -full flag.
Anti-Compaction¶
Anti-compaction is a process that runs after incremental repair to separate repaired data from unrepaired data within the same SSTable.
Why Anti-Compaction Is Needed¶
An SSTable may contain a mix of:
- Data that was included in the repair session (now confirmed consistent)
- Data that was written after the repair session started (not yet validated)
Anti-compaction splits these SSTables so that repaired and unrepaired data reside in separate SSTables. This allows future incremental repairs to skip the repaired SSTables entirely.
Process¶
- For each SSTable that participated in repair, determine which token ranges were repaired
- Split the SSTable: data in repaired ranges goes to a new SSTable marked as repaired; data outside repaired ranges goes to a new SSTable marked as unrepaired
- Remove the original SSTable
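The split in these steps can be illustrated with a minimal sketch, assuming a simplified model where an SSTable is a mapping from token to row and repaired ranges are half-open `(lo, hi]` intervals:

```python
def anti_compact(sstable_rows, repaired_ranges):
    """Split one SSTable's rows into a repaired and an unrepaired SSTable,
    based on whether each row's token falls in a repaired range."""
    def in_repaired(token):
        return any(lo < token <= hi for lo, hi in repaired_ranges)
    repaired = {t: row for t, row in sstable_rows.items() if in_repaired(t)}
    unrepaired = {t: row for t, row in sstable_rows.items() if not in_repaired(t)}
    return repaired, unrepaired  # the original SSTable is then removed

rows = {10: "a", 120: "b", 300: "c"}
rep, unrep = anti_compact(rows, repaired_ranges=[(0, 200)])
print(rep, unrep)   # {10: 'a', 120: 'b'} {300: 'c'}
```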
Anti-compaction appears in nodetool compactionstats as an "AntiCompaction" type.
Performance Implications¶
Anti-compaction adds I/O overhead after every incremental repair:
- Each participating SSTable must be read and rewritten
- Temporary disk space is needed for the split SSTables
- This overhead is proportional to the amount of unrepaired data
Primary Range Repair¶
By default, nodetool repair repairs all token ranges that a node holds replicas for — including ranges where the node is a secondary or tertiary replica. This means the same token range may be repaired multiple times if repair is run on every node in the cluster.
Primary range repair (-pr flag) restricts repair to only the token ranges for which the node is the primary (first) replica. When -pr is used on every node in the cluster, each token range is repaired exactly once.
| Mode | Token ranges repaired | Redundancy |
|---|---|---|
| Default (no `-pr`) | All ranges the node holds replicas for | Significant — same range repaired from multiple nodes |
| Primary range (`-pr`) | Only ranges where node is primary replica | None — each range repaired exactly once |
Recommendation
For routine maintenance, use -pr on every node. This achieves full coverage with minimal redundant work.
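To make "primary (first) replica" concrete, here is a toy sketch, assuming a single-token ring where each node's primary range is the interval from the previous token on the ring up to and including its own token:

```python
def primary_ranges(ring, node):
    """ring: list of (token, node) pairs. A node's primary range for each of
    its tokens is (previous token on the ring, its own token]."""
    tokens = sorted(t for t, _ in ring)
    owner = {t: n for t, n in ring}
    ranges = []
    for i, t in enumerate(tokens):
        if owner[t] == node:
            ranges.append((tokens[i - 1], t))  # i-1 wraps to the last token
    return ranges

ring = [(0, "n1"), (100, "n2"), (200, "n3")]
print(primary_ranges(ring, "n2"))  # [(0, 100)]
print(primary_ranges(ring, "n1"))  # [(200, 0)] -- the wrapping range
```

With replication factor 3 in this ring, n2 also replicates the ranges owned by n1 and n3, but `-pr` restricts its repair to `(0, 100]` alone; running `-pr` on all three nodes covers the ring exactly once.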
The gc_grace_seconds Constraint¶
Tombstones (deletion markers) have a finite lifespan defined by gc_grace_seconds. This creates a critical constraint on repair frequency.
The Data Resurrection Problem¶
```
gc_grace_seconds defines tombstone retention period.
Default: 864000 (10 days)
Note: In Cassandra 4.1+, this can also be specified as a duration (e.g., '10d').

CRITICAL CONSTRAINT:
Repair must complete on every node within gc_grace_seconds.

Failure scenario:
Day 0:  Application deletes row on N1, N2 (tombstone created)
        N3 is unavailable, does not receive tombstone
Day 11: Tombstones expire on N1, N2 (gc_grace = 10 days)
        Compaction purges tombstones
Day 12: N3 returns to service
        N3 retains the "deleted" row (no tombstone received)
        Read reconciliation propagates N3's data to N1, N2
        DELETED DATA REAPPEARS

Prevention: Complete repair cycle within gc_grace_seconds
```
Data Resurrection
Failure to complete repair within gc_grace_seconds can cause deleted data to reappear. This is one of the most common and serious operational issues in Cassandra deployments.
Why gc_grace_seconds Exists¶
The gc_grace_seconds parameter balances two competing concerns:
- Tombstone propagation: Tombstones must remain on disk long enough for repair to propagate them to all replicas
- Disk space reclamation: Tombstones consume disk space and degrade read performance; they should be purged once they are no longer needed
Setting gc_grace_seconds too low risks data resurrection. Setting it too high wastes disk space and increases read latency due to accumulated tombstones.
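The scheduling arithmetic can be made explicit with a small helper. This is a conservative rule of thumb, not an official formula; the 50% safety margin is an assumption chosen to leave room for retries and node outages:

```python
def max_repair_interval(gc_grace_seconds, longest_repair_seconds,
                        safety_margin=0.5):
    """Largest safe gap between repair cycles: the interval plus the cycle
    duration must fit well inside gc_grace_seconds.
    safety_margin is an assumed buffer for retries and outages."""
    budget = gc_grace_seconds * safety_margin
    return max(0, budget - longest_repair_seconds)

gc_grace = 10 * 24 * 3600        # default: 864000 seconds (10 days)
repair_duration = 24 * 3600      # assume a full cycle takes ~1 day
interval = max_repair_interval(gc_grace, repair_duration)
print(interval / 86400)          # 4.0 -- start a new cycle at least every 4 days
```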
Repair Concurrency¶
Repair involves multiple levels of concurrency. Understanding these levels is essential for predicting cluster impact and tuning repair performance.
Repair Sessions, Commands, and Jobs¶
Repair work is organized in a hierarchy:
| Concept | Scope | Description |
|---|---|---|
| Repair command | One `nodetool repair` invocation | Coordinates repair for all token ranges and tables in scope |
| Repair session | One token range (one set of replicas) | A session validates and synchronizes data across replicas for a specific token range |
| Repair job | One table within a session | The unit of Merkle tree validation and streaming |
A single `nodetool repair` command may create many repair sessions, one per token range, each containing one repair job per table. For example, repairing a keyspace with 5 tables on a node with 256 vnodes could create up to 1,280 (token range, table) jobs.
Parallelism Modes¶
The parallelism mode controls how repair sessions for different token ranges execute relative to each other. This is distinct from the number of concurrent sessions or jobs.
Sequential (-seq)¶
```
Token Range 1: [Validate] [Compare] [Stream] ──>
Token Range 2:                                  [Validate] [Compare] [Stream] ──>
Token Range 3:                                                                   [Validate]...
```
Each token range is repaired one at a time. Within a range, all replicas participate, but only one range is active at a time across the repair coordinator.
- Only one set of replicas is performing validation compaction at any given time
- Lowest cluster-wide I/O impact
- Longest total duration
- Suitable for clusters with limited I/O headroom or shared storage
Datacenter-Parallel (-dcpar)¶
Token ranges are repaired sequentially within each datacenter, but different datacenters can proceed in parallel.
- Replicas in different datacenters validate and stream concurrently
- Replicas in the same datacenter process ranges sequentially
- Good for multi-datacenter deployments where cross-DC bandwidth is the bottleneck
Parallel (default in 4.0+)¶
```
Token Range 1: [Validate] [Compare] [Stream] ──>
Token Range 2: [Validate] [Compare] [Stream] ──>
Token Range 3: [Validate] [Compare] [Stream] ──>
               ↑ All ranges processed concurrently
```
Multiple token ranges are repaired simultaneously. All replica pairs can validate and stream concurrently.
- Highest throughput — repairs complete fastest
- Highest cluster impact — multiple validation compactions and streams run at once
- Default mode in Cassandra 4.0+
Concurrent Validation Sessions¶
Within a single parallelism mode, the number of validation compactions that can run concurrently on each node is controlled by:
```yaml
# cassandra.yaml (4.0+)
concurrent_validations: -1  # Default: -1 (auto: uses concurrent_compactors value)
```
| Value | Behavior |
|---|---|
| `-1` (default) | Uses the value of `concurrent_compactors` |
| `0` | Unlimited — all requested validations run immediately |
| `> 0` | Explicit limit on concurrent validation compactions |
Validation compaction is CPU and I/O intensive (it reads all data in the token range and builds a Merkle tree). Limiting concurrent validations prevents repair from consuming all compaction capacity and starving normal compaction.
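The effect of such a limit can be demonstrated with a toy concurrency sketch. This is an analogue, not Cassandra's scheduler: a semaphore caps how many simulated "validations" run at once, mirroring what `concurrent_validations` does for validation compactions on a node.

```python
import threading
import time

def run_validations(n_requested, limit):
    """Run n_requested validation tasks, at most `limit` at a time --
    a toy analogue of the concurrent_validations setting."""
    gate = threading.Semaphore(limit)
    peak, active, lock = [0], [0], threading.Lock()

    def validate(i):
        with gate:                  # blocks while `limit` validations run
            with lock:
                active[0] += 1
                peak[0] = max(peak[0], active[0])
            time.sleep(0.01)        # stand-in for hashing the token range
            with lock:
                active[0] -= 1

    threads = [threading.Thread(target=validate, args=(i,))
               for i in range(n_requested)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return peak[0]                  # highest observed concurrency

print(run_validations(8, limit=2) <= 2)  # True -- never more than 2 at once
```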
Table Concurrency (-j)¶
By default, repair processes tables within a keyspace one at a time. The -j flag controls how many tables are repaired concurrently:
```shell
# Repair 4 tables concurrently
nodetool repair -j 4 my_keyspace
```
| `-j` value | Behavior |
|---|---|
| `0` | All tables repaired concurrently |
| `1` (default) | Tables repaired sequentially |
| `n` | Up to `n` tables repaired concurrently |
Higher values reduce total repair time but increase resource consumption proportionally.
Repair Command Pool¶
The repair_command_pool_size controls how many repair command threads are available on each node:
```
-Dcassandra.repair_command_pool_size=<n>
```
This limits the total number of active repair tasks (validation + streaming) a node can participate in simultaneously, regardless of how many remote coordinators are requesting repair.
How Concurrency Levels Interact¶
In summary:
- Parallelism mode (`-seq`, `-dcpar`, parallel) controls concurrency across token ranges
- The `-j` flag controls concurrency across tables within each token range
- `concurrent_validations` controls concurrency of Merkle tree builds on each node
- `repair_command_pool_size` controls the total repair thread pool on each node
Version History¶
| Version | Change |
|---|---|
| Pre-2.0 | Full repair only; sequential execution |
| 2.1 | Incremental repair introduced |
| 2.2 | Subrange repair (-st, -et) added |
| 4.0 | Incremental repair becomes default; preview repair added; pending SSTable state introduced; transient replication support (CASSANDRA-9143) |
| 4.1 | gc_grace_seconds accepts duration format (e.g., '10d') |
Resource Impact¶
| Resource | Cause | Mitigation |
|---|---|---|
| CPU | Merkle tree computation (hashing all data in range) | Schedule during low-traffic periods |
| Disk I/O | Validation compaction reads SSTables; anti-compaction rewrites them | Throttle with -Dcassandra.repair_command_pool_size |
| Network | Streaming divergent data between replicas | Throttle with nodetool setstreamthroughput |
| Memory | Merkle tree storage (proportional to tree depth) | Reduce with -Dcassandra.repair_session_max_tree_depth |
| Disk space | Anti-compaction temporarily requires space for split SSTables | Ensure adequate free space before repair |
Best Practices¶
| Practice | Rationale |
|---|---|
| Complete repair cycle within `gc_grace_seconds` | Prevents data resurrection |
| Use incremental repair for routine maintenance | Lower resource consumption than full repair |
| Use full repair after topology changes | Node additions, removals, or replacements may leave ranges inconsistent |
| Use `-pr` on every node for routine repair | Ensures each token range is repaired exactly once |
| Automate with AxonOps or Reaper | Eliminates human error and ensures schedule compliance |
| Monitor percent repaired metric | Detects tables where repair is falling behind |
| Monitor repair duration trends | Increasing duration indicates growing data volume or resource constraints |
Related Documentation¶
- Replica Synchronization — Hinted handoff, read reconciliation, and how repair fits in the broader synchronization model
- Repair Operations Guide — Operational procedures for running and managing repair
- Consistency — How consistency levels interact with convergence
- Tombstones — Deletion markers and gc_grace_seconds
- nodetool repair — Command reference
- nodetool repair_admin — Repair session management