
Repair Failures

Repairs synchronize data across replicas to keep every copy consistent. Failed or skipped repairs leave replicas divergent, which surfaces as inconsistent reads and, if repairs do not complete within gc_grace_seconds, can resurrect deleted data.


Symptoms

  • nodetool repair exits with errors
  • Repairs hang indefinitely
  • "Repair session failed" in logs
  • Incremental repair streams failing
  • Long-running repairs that never complete
  • OOM during repair

Diagnosis

Step 1: Check Active Repairs

nodetool repair_admin list   # Cassandra 4.0+

Step 2: Check Repair History

# View recent repairs
cqlsh -e "SELECT * FROM system_distributed.repair_history LIMIT 20;"

# Check parent repair sessions
cqlsh -e "SELECT * FROM system_distributed.parent_repair_history LIMIT 10;"

Step 3: Check Logs for Errors

grep -i "repair\|streaming\|merkle" /var/log/cassandra/system.log | tail -100
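On systemd-managed installs the relevant messages may only be in the journal; a hedged variant, assuming the service unit is named cassandra:

journalctl -u cassandra --since "1 hour ago" | grep -iE "repair|streaming|merkle"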

Common error patterns:

  • Repair session failed
  • Sync failed between
  • Streaming error
  • OutOfMemoryError during repair

Step 4: Check Resource Usage

# During repair
top -p $(pgrep -f CassandraDaemon)
iostat -x 1 5
df -h /var/lib/cassandra
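Merkle tree construction is heap-intensive, so it helps to sample heap usage while the repair runs. A minimal polling loop built on nodetool info, which reports a "Heap Memory (MB)" line:

# Sample heap usage every 30 seconds during the repair
while true; do
    date
    nodetool info | grep "Heap Memory"
    sleep 30
done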

Step 5: Check Stream Throughput

nodetool netstats
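To follow streaming progress continuously instead of taking one snapshot:

# Refresh streaming status every 10 seconds
watch -n 10 nodetool netstats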

Resolution

Case 1: Repair Session Stuck

Cancel and restart:

# List active repairs
nodetool repair_admin list

# Cancel the stuck session (id from repair_admin list)
nodetool repair_admin cancel -s <repair-id>

# Force the cancellation if the session will not finalize cleanly
nodetool repair_admin cancel -s <repair-id> --force

# Restart with smaller scope
nodetool repair -pr my_keyspace my_table

Case 2: OOM During Repair

Reduce repair scope:

# Repair one table at a time
nodetool repair -pr my_keyspace table1
nodetool repair -pr my_keyspace table2

# Use subrange repair for large tables; -pr cannot be combined with
# -st/-et, and subrange repairs are normally run as full repairs
nodetool repair -full -st <start_token> -et <end_token> my_keyspace
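If you do not have token boundaries handy, one option is to split the full token space into equal slices. A minimal sketch, assuming the default Murmur3Partitioner (tokens span -2^63 to 2^63-1); the keyspace name and chunk count are placeholders:

#!/bin/bash
# Repair a keyspace as N equal Murmur3 token subranges.
N=16
MIN=$(( -9223372036854775807 - 1 ))   # -2^63
MAX=9223372036854775807               # 2^63 - 1
STEP=$(( MAX / N - MIN / N ))         # split this way to avoid 64-bit overflow
start=$MIN
for i in $(seq 1 "$N"); do
    if [ "$i" -eq "$N" ]; then end=$MAX; else end=$(( start + STEP )); fi
    nodetool repair -full -st "$start" -et "$end" my_keyspace
    start=$end
done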

Adjust memory settings:

# Reduce merkle tree depth (default 18; each level doubles the leaf
# count, so depth 16 uses a quarter of the memory of depth 18)
# In cassandra.yaml
repair_session_max_tree_depth: 16

Case 3: Streaming Failures

Check network:

# Verify streaming ports
nc -zv <peer-node> 7000

# Check streaming throughput limit
nodetool getstreamthroughput

Increase timeouts:

# cassandra.yaml (streaming_socket_timeout_in_ms is 3.x only; it was
# removed in 4.0, which relies on keep-alives instead)
streaming_socket_timeout_in_ms: 86400000   # 24 hours
streaming_keep_alive_period_in_secs: 300   # the default; raise on flaky networks

Case 4: Repair Taking Too Long

Use parallel job threads:

# Parallel is the default repair mode since Cassandra 2.2; -j runs
# up to four table repairs concurrently
nodetool repair -pr -j 4 my_keyspace

Increase stream throughput:

# Check current setting
nodetool getstreamthroughput

# Increase if the network allows (value is megabits per second;
# the default is 200)
nodetool setstreamthroughput 400

Schedule repairs by token range:

#!/bin/bash
# Repair in smaller chunks by walking token ranges from describering.
# Output lines look like: TokenRange(start_token:..., end_token:..., ...)
# head -10 keeps the example short; drop it to cover every range.
nodetool describering my_keyspace | grep TokenRange | head -10 |
while read -r range; do
    start=$(echo "$range" | sed 's/.*start_token:\([^,]*\),.*/\1/')
    end=$(echo "$range" | sed 's/.*end_token:\([^,]*\),.*/\1/')
    nodetool repair -full -st "$start" -et "$end" my_keyspace
done

Case 5: Incremental Repair Issues

Switch to full repair:

# Full repair instead of incremental
nodetool repair -full -pr my_keyspace
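Before abandoning incremental repair entirely, check how much data is already marked repaired; recent versions report this in nodetool tablestats:

# "Percent repaired" shows the fraction of SSTable data marked repaired
nodetool tablestats my_keyspace.my_table | grep -i "percent repaired"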

Reset repair state:

# Cancel any active sessions, stop the node, then mark SSTables as
# unrepaired; sstablerepairedset runs offline against Data.db files
nodetool repair_admin cancel -s <session-id> --force
sstablerepairedset --really-set --is-unrepaired /var/lib/cassandra/data/my_keyspace/my_table-*/*-Data.db

Case 6: Schema Disagreement Blocking Repair

# Check schema
nodetool describecluster

# Fix schema disagreement first (see schema-disagreement.md)
nodetool reloadlocalschema

# Then retry repair
nodetool repair -pr my_keyspace

Recovery

Verify Repair Completion

# Check repair history (the partition key needs both keyspace and table)
cqlsh -e "SELECT * FROM system_distributed.repair_history WHERE keyspace_name = 'my_keyspace' AND columnfamily_name = 'my_table' LIMIT 5;"

# Verify no pending repairs
nodetool repair_admin list
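The parent session row records which token ranges succeeded, so comparing the requested and successful sets shows whether a run actually covered everything (column names per recent Cassandra versions):

cqlsh -e "SELECT parent_id, keyspace_name, requested_ranges, successful_ranges, finished_at FROM system_distributed.parent_repair_history LIMIT 5;"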

Verify Data Consistency

# Force read repair on critical data by reading at CONSISTENCY ALL
cqlsh -e "CONSISTENCY ALL; SELECT * FROM my_keyspace.my_table WHERE ... ;"

Repair Best Practices

Scheduling

Cluster Size   Repair Frequency   Strategy
< 10 nodes     Weekly             Full cluster repair
10-50 nodes    Weekly per node    Rolling repair
> 50 nodes     Daily sub-ranges   Token range repair
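For the small-cluster case, a plain cron entry is often enough. A minimal sketch; the schedule, user, keyspace, and log path are placeholders, and each node should use a different day or hour so repairs do not overlap:

# /etc/cron.d/cassandra-repair - Sundays at 02:00 on this node
0 2 * * 0 cassandra /usr/bin/nodetool repair -pr my_keyspace >> /var/log/cassandra/repair-cron.log 2>&1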

Command Options

# Primary range only (most common)
nodetool repair -pr my_keyspace

# Full repair (vs incremental)
nodetool repair -full -pr my_keyspace

# Specific tables
nodetool repair -pr my_keyspace table1 table2

# Parallel job threads (parallel is already the default mode; max 4)
nodetool repair -pr -j 4 my_keyspace

# Local datacenter only
nodetool repair -pr -local my_keyspace

Resource Management

# cassandra.yaml - repair settings
repair_session_max_tree_depth: 18
repair_session_space_in_mb: 256

# Throttles that bound repair's compaction and streaming impact
compaction_throughput_mb_per_sec: 64
stream_throughput_outbound_megabits_per_sec: 200
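Both throttles can also be adjusted at runtime, without a restart, to deprioritize an in-flight repair on a busy node:

# Runtime equivalents of the yaml throttles above
nodetool setcompactionthroughput 64
nodetool setstreamthroughput 200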

Prevention

  1. Schedule regular repairs - Complete a full pass before gc_grace_seconds (default 10 days) expires, or deleted data can resurrect
  2. Monitor repair duration - Alert if repairs run longer than 24 hours (see the sketch below)
  3. Size partitions appropriately - Very large partitions inflate merkle trees and cause OOM during repair
  4. Maintain cluster health - A repair session fails if any replica for its range is down
  5. Use repair tools - Consider Reaper for automated repair scheduling
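A minimal duration check for item 2, assuming Cassandra 4.0+ where repair_admin list prints "no sessions" when idle; the mail alert is a placeholder:

#!/bin/bash
# Run from cron after the repair window should have closed;
# alert if a repair session is still active on this node.
if ! nodetool repair_admin list | grep -q "no sessions"; then
    echo "repair still running on $(hostname)" | mail -s "repair duration alert" oncall@example.com
fi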

Quick Reference

Command                        Purpose
nodetool repair                Run repair
nodetool repair_admin list     List active repairs
nodetool repair_admin cancel   Cancel a repair session
nodetool netstats              Check streaming status
nodetool setstreamthroughput   Adjust the streaming throughput cap