Adaptive Repair Service¶

Repairs must run regularly to keep data consistent across Cassandra nodes.

AxonOps provides two mechanisms to ease management of repairs in Cassandra:

Adaptive Repair Service
Scheduled Repairs

Adaptive Repair Service¶

Because AxonOps already collects performance metrics and logs, it includes an adaptive repair system that regulates repair velocity (parallelism and the pauses between subrange repairs) based on performance trends. The regulator takes input from various metrics, including:

CPU utilization
Query latencies
Pending task statistics for Cassandra thread pools
I/O wait percentage
Tracking of the repair schedule based on gc_grace_seconds for each table

The goal is to achieve the following:

Completion of repair within gc_grace_seconds of each table.
The repair process does not affect query performance.

In essence, the adaptive repair regulator slows repair velocity when it detects an increase in load, and speeds it up to catch up with the repair schedule when resources are more readily available.

This mechanism does not require JMX access. The adaptive repair service running on the AxonOps server orchestrates repairs and issues commands to the agents over the existing connection.
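
The regulation idea described above can be illustrated with a minimal sketch. This is not the AxonOps implementation: the class names, thresholds, and step sizes below are all hypothetical, chosen only to show how a regulator might back off under load and catch up when it falls behind the gc_grace_seconds window.

```python
from dataclasses import dataclass


@dataclass
class LoadSample:
    """Hypothetical snapshot of the kinds of metrics listed above."""
    cpu_utilization: float   # fraction, 0.0 to 1.0
    io_wait: float           # fraction, 0.0 to 1.0
    query_latency_ms: float  # for example, a high-percentile latency
    pending_tasks: int       # pending tasks across thread pools


@dataclass
class Velocity:
    parallelism: int         # subrange repairs running at once
    pause_seconds: float     # pause between subrange repairs


def under_pressure(sample: LoadSample) -> bool:
    # Placeholder thresholds for illustration, not AxonOps defaults.
    return (sample.cpu_utilization > 0.75
            or sample.io_wait > 0.20
            or sample.query_latency_ms > 50.0
            or sample.pending_tasks > 100)


def regulate(current: Velocity, sample: LoadSample,
             progress: float, elapsed: float,
             max_parallelism: int = 8) -> Velocity:
    """Return the next repair velocity.

    progress: fraction of the repair completed so far (0.0 to 1.0)
    elapsed:  fraction of the gc_grace_seconds window already used
    """
    if under_pressure(sample):
        # Load is rising: reduce parallelism and lengthen the pause.
        return Velocity(max(1, current.parallelism - 1),
                        min(current.pause_seconds * 2 + 1.0, 300.0))
    if progress < elapsed:
        # Behind schedule with spare capacity: speed up to catch up.
        return Velocity(min(max_parallelism, current.parallelism + 1),
                        current.pause_seconds / 2)
    # On schedule and not under pressure: keep the current velocity.
    return current
```

In this sketch, protecting query performance takes priority: a node under pressure always slows the repair, even when the repair is behind schedule.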

From a user's point of view, there is a single switch to enable this service. Keep it enabled and AxonOps will take care of repairs for all tables.

You can, however, customize the following:

Exclude tables: Skip specific tables from automatic repair.
Parallel processing: Set how many tables to repair simultaneously.
Segment size: Split each table into segments of up to this size and repair each segment in turn.
GC grace threshold: Tables with a gc_grace_seconds value lower than this threshold are ignored by the adaptive repair service.
Max total segments per table: Maximum number of segments to split each table into for repair (range: 1 to 1,000,000). Larger tables are divided into more segments to restrict the repair time of each segment.
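
To make the interaction between these settings concrete, here is a small sketch of how they could combine. The function names and the megabyte units are assumptions made for illustration. How AxonOps resolves the case where the segment size and the segment cap conflict is not stated here, so the sketch simply lets the cap win.

```python
import math


def eligible_tables(tables, excluded, gc_grace_threshold):
    """Return the names of tables that would be repaired.

    tables: mapping of table name to its gc_grace_seconds value
    excluded: set of table names to skip
    gc_grace_threshold: tables below this value are ignored
    """
    return [name for name, gc_grace in tables.items()
            if name not in excluded and gc_grace >= gc_grace_threshold]


def segment_count(table_size_mb, segment_size_mb, max_segments):
    """Segments for one table: derived from size, capped at max_segments."""
    if not 1 <= max_segments <= 1_000_000:
        raise ValueError("max_segments must be between 1 and 1,000,000")
    wanted = math.ceil(table_size_mb / segment_size_mb)
    # Always repair at least one segment; never exceed the cap.
    return max(1, min(wanted, max_segments))
```

For example, under these assumptions a 10,000 MB table with a 256 MB segment size needs 40 segments, while a cap of 20 would limit it to 20 larger segments.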

Increasing Data Consistency¶

To keep tables as up-to-date as possible, we recommend both:

Increasing the Concurrent Repair Processes to be greater than the total number of tables in the cluster.
Reducing the Target Segment Size so that each repair request covers a smaller range of data.

Scheduled Repairs¶

AxonOps supports two types of scheduled repairs: immediate repairs and cron scheduled repairs.

The screenshot above shows a running repair that was started immediately and a repair scheduled for 12:00 AM UTC.

Immediate Repairs¶

These run once, starting immediately.

Cron Scheduled Repairs¶

These run repeatedly, according to the selected schedule.

Alert After¶

You can set a maximum duration for scheduled and manual repairs using the Alert After setting. If a repair exceeds this duration, AxonOps sends a warning alert to notify you that the repair is taking longer than expected.

This is useful for:

Detecting repairs that are stuck or progressing slower than expected.
Ensuring scheduled repairs complete within a maintenance window.
Proactively identifying performance issues that may be affecting repair speed.

When the maximum duration is reached:

A warning severity alert is raised with the message: "Scheduled repair {name} has reached its maximum duration of {duration}".
The alert is sent to any configured notification integrations (Slack, PagerDuty, email, etc.) using the repair alert routing.
An event is logged in the repair history.
The repair continues running: the alert is informational and does not stop the repair.

The alert is sent once per repair run. When the repair plan resets (for example, on the next scheduled run), the alert state is cleared.
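
The once-per-run behaviour described above can be modelled as a small piece of state. The sketch below is illustrative only: the class and method names are hypothetical, and the duration is formatted in seconds as an assumption, since the exact format of {duration} is not specified here.

```python
from dataclasses import dataclass, field


@dataclass
class RepairRun:
    name: str
    max_duration_s: float   # the Alert After value, in seconds (assumed unit)
    alerted: bool = False   # whether this run has already raised its alert
    alerts: list = field(default_factory=list)

    def check(self, elapsed_s: float) -> bool:
        """Raise at most one warning per run. Never stops the repair."""
        if elapsed_s >= self.max_duration_s and not self.alerted:
            self.alerted = True
            self.alerts.append(
                f"Scheduled repair {self.name} has reached its "
                f"maximum duration of {self.max_duration_s:.0f}s")
            return True
        return False

    def reset(self) -> None:
        """Clear the alert state, for example on the next scheduled run."""
        self.alerted = False
```

Calling check repeatedly after the limit has passed returns True only the first time; after reset, the next run can alert again.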

Note

The Alert After maximum duration setting applies to scheduled and manual repairs only. Adaptive repairs manage their own pacing and do not use this setting.