Fault Tolerance

Kafka's fault tolerance mechanisms ensure data durability and service availability during failures.


Broker Failure Scenarios

Single Broker Failure

| Phase | Action |
|-------|--------|
| Detection | Controller detects broker offline (session timeout) |
| Election | Controller elects new leader from ISR |
| Recovery | Returning broker catches up as follower |

Multiple Broker Failures

| Scenario | Impact | Mitigation |
|----------|--------|------------|
| ISR drops below min.insync.replicas | acks=all producers fail with NOT_ENOUGH_REPLICAS | Set RF and min.insync.replicas for your failure budget (see the producer sketch below) |
| ISR empty | Partition unavailable until a replica catches up | Monitor ISR shrinkage |
| Unclean election | Potential data loss | Disable unclean election |
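
How the first scenario surfaces in a client: the sketch below (the `events` topic and bootstrap address are placeholders) configures a producer for durability and inspects the send result. NotEnoughReplicasException is retriable, so with the default retries the producer retries internally and typically surfaces the failure as a TimeoutException once delivery.timeout.ms expires.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.errors.NotEnoughReplicasException;
import org.apache.kafka.common.serialization.StringSerializer;
import java.util.Properties;

public class DurableProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");
        props.put(ProducerConfig.ACKS_CONFIG, "all");              // wait for all ISR members
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("events", "key", "value"),
                (metadata, exception) -> {
                    if (exception instanceof NotEnoughReplicasException) {
                        // ISR is below min.insync.replicas; retriable, so this usually
                        // appears only once retries are exhausted
                        System.err.println("ISR too small: " + exception.getMessage());
                    } else if (exception != null) {
                        System.err.println("Send failed: " + exception);
                    } else {
                        System.out.printf("Committed to %s-%d @ offset %d%n",
                            metadata.topic(), metadata.partition(), metadata.offset());
                    }
                });
        }
    }
}
```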

Controller Failover

For complete Raft consensus mechanics, election protocols, and metadata recovery, see KRaft Deep Dive.

Failover Sequence

  1. Leader controller fails
  2. Remaining voters detect failure (heartbeat timeout)
  3. New election triggered
  4. Voter with most up-to-date log wins
  5. New leader resumes metadata operations (its identity can be confirmed with the quorum sketch below)
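
A failover can be observed from the outside by querying the quorum state. A minimal sketch using the AdminClient describeMetadataQuorum API, available since Kafka 3.3 (the bootstrap address is a placeholder):

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.QuorumInfo;
import java.util.Properties;

public class QuorumStatus {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");

        try (Admin admin = Admin.create(props)) {
            // Snapshot of the KRaft controller quorum
            QuorumInfo quorum = admin.describeMetadataQuorum().quorumInfo().get();
            System.out.println("Leader controller: " + quorum.leaderId());
            // A voter whose log-end offset lags the leader cannot win an election
            quorum.voters().forEach(v -> System.out.printf(
                "Voter %d: logEndOffset=%d%n", v.replicaId(), v.logEndOffset()));
        }
    }
}
```

The kafka-metadata-quorum.sh tool exposes the same state on the command line.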

Rack Awareness

Distribute replicas across failure domains to survive rack/zone failures. For complete topology design including network architecture and multi-datacenter layouts, see Topology.

Configuration

```properties
# Broker configuration
broker.rack=rack1

# Serve consumer fetches from the closest replica instead of the leader
replica.selector.class=org.apache.kafka.common.replica.RackAwareReplicaSelector
```
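
Setting broker.rack makes replica placement rack-aware at topic creation; RackAwareReplicaSelector additionally lets consumers read from the closest in-sync replica, provided each consumer reports its location via client.rack. A minimal consumer sketch (topic, group id, and address are placeholders):

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;
import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class RackLocalConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "events-consumer");
        // Must match the broker.rack value of the consumer's failure domain
        props.put(ConsumerConfig.CLIENT_RACK_CONFIG, "rack1");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("events"));
            // Fetches may now be served by a rack-local in-sync follower
            consumer.poll(Duration.ofSeconds(1))
                    .forEach(r -> System.out.println(r.value()));
        }
    }
}
```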

Data Durability Settings

For detailed ISR mechanics, acknowledgment levels, and min.insync.replicas behavior, see Replication.

Producer Configuration

```properties
# Maximum durability
acks=all
retries=2147483647
delivery.timeout.ms=120000
enable.idempotence=true

# Idempotence preserves ordering with up to 5 in-flight requests
max.in.flight.requests.per.connection=5
```

Broker Configuration

```properties
# Replication
default.replication.factor=3
min.insync.replicas=2
unclean.leader.election.enable=false

# Durability (optional overrides)
# log.flush.interval.messages=10000
# log.flush.interval.ms=1000
```

Durability note

Kafka relies on replication rather than forced disk flushes for durability. The flush settings above are effectively disabled by default (log.flush.interval.messages and the flush scheduler interval default to Long.MAX_VALUE); enabling them explicitly is not required for durability and reduces throughput.

Durability Rules

  • acks=all requires the ISR size to be ≥ min.insync.replicas; otherwise the write is rejected with NOT_ENOUGH_REPLICAS.
  • With acks=all, up to RF - min.insync.replicas brokers can fail while the partition stays writable; acknowledged data survives as long as at least one in-sync replica survives, which in the worst case means min.insync.replicas - 1 simultaneous permanent failures. The topic-creation sketch below pins these settings per topic.
  • acks=1 can acknowledge data that is not yet replicated; a leader failure can lose recent records.
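
A sketch of pinning these settings at topic creation with the AdminClient (the `orders` topic, partition count, and bootstrap address are illustrative):

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import java.util.List;
import java.util.Map;
import java.util.Properties;

public class CreateDurableTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");

        try (Admin admin = Admin.create(props)) {
            // RF=3 with min.insync.replicas=2: one broker can fail while the
            // partition stays writable and acknowledged data keeps two copies
            NewTopic orders = new NewTopic("orders", 6, (short) 3)
                    .configs(Map.of(
                        "min.insync.replicas", "2",
                        "unclean.leader.election.enable", "false"));
            admin.createTopics(List.of(orders)).all().get();
        }
    }
}
```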

Data Loss Conditions

| Condition | Result |
|-----------|--------|
| Unclean leader election enabled | Acknowledged data can be lost |
| acks=1 and leader fails before followers replicate | Acknowledged data can be lost |

Durability Table (Unclean Election Disabled)

| acks | min.insync.replicas | RF | Failures tolerated while writable | Worst-case failures without losing acknowledged data |
|------|---------------------|----|-----------------------------------|------------------------------------------------------|
| 1 | 1 | 3 | 2 | 0 (leader failure can lose recent records) |
| all | 1 | 3 | 2 | 0 (a write may be acknowledged by the leader alone) |
| all | 2 | 3 | 1 | 1 |
| all | 2 | 5 | 3 | 1 |

The write-availability budget is RF - min.insync.replicas; the worst-case durability budget is min.insync.replicas - 1, because an acknowledged write is only guaranteed on min.insync.replicas replicas once the ISR has shrunk to its minimum.

Failure Detection

Broker Detection

| Mechanism | Configuration | Default |
|-----------|---------------|---------|
| Session timeout | broker.session.timeout.ms | 9000 |
| Heartbeat interval | broker.heartbeat.interval.ms | 2000 |

Broker heartbeat constraints

The controller enforces broker.heartbeat.interval.ms ≤ broker.session.timeout.ms / 2. The session timeout is configured on controllers; the heartbeat interval is configured on brokers.

Client Detection

| Client | Setting | Default |
|--------|---------|---------|
| Producer | request.timeout.ms | 30000 |
| Consumer | session.timeout.ms | 45000 |
| Consumer | heartbeat.interval.ms | 3000 |
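
Shorter consumer timeouts detect failed consumers faster at the cost of more spurious rebalances. A sketch of tightened detection (the group id is a placeholder; values must stay within the broker's group.min/max.session.timeout.ms bounds):

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import java.util.Properties;

public class FastDetectionConfig {
    public static Properties consumerProps() {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "orders-processor");
        // Detect a dead consumer in ~15 s instead of the 45 s default
        props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, "15000");
        // Keep the heartbeat at no more than one third of the session timeout
        props.put(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG, "5000");
        return props;
    }
}
```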

Recovery Procedures

Partition Recovery

```bash
# Check offline partitions
kafka-topics.sh --bootstrap-server kafka:9092 \
  --describe --unavailable-partitions

# Check under-replicated partitions
kafka-topics.sh --bootstrap-server kafka:9092 \
  --describe --under-replicated-partitions

# Force leader election (if the preferred replica is in the ISR)
kafka-leader-election.sh --bootstrap-server kafka:9092 \
  --election-type preferred \
  --topic my-topic \
  --partition 0
```
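
The preferred election can also be triggered from remediation tooling. A sketch with the AdminClient electLeaders API (same topic and partition as the CLI example; the bootstrap address is a placeholder):

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.common.ElectionType;
import org.apache.kafka.common.TopicPartition;
import java.util.Properties;
import java.util.Set;

public class PreferredElection {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");

        try (Admin admin = Admin.create(props)) {
            // Re-elect the preferred leader for my-topic partition 0;
            // fails for a partition whose preferred replica is not in the ISR
            admin.electLeaders(ElectionType.PREFERRED,
                               Set.of(new TopicPartition("my-topic", 0)))
                 .partitions().get()
                 .forEach((tp, error) -> System.out.println(
                     tp + ": " + error.map(Throwable::getMessage).orElse("elected")));
        }
    }
}
```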

Unclean Recovery (Last Resort)

Data Loss Risk

Unclean leader election can result in data loss. Use only when availability is critical.

```bash
# Temporarily enable unclean election
kafka-configs.sh --bootstrap-server kafka:9092 \
  --entity-type topics \
  --entity-name my-topic \
  --alter \
  --add-config unclean.leader.election.enable=true

# Trigger election
kafka-leader-election.sh --bootstrap-server kafka:9092 \
  --election-type unclean \
  --topic my-topic \
  --partition 0

# Disable unclean election
kafka-configs.sh --bootstrap-server kafka:9092 \
  --entity-type topics \
  --entity-name my-topic \
  --alter \
  --delete-config unclean.leader.election.enable
```

Monitoring for Failures

Critical Alerts

| Metric | Condition | Severity |
|--------|-----------|----------|
| OfflinePartitionsCount | > 0 | Critical |
| UnderReplicatedPartitions | > 0 for 5 min | Warning |
| UnderMinIsrPartitionCount | > 0 | Critical |
| ActiveControllerCount | ≠ 1 | Critical |
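
These gauges are exposed over JMX. A minimal polling sketch, assuming JMX is enabled on the broker at port 9999 (the port and hostname are deployment-specific assumptions):

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class OfflinePartitionsProbe {
    public static void main(String[] args) throws Exception {
        // Assumes the broker was started with JMX remoting on port 9999
        JMXServiceURL url = new JMXServiceURL(
            "service:jmx:rmi:///jndi/rmi://kafka:9999/jmxrmi");

        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbsc = connector.getMBeanServerConnection();
            // Controller-side gauge; non-zero means a partition has no leader
            Object offline = mbsc.getAttribute(new ObjectName(
                "kafka.controller:type=KafkaController,name=OfflinePartitionsCount"),
                "Value");
            System.out.println("OfflinePartitionsCount=" + offline);
        }
    }
}
```

UnderReplicatedPartitions is a per-broker gauge under kafka.server:type=ReplicaManager and can be read the same way.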

Health Checks

```bash
#!/bin/bash
# health-check.sh
set -o pipefail

BOOTSTRAP="kafka:9092"

# Check for offline partitions
OFFLINE=$(kafka-topics.sh --bootstrap-server "$BOOTSTRAP" \
  --describe --unavailable-partitions 2>/dev/null | wc -l)

# Treat an unreachable cluster as critical rather than healthy
if [ $? -ne 0 ]; then
  echo "CRITICAL: cannot reach cluster at $BOOTSTRAP"
  exit 2
fi

if [ "$OFFLINE" -gt 0 ]; then
  echo "CRITICAL: $OFFLINE offline partitions"
  exit 2
fi

# Check for under-replicated partitions
UNDER_REP=$(kafka-topics.sh --bootstrap-server "$BOOTSTRAP" \
  --describe --under-replicated-partitions 2>/dev/null | wc -l)

if [ "$UNDER_REP" -gt 0 ]; then
  echo "WARNING: $UNDER_REP under-replicated partitions"
  exit 1
fi

echo "OK: Cluster healthy"
exit 0
```

Best Practices

| Practice | Rationale |
|----------|-----------|
| Use RF ≥ 3 | Survive multiple failures |
| Set min.insync.replicas = 2 | Ensure durability with acks=all |
| Disable unclean election | Prevent data loss |
| Enable rack awareness | Survive rack failures |
| Regular failover testing | Validate recovery procedures |
| Monitor ISR shrinkage | Detect issues early |