Fault Tolerance¶
Kafka's fault tolerance mechanisms ensure data durability and service availability during failures.
Fault Tolerance Layers¶
Broker Failure Scenarios¶
Single Broker Failure¶
| Phase | Action |
|---|---|
| Detection | Controller detects broker offline (session timeout) |
| Election | Controller elects new leader from ISR |
| Recovery | Returning broker catches up as follower |
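To watch this sequence in practice, describe the affected topic before and after the failure (my-topic is a placeholder):
# Show current leader and ISR per partition
kafka-topics.sh --bootstrap-server kafka:9092 \
  --describe --topic my-topic
# The returning broker's partitions appear here until it rejoins the ISR
kafka-topics.sh --bootstrap-server kafka:9092 \
  --describe --under-replicated-partitions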
Multiple Broker Failures¶
| Scenario | Impact | Mitigation |
|---|---|---|
| ISR drops below `min.insync.replicas` | `acks=all` producers fail with `NOT_ENOUGH_REPLICAS` | Set RF and `min.insync.replicas` for your failure budget |
| ISR empty | Partition unavailable until a replica catches up | Monitor ISR shrinkage |
| Unclean election | Potential data loss | Disable unclean election |
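A per-topic override makes the failure budget explicit, and partitions already below the minimum ISR can be listed directly (topic name is a placeholder):
# Override min.insync.replicas for one topic
kafka-configs.sh --bootstrap-server kafka:9092 \
  --entity-type topics --entity-name my-topic \
  --alter --add-config min.insync.replicas=2
# List partitions whose ISR has dropped below min.insync.replicas
kafka-topics.sh --bootstrap-server kafka:9092 \
  --describe --under-min-isr-partitions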
Controller Failover¶
For complete Raft consensus mechanics, election protocols, and metadata recovery, see KRaft Deep Dive.
KRaft Controller Quorum¶
Failover Sequence¶
- Leader controller fails
- Remaining voters detect failure (heartbeat timeout)
- New election triggered
- Voter with most up-to-date log wins
- New leader resumes metadata operations
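On a KRaft cluster (Kafka 3.3+), the quorum leader, epoch, and voter lag can be inspected directly:
# Quorum leader, epoch, and high watermark
kafka-metadata-quorum.sh --bootstrap-server kafka:9092 describe --status
# Per-voter replication lag
kafka-metadata-quorum.sh --bootstrap-server kafka:9092 describe --replication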
Rack Awareness¶
Distribute replicas across failure domains to survive rack/zone failures. For complete topology design including network architecture and multi-datacenter layouts, see Topology.
Configuration¶
# Broker configuration
broker.rack=rack1
# Allow consumers that set client.rack to fetch from a same-rack replica
replica.selector.class=org.apache.kafka.common.replica.RackAwareReplicaSelector
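broker.rack drives replica placement when topics are created, while the rack-aware selector only affects which replica a consumer reads from. A quick check (topic and rack names are placeholders):
# Confirm each partition's replicas span brokers in different racks
kafka-topics.sh --bootstrap-server kafka:9092 \
  --describe --topic my-topic
# Consumer fetching from the closest replica (requires the selector above)
kafka-console-consumer.sh --bootstrap-server kafka:9092 \
  --topic my-topic \
  --consumer-property client.rack=rack1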
Data Durability Settings¶
For detailed ISR mechanics, acknowledgment levels, and min.insync.replicas behavior, see Replication.
Producer Configuration¶
# Maximum durability
acks=all
retries=2147483647
delivery.timeout.ms=120000
enable.idempotence=true
# Ordering preserved with idempotence (allows up to 5 in-flight requests)
max.in.flight.requests.per.connection=5
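The same settings can be tried ad hoc with the console producer before changing application code (topic name is a placeholder):
# Console producer using the durability settings above
kafka-console-producer.sh --bootstrap-server kafka:9092 \
  --topic my-topic \
  --producer-property acks=all \
  --producer-property enable.idempotence=true \
  --producer-property delivery.timeout.ms=120000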
Broker Configuration¶
# Replication
default.replication.factor=3
min.insync.replicas=2
unclean.leader.election.enable=false
# Durability (optional overrides)
# log.flush.interval.messages=10000
# log.flush.interval.ms=1000
Durability note
The flush settings above are disabled by default (log.flush.interval.messages and the flush scheduler interval default to Long.MAX_VALUE). Kafka relies on replication rather than per-message fsync for durability, so explicit flush settings are not required and only reduce throughput.
Durability Rules¶
- `acks=all` requires the ISR size to be ≥ `min.insync.replicas`; otherwise the write is rejected.
- With `acks=all`, the system can lose up to `RF - min.insync.replicas` brokers without losing committed data.
- `acks=1` can acknowledge data that is not yet replicated; a leader failure can lose recent records.
Data Loss Conditions¶
| Condition | Result |
|---|---|
| Unclean leader election enabled | Acknowledged data can be lost |
| `acks=1` and leader fails before followers replicate | Acknowledged data can be lost |
Durability Table (Unclean Election Disabled)¶
| acks | min.insync.replicas | RF | Max broker failures without losing acknowledged data |
|---|---|---|---|
| 1 | 1 | 3 | 0 (leader failure can lose recent records) |
| all | 1 | 3 | 2 |
| all | 2 | 3 | 1 |
| all | 2 | 5 | 3 |
Failure Detection¶
Broker Detection¶
| Mechanism | Configuration | Default |
|---|---|---|
| Session timeout | `broker.session.timeout.ms` | 9000 |
| Heartbeat interval | `broker.heartbeat.interval.ms` | 2000 |
Broker heartbeat constraints
The controller enforces broker.heartbeat.interval.ms ≤ broker.session.timeout.ms / 2. The session timeout is configured on controllers; the heartbeat interval is configured on brokers.
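The effective values can be confirmed on a running broker (broker id 1 is a placeholder; --all includes static defaults):
# Inspect effective session/heartbeat settings for broker 1
kafka-configs.sh --bootstrap-server kafka:9092 \
  --entity-type brokers --entity-name 1 \
  --describe --all | grep -E 'broker\.(session\.timeout|heartbeat\.interval)'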
Client Detection¶
| Client | Setting | Default |
|---|---|---|
| Producer | `request.timeout.ms` | 30000 |
| Consumer | `session.timeout.ms` | 45000 |
| Consumer | `heartbeat.interval.ms` | 3000 |
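Client timeouts can be overridden per client; for example, a console consumer with the defaults above made explicit (topic and group names are placeholders):
# Consumer with explicit failure-detection settings (values shown are the defaults)
kafka-console-consumer.sh --bootstrap-server kafka:9092 \
  --topic my-topic --group health-probe \
  --consumer-property session.timeout.ms=45000 \
  --consumer-property heartbeat.interval.ms=3000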
Recovery Procedures¶
Partition Recovery¶
# Check offline partitions
kafka-topics.sh --bootstrap-server kafka:9092 \
--describe --unavailable-partitions
# Check under-replicated
kafka-topics.sh --bootstrap-server kafka:9092 \
--describe --under-replicated-partitions
# Force leader election (if ISR available)
kafka-leader-election.sh --bootstrap-server kafka:9092 \
--election-type preferred \
--topic my-topic \
--partition 0
Unclean Recovery (Last Resort)¶
Data Loss Risk
Unclean leader election can result in data loss. Use only when availability is critical.
# Temporarily enable unclean election
kafka-configs.sh --bootstrap-server kafka:9092 \
--entity-type topics \
--entity-name my-topic \
--alter \
--add-config unclean.leader.election.enable=true
# Trigger election
kafka-leader-election.sh --bootstrap-server kafka:9092 \
--election-type unclean \
--topic my-topic \
--partition 0
# Disable unclean election
kafka-configs.sh --bootstrap-server kafka:9092 \
--entity-type topics \
--entity-name my-topic \
--alter \
--delete-config unclean.leader.election.enable
Monitoring for Failures¶
Critical Alerts¶
| Metric | Condition | Severity |
|---|---|---|
| `OfflinePartitionsCount` | > 0 | Critical |
| `UnderReplicatedPartitions` | > 0 for 5 min | Warning |
| `UnderMinIsrPartitionCount` | > 0 | Critical |
| `ActiveControllerCount` | ≠ 1 | Critical |
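These metrics correspond to the standard Kafka JMX MBeans. A one-shot read with the bundled JmxTool is sketched below, assuming JMX is exposed on port 9999; most deployments scrape the same MBeans with a Prometheus/JMX exporter instead.
# One-shot read of the critical MBeans (JMX port 9999 is an assumption)
kafka-run-class.sh kafka.tools.JmxTool --one-time true \
  --jmx-url service:jmx:rmi:///jndi/rmi://kafka:9999/jmxrmi \
  --object-name kafka.controller:type=KafkaController,name=OfflinePartitionsCount \
  --object-name kafka.controller:type=KafkaController,name=ActiveControllerCount \
  --object-name kafka.server:type=ReplicaManager,name=UnderMinIsrPartitionCount \
  --object-name kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions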
Health Checks¶
#!/bin/bash
# health-check.sh
# Check for offline partitions
OFFLINE=$(kafka-topics.sh --bootstrap-server kafka:9092 \
  --describe --unavailable-partitions 2>/dev/null | grep -c "Partition: ")
if [ "$OFFLINE" -gt 0 ]; then
echo "CRITICAL: $OFFLINE offline partitions"
exit 2
fi
# Check for under-replicated
UNDER_REP=$(kafka-topics.sh --bootstrap-server kafka:9092 \
  --describe --under-replicated-partitions 2>/dev/null | grep -c "Partition: ")
if [ "$UNDER_REP" -gt 0 ]; then
echo "WARNING: $UNDER_REP under-replicated partitions"
exit 1
fi
echo "OK: Cluster healthy"
exit 0
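The exit codes follow the common monitoring convention (0 OK, 1 warning, 2 critical), so the script can be wired into most agents; a minimal cron sketch, assuming it is installed as /usr/local/bin/health-check.sh:
# Run every minute and append results to a log (path is a placeholder)
* * * * * /usr/local/bin/health-check.sh >> /var/log/kafka-health.log 2>&1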
Best Practices¶
| Practice | Rationale |
|---|---|
| Use RF ≥ 3 | Survive multiple failures |
| Set min.insync.replicas = 2 | Ensure durability with acks=all |
| Disable unclean election | Prevent data loss |
| Enable rack awareness | Survive rack failures |
| Regular failover testing | Validate recovery procedures |
| Monitor ISR shrinkage | Detect issues early |
Related Documentation¶
- Replication - Replication protocol
- Brokers - Broker architecture
- Multi-Datacenter - DR strategies
- Operations - Operational procedures