Kafka Monitoring¶

Comprehensive monitoring guide for Apache Kafka clusters.

Monitoring Overview¶

Key Metrics Categories¶

Cluster Health¶

Metric	Description	Alert Threshold
`kafka.controller:ActiveControllerCount`	Active controller count, summed across the cluster	≠ 1
`kafka.server:UnderReplicatedPartitions`	Under-replicated partitions	> 0
`kafka.controller:OfflinePartitionsCount`	Offline partitions	> 0
`kafka.server:UnderMinIsrPartitionCount`	Below min ISR	> 0

Throughput¶

Metric	Description	Notes
`kafka.server:MessagesInPerSec`	Messages per second	Per broker/topic
`kafka.server:BytesInPerSec`	Bytes in per second	Per broker/topic
`kafka.server:BytesOutPerSec`	Bytes out per second	Per broker/topic
`kafka.server:TotalProduceRequestsPerSec`	Produce requests	Per broker
`kafka.server:TotalFetchRequestsPerSec`	Fetch requests	Per broker

Latency¶

Metric	Description	Alert Threshold
`kafka.network:TotalTimeMs,request=Produce`	Produce latency	P99 > 100ms
`kafka.network:TotalTimeMs,request=FetchConsumer`	Fetch latency	P99 > 100ms
`kafka.network:RequestQueueTimeMs`	Queue time	> 10ms
`kafka.network:ResponseQueueTimeMs`	Response queue time	> 10ms

Consumer Lag¶

Metric	Description	Alert Threshold
Consumer lag	Records behind	Growing continuously
Lag growth rate	Lag increase rate	Positive for extended period
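
The lag growth rate can be estimated from two lag samples taken some time apart. The following is a minimal sketch; the function name `lag_growth_rate` and the sample values are illustrative assumptions, not part of any Kafka tooling.

```shell
# lag_growth_rate LAG1 TS1 LAG2 TS2
# Prints the average change in lag (records per second) between two samples.
# Timestamps are in seconds; a positive result means the group is falling behind.
lag_growth_rate() {
    awk -v l1="$1" -v t1="$2" -v l2="$3" -v t2="$4" 'BEGIN {
        if (t2 <= t1) {
            print "error: TS2 must be later than TS1" > "/dev/stderr"
            exit 1
        }
        printf "%.2f\n", (l2 - l1) / (t2 - t1)
    }'
}

# Made-up samples: lag rose from 1000 to 1600 records over 300 seconds.
lag_growth_rate 1000 0 1600 300   # prints 2.00
```

A single positive value is not alarming on its own; the thresholds above refer to a rate that stays positive over an extended period.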

JMX Configuration¶

Enable JMX¶

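# Broker startup
# Note: these options disable JMX authentication and SSL. Expose port 9999 only
# on a trusted network, or enable both for production use.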
export KAFKA_JMX_OPTS="-Dcom.sun.management.jmxremote \
  -Dcom.sun.management.jmxremote.authenticate=false \
  -Dcom.sun.management.jmxremote.ssl=false \
  -Dcom.sun.management.jmxremote.port=9999"

bin/kafka-server-start.sh config/server.properties

JMX Exporter¶

# jmx-exporter.yml
lowercaseOutputName: true
lowercaseOutputLabelNames: true

rules:
  # Broker metrics
  - pattern: kafka.server<type=(.+), name=(.+), topic=(.+)><>Count
    name: kafka_server_$1_$2_total
    labels:
      topic: "$3"
    type: COUNTER

  - pattern: kafka.server<type=(.+), name=(.+)><>Count
    name: kafka_server_$1_$2_total
    type: COUNTER

  # Request metrics
  - pattern: kafka.network<type=RequestMetrics, name=(.+), request=(.+)><>Count
    name: kafka_network_request_$1_total
    labels:
      request: "$2"
    type: COUNTER

  # Percentiles are attributes (e.g. 99thPercentile) of the timing MBeans such as TotalTimeMs
  - pattern: kafka.network<type=RequestMetrics, name=(.+), request=(.+)><>(\d+)thPercentile
    name: kafka_network_request_$1_percentile
    labels:
      request: "$2"
      percentile: "$3"
    type: GAUGE

  # Controller metrics
  - pattern: kafka.controller<type=(.+), name=(.+)><>Value
    name: kafka_controller_$1_$2
    type: GAUGE
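
Alert rules and dashboards must reference the names these rules actually emit. As a rough sketch of the mapping, the hypothetical helper below (`exporter_name` is not part of the JMX exporter) reproduces what a two-capture rule does when `lowercaseOutputName` is enabled: it joins the domain, `type`, and `name` with underscores and lowercases the result.

```shell
# exporter_name DOMAIN TYPE NAME [SUFFIX]
# Mimics a rule of the form "<domain>_$1_$2[_suffix]" with lowercaseOutputName: true:
# dots in the domain become underscores and the whole name is lowercased.
exporter_name() {
    printf '%s_%s_%s%s\n' "$1" "$2" "$3" "${4:+_$4}" \
        | tr '.' '_' \
        | tr '[:upper:]' '[:lower:]'
}

exporter_name kafka.controller KafkaController ActiveControllerCount
# prints kafka_controller_kafkacontroller_activecontrollercount

exporter_name kafka.server BrokerTopicMetrics MessagesInPerSec total
# prints kafka_server_brokertopicmetrics_messagesinpersec_total
```

Note that the generated names contain no underscores between the words of the original MBean name. The exporter's own metrics output remains the authoritative source for the names in use.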

Critical Alerts¶

Immediate Action Required¶

Alert	Condition	Action
Offline Partitions	`OfflinePartitionsCount > 0`	Investigate broker failures
No Controller	Cluster-wide sum of `ActiveControllerCount` ≠ 1	Check controller election
Under Min ISR	`UnderMinIsrPartitionCount > 0`	Check broker health

Warning Level¶

Alert	Condition	Action
Under-Replicated	`UnderReplicatedPartitions > 0` for 5min	Check replication lag
High Produce Latency	P99 > 100ms	Check disk I/O, network
Consumer Lag Growing	Lag increasing continuously	Scale consumers
Disk Usage High	> 80% used	Add storage or adjust retention
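
The disk usage threshold can be checked on the broker host itself. This is a minimal sketch built on `df -P`; the function names and the default threshold of 80 are assumptions taken from the table above, and the path passed in should be whatever directory your `log.dirs` setting points to.

```shell
# disk_usage_pct PATH
# Prints the used percentage (integer) of the filesystem that holds PATH.
disk_usage_pct() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# check_disk PATH [THRESHOLD]
# Returns 0 when usage is at or below THRESHOLD (default 80), 1 when above,
# and 3 when usage could not be determined.
check_disk() {
    pct=$(disk_usage_pct "$1")
    if [ -z "$pct" ]; then
        echo "UNKNOWN: could not read disk usage for $1"
        return 3
    fi
    if [ "$pct" -gt "${2:-80}" ]; then
        echo "WARNING: $1 is ${pct}% full"
        return 1
    fi
    echo "OK: $1 is ${pct}% full"
}
```

For example, `check_disk /var/lib/kafka/data` (a hypothetical data directory) reports on the filesystem holding the Kafka logs. The parsing assumes a mount point without spaces in its name.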

Sample Alert Rules¶

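# alert-rules.yml
# The metric names below must match what your exporter rules emit; adjust them
# if your JMX exporter configuration produces different names.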
groups:
  - name: kafka-critical
    rules:
      - alert: KafkaOfflinePartitions
        expr: kafka_controller_offline_partitions_count > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Kafka has offline partitions"

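      - alert: KafkaNoActiveController
        # Each node reports 0 or 1, so sum across the cluster; exactly 1 is healthy.
        expr: sum(kafka_controller_active_controller_count) != 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Kafka cluster does not have exactly one active controller"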

      - alert: KafkaUnderReplicatedPartitions
        expr: kafka_server_replica_manager_under_replicated_partitions > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Kafka has under-replicated partitions"

      - alert: KafkaConsumerLagGrowing
        # Lag is a gauge, so use delta(); rate() is only meaningful for counters.
        expr: delta(kafka_consumer_group_lag[5m]) > 0
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Consumer lag is continuously growing"

Consumer Lag Monitoring¶

Using kafka-consumer-groups¶

# Check lag for all groups
kafka-consumer-groups.sh --bootstrap-server kafka:9092 \
  --describe --all-groups

# Check specific group
kafka-consumer-groups.sh --bootstrap-server kafka:9092 \
  --describe --group my-consumer-group

Output Interpretation¶

GROUP           TOPIC           PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG
my-group        my-topic        0          1000            1050            50
my-group        my-topic        1          2000            2000            0
my-group        my-topic        2          1500            1600            100

Column	Description
CURRENT-OFFSET	Consumer's committed offset
LOG-END-OFFSET	Latest offset in partition
LAG	LOG-END-OFFSET - CURRENT-OFFSET
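
Since LAG is reported per partition, a per-group total is often more convenient for alerting. The sketch below sums the LAG column of the `--describe` output; `total_lag_by_group` is an illustrative name. It locates the column by its header because the set of columns can vary between Kafka versions.

```shell
# total_lag_by_group
# Reads `kafka-consumer-groups.sh --describe` output on stdin and prints
# "<group> <total lag>" for each group. Rows whose lag is not a number
# (for example "-") are skipped.
total_lag_by_group() {
    awk '
        $1 == "GROUP" {                       # header row: find the LAG column
            for (i = 1; i <= NF; i++) if ($i == "LAG") lag_col = i
            next
        }
        lag_col && $lag_col ~ /^[0-9]+$/ {    # data row with a numeric lag
            total[$1] += $lag_col
        }
        END { for (g in total) print g, total[g] }
    ' | sort
}

# Using the sample output shown above:
total_lag_by_group <<'EOF'
GROUP           TOPIC           PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG
my-group        my-topic        0          1000            1050            50
my-group        my-topic        1          2000            2000            0
my-group        my-topic        2          1500            1600            100
EOF
# prints: my-group 150
```

In practice the function would be fed from the command itself, for example `kafka-consumer-groups.sh --bootstrap-server kafka:9092 --describe --all-groups | total_lag_by_group`.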

Dashboard Panels¶

Cluster Overview¶

Panel	Metrics
Active Controller	`kafka_controller_active_controller_count`
Online Brokers	Count of responding brokers
Offline Partitions	`kafka_controller_offline_partitions_count`
Under-Replicated	`kafka_server_replica_manager_under_replicated_partitions`

Throughput¶

Panel	Metrics
Messages In/s	`kafka_server_broker_topic_metrics_messages_in_total` rate
Bytes In/s	`kafka_server_broker_topic_metrics_bytes_in_total` rate
Bytes Out/s	`kafka_server_broker_topic_metrics_bytes_out_total` rate
Requests/s	`kafka_network_request_total` rate

Latency¶

Panel	Metrics
Produce P99	`kafka_network_request_total_time_ms{request="Produce",quantile="0.99"}`
Fetch P99	`kafka_network_request_total_time_ms{request="FetchConsumer",quantile="0.99"}`
Queue Time	`kafka_network_request_queue_time_ms`

Resources¶

Panel	Metrics
CPU Usage	Host CPU metrics
Memory Usage	Host memory metrics
Disk Usage	`kafka_log_size` per partition
Network I/O	Host network metrics

Health Check Script¶

#!/bin/bash
# kafka-health-check.sh
# Usage: kafka-health-check.sh [bootstrap-server]
# Exit codes: 0 = healthy, 1 = warning, 2 = critical.
# Assumes the Kafka CLI tools are on PATH.

BOOTSTRAP_SERVER="${1:-localhost:9092}"

echo "=== Kafka Health Check ==="

# Check broker connectivity; nothing else can be checked if this fails
echo -n "Broker connectivity: "
if kafka-broker-api-versions.sh --bootstrap-server "$BOOTSTRAP_SERVER" > /dev/null 2>&1; then
    echo "OK"
else
    echo "FAILED"
    exit 2
fi

# Check offline partitions (counts the lines the command prints)
OFFLINE=$(kafka-topics.sh --bootstrap-server "$BOOTSTRAP_SERVER" \
    --describe --unavailable-partitions 2>/dev/null | wc -l)
echo "Offline partitions: $OFFLINE"
if [ "$OFFLINE" -gt 0 ]; then
    echo "CRITICAL: Offline partitions detected"
    exit 2
fi

# Check under-replicated partitions (counts the lines the command prints)
UNDER_REP=$(kafka-topics.sh --bootstrap-server "$BOOTSTRAP_SERVER" \
    --describe --under-replicated-partitions 2>/dev/null | wc -l)
echo "Under-replicated partitions: $UNDER_REP"
if [ "$UNDER_REP" -gt 0 ]; then
    echo "WARNING: Under-replicated partitions detected"
    exit 1
fi

echo "=== All checks passed ==="
exit 0

Replication Metrics¶

Metric	Description	Alert Threshold
`kafka.server:type=ReplicaManager,name=IsrShrinksPerSec`	ISR shrink rate	> 0 during normal operation
`kafka.server:type=ReplicaManager,name=IsrExpandsPerSec`	ISR expansion rate	Should follow shrinks
`kafka.server:type=ReplicaManager,name=FailedIsrUpdatesPerSec`	Failed ISR update rate	> 0
`kafka.server:type=ReplicaManager,name=LeaderCount`	Leader replicas per broker	Uneven distribution
`kafka.server:type=ReplicaManager,name=PartitionCount`	Partitions per broker	Uneven distribution
`kafka.server:type=ReplicaManager,name=OfflineReplicaCount`	Offline replicas	> 0
`kafka.server:type=ReplicaFetcherManager,name=MaxLag,clientId=Replica`	Max follower lag (messages)	Should stay proportional to the maximum produce batch size

Request Processing Metrics¶

Request Time Breakdown¶

Metric	Description	Notes
`kafka.network:type=RequestMetrics,name=TotalTimeMs`	Total request time	Sum of all phases
`kafka.network:type=RequestMetrics,name=RequestQueueTimeMs`	Time waiting in request queue	High values indicate overload
`kafka.network:type=RequestMetrics,name=LocalTimeMs`	Time processing at leader	Disk I/O bound
`kafka.network:type=RequestMetrics,name=RemoteTimeMs`	Time waiting for followers	Non-zero with acks=all
`kafka.network:type=RequestMetrics,name=ResponseQueueTimeMs`	Time in response queue	Network thread saturation
`kafka.network:type=RequestMetrics,name=ResponseSendTimeMs`	Time sending response	Network bandwidth

Request Handler Utilization¶

Metric	Description	Alert Threshold
`kafka.network:type=SocketServer,name=NetworkProcessorAvgIdlePercent`	Network thread idle ratio	< 0.3
`kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent`	Request handler idle ratio	< 0.3
`kafka.network:type=RequestChannel,name=RequestQueueSize`	Pending requests	Growing continuously

Purgatory Metrics¶

Purgatory holds requests waiting for conditions to be met (e.g., acks from replicas).

Metric	Description	Notes
`kafka.server:type=DelayedOperationPurgatory,name=PurgatorySize,delayedOperation=Produce`	Pending produce requests	Non-zero with acks=all (-1)
`kafka.server:type=DelayedOperationPurgatory,name=PurgatorySize,delayedOperation=Fetch`	Pending fetch requests	Depends on the consumer's `fetch.max.wait.ms`

Log and Storage Metrics¶

Metric	Description	Notes
`kafka.log:type=LogFlushStats,name=LogFlushRateAndTimeMs`	Log flush rate and time	Disk performance indicator
`kafka.log:type=LogManager,name=OfflineLogDirectoryCount`	Offline log directories	Should be 0
`kafka.log:type=Log,name=Size,topic=X,partition=Y`	Partition size in bytes	Per-partition storage
`kafka.log:type=Log,name=NumLogSegments,topic=X,partition=Y`	Segment count per partition	Segment management
`kafka.log:type=Log,name=LogStartOffset,topic=X,partition=Y`	First available offset	Retention tracking
`kafka.log:type=Log,name=LogEndOffset,topic=X,partition=Y`	Latest offset	Progress tracking

Controller Metrics¶

Metric	Description	Alert Threshold
`kafka.controller:type=ControllerStats,name=LeaderElectionRateAndTimeMs`	Leader election rate	Non-zero during failures
`kafka.controller:type=ControllerStats,name=UncleanLeaderElectionsPerSec`	Unclean elections	> 0 (potential data loss)
`kafka.controller:type=KafkaController,name=TopicsToDeleteCount`	Pending topic deletions	Should decrease
`kafka.controller:type=KafkaController,name=ReplicasToDeleteCount`	Pending replica deletions	Should decrease
`kafka.controller:type=ControllerEventManager,name=EventQueueSize`	Controller event queue	Growing continuously
`kafka.controller:type=ControllerEventManager,name=EventQueueTimeMs`	Event wait time	High latency

KRaft Monitoring¶

KRaft clusters expose Raft consensus metrics on both controllers and brokers.

Quorum State Metrics¶

Metric	Description	Notes
`kafka.server:type=raft-metrics,name=current-state`	Node state	leader, follower, candidate, observer
`kafka.server:type=raft-metrics,name=current-leader`	Current leader ID	-1 indicates unknown
`kafka.server:type=raft-metrics,name=current-epoch`	Current quorum epoch	Increments on elections
`kafka.server:type=raft-metrics,name=high-watermark`	Committed log offset	-1 if unknown
`kafka.server:type=raft-metrics,name=log-end-offset`	End of Raft log	Replication progress

Quorum Performance Metrics¶

Metric	Description	Alert Threshold
`kafka.server:type=raft-metrics,name=commit-latency-avg`	Average commit latency	Increasing trend
`kafka.server:type=raft-metrics,name=commit-latency-max`	Maximum commit latency	Spikes
`kafka.server:type=raft-metrics,name=election-latency-avg`	Average election time	Extended elections
`kafka.server:type=raft-metrics,name=fetch-records-rate`	Record fetch rate	Replication throughput
`kafka.server:type=raft-metrics,name=append-records-rate`	Record append rate	Write throughput

Group Coordinator Monitoring¶

The group coordinator manages consumer group membership and offset storage.

Partition State Metrics¶

Metric	Description	Notes
`kafka.server:type=group-coordinator-metrics,name=num-partitions,state=loading`	Loading partitions	Should be transient
`kafka.server:type=group-coordinator-metrics,name=num-partitions,state=active`	Active partitions	Normal operation
`kafka.server:type=group-coordinator-metrics,name=num-partitions,state=failed`	Failed partitions	Should be 0

Consumer Group State Metrics¶

Metric	Description	Notes
`kafka.server:type=group-coordinator-metrics,name=consumer-group-count,state=stable`	Stable groups	Normal state
`kafka.server:type=group-coordinator-metrics,name=consumer-group-count,state=empty`	Empty groups	No active members
`kafka.server:type=group-coordinator-metrics,name=consumer-group-count,state=assigning`	Groups assigning partitions	Rebalance in progress
`kafka.server:type=group-coordinator-metrics,name=consumer-group-count,state=reconciling`	Groups reconciling	Incremental rebalance
`kafka.server:type=group-coordinator-metrics,name=consumer-group-rebalance-rate`	Rebalance frequency	High rate indicates instability

Offset Management Metrics¶

Metric	Description	Notes
`kafka.server:type=group-coordinator-metrics,name=offset-commit-rate`	Offset commit rate	Consumer activity
`kafka.server:type=group-coordinator-metrics,name=offset-expiration-rate`	Offset expiration rate	Inactive consumers
`kafka.server:type=GroupMetadataManager,name=NumOffsets`	Total committed offsets	Storage overhead
`kafka.server:type=GroupMetadataManager,name=NumGroups`	Total consumer groups	Group management

Tiered Storage Monitoring¶

For clusters with tiered storage enabled, monitor remote storage operations.

Remote Storage Throughput¶

Metric	Description	Notes
`kafka.server:type=BrokerTopicMetrics,name=RemoteFetchBytesPerSec`	Bytes read from remote	Cold read volume
`kafka.server:type=BrokerTopicMetrics,name=RemoteFetchRequestsPerSec`	Remote fetch requests	Cold read frequency
`kafka.server:type=BrokerTopicMetrics,name=RemoteCopyBytesPerSec`	Bytes copied to remote	Upload throughput
`kafka.server:type=BrokerTopicMetrics,name=RemoteCopyRequestsPerSec`	Copy requests to remote	Upload frequency
`kafka.server:type=BrokerTopicMetrics,name=RemoteDeleteRequestsPerSec`	Delete requests	Retention cleanup

Remote Storage Lag¶

Metric	Description	Alert Threshold
`kafka.server:type=BrokerTopicMetrics,name=RemoteCopyLagBytes`	Bytes pending upload	Growing continuously
`kafka.server:type=BrokerTopicMetrics,name=RemoteCopyLagSegments`	Segments pending upload	> configured threshold
`kafka.server:type=BrokerTopicMetrics,name=RemoteDeleteLagBytes`	Bytes pending deletion	Growing continuously
`kafka.server:type=BrokerTopicMetrics,name=RemoteDeleteLagSegments`	Segments pending deletion	> configured threshold

Remote Storage Errors¶

Metric	Description	Alert Threshold
`kafka.server:type=BrokerTopicMetrics,name=RemoteFetchErrorsPerSec`	Remote read errors	> 0
`kafka.server:type=BrokerTopicMetrics,name=RemoteCopyErrorsPerSec`	Remote write errors	> 0
`kafka.server:type=BrokerTopicMetrics,name=RemoteDeleteErrorsPerSec`	Remote delete errors	> 0

Remote Storage Thread Pool¶

Metric	Description	Alert Threshold
`org.apache.kafka.storage.internals.log:type=RemoteStorageThreadPool,name=RemoteLogReaderTaskQueueSize`	Read task queue	Growing continuously
`org.apache.kafka.storage.internals.log:type=RemoteStorageThreadPool,name=RemoteLogReaderAvgIdlePercent`	Read thread utilization	< 0.3
`kafka.log.remote:type=RemoteLogManager,name=RemoteLogManagerTasksAvgIdlePercent`	Copy thread utilization	< 0.3

Producer Client Metrics¶

Client-side metrics for monitoring producer applications.

Throughput Metrics¶

Metric	Description	Notes
`kafka.producer:type=producer-metrics,name=record-send-rate`	Records sent per second	Production rate
`kafka.producer:type=producer-metrics,name=byte-rate`	Bytes sent per second	Bandwidth usage
`kafka.producer:type=producer-metrics,name=compression-rate-avg`	Compression ratio	< 1.0 indicates compression
`kafka.producer:type=producer-metrics,name=record-size-avg`	Average record size	Sizing validation

Latency Metrics¶

Metric	Description	Alert Threshold
`kafka.producer:type=producer-metrics,name=request-latency-avg`	Average request latency	Increasing trend
`kafka.producer:type=producer-metrics,name=request-latency-max`	Maximum request latency	Spikes
`kafka.producer:type=producer-metrics,name=record-queue-time-avg`	Time in buffer	High indicates backpressure
`kafka.producer:type=producer-metrics,name=produce-throttle-time-avg`	Throttle time	> 0 indicates quota hit

Buffer Metrics¶

Metric	Description	Alert Threshold
`kafka.producer:type=producer-metrics,name=buffer-available-bytes`	Available buffer space	Approaching 0
`kafka.producer:type=producer-metrics,name=buffer-total-bytes`	Total buffer size	Configuration reference
`kafka.producer:type=producer-metrics,name=bufferpool-wait-ratio`	Time waiting for buffer	> 0 indicates memory pressure
`kafka.producer:type=producer-metrics,name=batch-size-avg`	Average batch size	Tuning indicator

Error Metrics¶

Metric	Description	Alert Threshold
`kafka.producer:type=producer-metrics,name=record-error-rate`	Record error rate	> 0
`kafka.producer:type=producer-metrics,name=record-retry-rate`	Record retry rate	High rate

Consumer Client Metrics¶

Client-side metrics for monitoring consumer applications.

Throughput Metrics¶

Metric	Description	Notes
`kafka.consumer:type=consumer-fetch-manager-metrics,name=records-consumed-rate`	Records consumed per second	Consumption rate
`kafka.consumer:type=consumer-fetch-manager-metrics,name=bytes-consumed-rate`	Bytes consumed per second	Bandwidth usage
`kafka.consumer:type=consumer-fetch-manager-metrics,name=fetch-rate`	Fetch request rate	Request frequency
`kafka.consumer:type=consumer-fetch-manager-metrics,name=records-per-request-avg`	Records per fetch	Efficiency indicator

Lag Metrics¶

Metric	Description	Alert Threshold
`kafka.consumer:type=consumer-fetch-manager-metrics,name=records-lag-max`	Maximum partition lag	Growing continuously
`kafka.consumer:type=consumer-fetch-manager-metrics,name=records-lag,partition=X`	Per-partition lag	Above threshold
`kafka.consumer:type=consumer-fetch-manager-metrics,name=records-lead-min`	Minimum lead (distance to start)	Approaching 0

Rebalance Metrics¶

Metric	Description	Alert Threshold
`kafka.consumer:type=consumer-coordinator-metrics,name=rebalance-total`	Total rebalances	High count
`kafka.consumer:type=consumer-coordinator-metrics,name=rebalance-rate-per-hour`	Rebalance frequency	> 1-2 per hour
`kafka.consumer:type=consumer-coordinator-metrics,name=rebalance-latency-avg`	Average rebalance time	> configured session timeout
`kafka.consumer:type=consumer-coordinator-metrics,name=assigned-partitions`	Assigned partition count	Uneven distribution

Heartbeat Metrics¶

Metric	Description	Alert Threshold
`kafka.consumer:type=consumer-coordinator-metrics,name=heartbeat-rate`	Heartbeats per second	Below expected rate
`kafka.consumer:type=consumer-coordinator-metrics,name=heartbeat-response-time-max`	Max heartbeat response time	Approaching session timeout
`kafka.consumer:type=consumer-coordinator-metrics,name=last-heartbeat-seconds-ago`	Time since last heartbeat	Approaching session timeout

Commit Metrics¶

Metric	Description	Notes
`kafka.consumer:type=consumer-coordinator-metrics,name=commit-rate`	Commit rate	Commit frequency
`kafka.consumer:type=consumer-coordinator-metrics,name=commit-latency-avg`	Average commit latency	Performance indicator

Kafka Streams Metrics¶

For Kafka Streams applications, monitor stream processing performance.

Thread Metrics¶

Metric	Description	Notes
`kafka.streams:type=stream-thread-metrics,name=state`	Thread state	RUNNING, PARTITIONS_ASSIGNED, etc.
`kafka.streams:type=stream-thread-metrics,name=commit-rate`	Commits per second	Processing frequency
`kafka.streams:type=stream-thread-metrics,name=poll-rate`	Polls per second	Input rate
`kafka.streams:type=stream-thread-metrics,name=process-rate`	Records processed per second	Processing throughput

Processing Latency¶

Metric	Description	Alert Threshold
`kafka.streams:type=stream-thread-metrics,name=process-latency-avg`	Average processing time	Increasing trend
`kafka.streams:type=stream-thread-metrics,name=commit-latency-avg`	Average commit time	High latency
`kafka.streams:type=stream-thread-metrics,name=poll-latency-avg`	Average poll time	High latency
`kafka.streams:type=stream-thread-metrics,name=punctuate-latency-avg`	Average punctuate time	High latency

Task Metrics¶

Metric	Description	Notes
`kafka.streams:type=stream-thread-metrics,name=task-created-rate`	Task creation rate	Rebalance activity
`kafka.streams:type=stream-thread-metrics,name=task-closed-rate`	Task close rate	Rebalance activity
`kafka.streams:type=stream-task-metrics,name=process-rate`	Per-task processing rate	Task-level throughput
`kafka.streams:type=stream-task-metrics,name=dropped-records-rate`	Dropped record rate	Data loss indicator

State Store Metrics¶

Metric	Description	Notes
`kafka.streams:type=stream-state-metrics,name=put-rate`	State store write rate	Write throughput
`kafka.streams:type=stream-state-metrics,name=get-rate`	State store read rate	Read throughput
`kafka.streams:type=stream-state-metrics,name=flush-rate`	State store flush rate	Persistence frequency
`kafka.streams:type=stream-state-metrics,name=restore-rate`	State restoration rate	Recovery progress

Quota Metrics¶

Monitor client quota enforcement.

Metric	Description	Notes
`kafka.server:type=Produce,user=X,client-id=Y,name=throttle-time`	Producer throttle time	> 0 indicates quota exceeded
`kafka.server:type=Fetch,user=X,client-id=Y,name=throttle-time`	Consumer throttle time	> 0 indicates quota exceeded
`kafka.server:type=Request,user=X,client-id=Y,name=throttle-time`	Request throttle time	> 0 indicates quota exceeded

Security Metrics¶

Monitor authentication and authorization.

Metric	Description	Alert Threshold
`kafka.server:type=socket-server-metrics,name=successful-authentication-rate`	Successful auth rate	Reference baseline
`kafka.server:type=socket-server-metrics,name=failed-authentication-rate`	Failed auth rate	> 0
`kafka.network:type=SocketServer,name=ExpiredConnectionsKilledCount`	Connections killed (auth expiry)	> 0 with re-auth enabled

Kafka 4.2 Metrics Changes¶

Metric Naming Convention (KIP-1100)

Kafka 4.2 renames some metrics so that they follow the `kafka.COMPONENT` naming convention. After upgrading, check dashboards and alerting rules for references to the old names.
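
One way to spot broken references after an upgrade is to compare the metric names used in alert rules against the names the exporter currently publishes. The following is a rough sketch under two assumptions: metric names start with `kafka_`, and each rule keeps its expression on the same line as `expr:`. The function name `missing_metrics` is illustrative.

```shell
# missing_metrics RULES_FILE EXPORTED_FILE
# RULES_FILE:    a Prometheus rule file with single-line "expr:" entries
# EXPORTED_FILE: one currently exported metric name per line
# Prints every kafka_* metric referenced in the rules but absent from the export.
missing_metrics() {
    grep -E '^[[:space:]]*expr:' "$1" \
        | grep -oE 'kafka_[a-zA-Z0-9_]+' \
        | sort -u \
        | while read -r metric; do
              grep -qxF "$metric" "$2" || echo "$metric"
          done
}
```

The list of exported names can be collected from the exporter's metrics output before and after the upgrade. Any name this prints points to a rule that will silently stop matching.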

New Metrics in Kafka 4.2¶

Metric	KIP	Description
`kafka.controller:AvgIdleRatio`	KIP-1190	Controller thread idle ratio. Low values indicate the controller is under heavy load
`kafka.server:AvgIdleRatio` (MetadataLoader)	KIP-1229	MetadataLoader thread idle ratio. Monitors metadata processing capacity
`kafka.server:RequestHandlerAvgIdlePercent`	KIP-1207	Fixed in KRaft combined mode to report accurately (previously incorrect in combined controller+broker nodes)
Feature level metrics	KIP-1180	Generic metrics for finalized and supported feature levels across the cluster
`client-id` tag on AppInfo	KIP-1120	AppInfo metrics now include a `client-id` tag for distinguishing between client instances
`application-id` tag on Streams state	KIP-1221	Kafka Streams client state metric now includes an `application-id` tag
Share partition lag	KIP-1226	Lag metrics for share group partition consumption progress

Related Documentation¶

Operations - Operations overview
CLI Tools - Command reference
Troubleshooting - Problem diagnosis
Performance - Performance tuning
