Skip to content
Maintained by AxonOps — production-grade documentation from engineers who operate distributed databases at scale Get Cassandra Help Get Kafka Help

AxonOps Kafka System Dashboard Metrics Mapping

Overview

The Kafka System Dashboard provides comprehensive monitoring of system-level resources across your Kafka cluster. It tracks CPU utilization, disk I/O, memory usage, JVM performance, and network statistics to ensure optimal cluster health and performance.

Metrics Mapping

Dashboard Metric Description Attributes
CPU Metrics
host_CPU_Percent_Merge Overall CPU utilization percentage time='real'
host_CPU (mode='iowait') CPU time waiting for I/O operations mode='iowait'
host_load15 15-minute load average -
Disk Metrics
host_Disk_UsedPercent Percentage of disk space used -
host_Disk_Used Absolute disk space used in bytes -
host_Disk_SectorsRead Disk read throughput in bytes/sec -
host_Disk_SectorsWrite Disk write throughput in bytes/sec -
host_Disk_IOCount Input/output operations per second -
host_Disk_avgqsz Average disk queue size -
host_Disk_WeightedIO Weighted I/O time in milliseconds -
host_Disk_IoTime Time spent on disk I/O operations -
host_filefd_allocated Number of allocated file descriptors -
host_filefd_max Maximum available file descriptors -
Memory Metrics
host_Memory_Used Used system memory in bytes -
host_Memory_Cached Cached memory in bytes -
host_Memory_UsedPercent Memory usage percentage -
JVM Metrics
jvm_Threading_ JVM thread count type=Threading
jvm_GarbageCollector_G1_Young_Generation G1 GC statistics name=G1 Young Generation
jvm_GarbageCollector_ZGC ZGC statistics name=ZGC
jvm_GarbageCollector_Shenandoah_Cycles Shenandoah GC statistics name=Shenandoah
jvm_GarbageCollector_ConcurrentMarkSweep CMS GC statistics name=ConcurrentMarkSweep
jvm_GarbageCollector_ParNew ParNew GC statistics name=ParNew
jvm_Memory_ JVM heap and non-heap memory usage type=Memory
Network Metrics
host_netIOCounters_BytesRecv Network bytes received per second -
host_netIOCounters_BytesSent Network bytes sent per second -
host_ntp_offset_seconds NTP time offset in seconds -

Query Examples

CPU Usage by Rack

avg(host_CPU_Percent_Merge{time='real',rack=~'$rack',host_id=~'$host_id', node_type='$node_type', type='kafka'}) by (rack)

I/O Wait Percentage

avg(host_CPU{axonfunction='rate',mode='iowait',rack=~'$rack',host_id=~'$host_id',node_type='$node_type', type='kafka'}) by (host_id) * 100

Disk I/O Throughput

// Read throughput
host_Disk_SectorsRead{axonfunction='rate',rack=~'$rack',host_id=~'$host_id',partition=~'$partition',node_type='$node_type', type='kafka'}

// Write throughput
host_Disk_SectorsWrite{axonfunction='rate',rack=~'$rack',host_id=~'$host_id',partition=~'$partition', node_type='$node_type', type='kafka'}

File Descriptor Usage

(host_filefd_allocated{rack=~'$rack',host_id=~'$host_id'} / host_filefd_max{rack=~'$rack',host_id=~'$host_id', node_type='$node_type', type='kafka'})*100

JVM Heap Memory Usage

jvm_Memory_{function='used',scope='HeapMemoryUsage',rack=~'$rack',host_id=~'$host_id', node_type='$node_type', type='kafka'}

GC Rate

jvm_GarbageCollector_G1_Young_Generation{axonfunction='rate',function='CollectionCount',rack=~'$rack',host_id=~'$host_id', node_type='$node_type', type='kafka'}

Panel Organization

Overview Section

  • Average CPU Usage per Rack

CPU and Load

  • CPU usage per host
  • Load Average (15m)
  • Avg IO wait CPU per Host

Disk Statistics

  • Disk % Usage by mount point
  • Used Disk Space Per Node
  • Bytes Read/Write Per Second
  • IOPS
  • Disk avgqsz
  • Disk WeightedIO time
  • % File Descriptors Allocated
  • Time Spent on Disk IO

Memory Statistics

  • Used memory
  • Cached memory
  • Used Memory Percentage
  • JVM Thread Count
  • GC Count per sec
  • GC Duration
  • JVM Utilization (Heap/Non-Heap)
  • JVM Heap Utilization

Network Statistics

  • Network Received (bytes)
  • Network Transmitted (bytes)
  • NTP offset (milliseconds)

Filters

  • node_type: Filter by node type (broker, controller, etc.)

  • rack: Filter by rack location

  • host_id: Filter by specific host/node

  • mountpoint: Filter by disk mount point

  • partition: Filter by disk partition

  • Interface: Filter by network interface

Best Practices

CPU Monitoring

  • Monitor CPU usage to ensure it stays below 80% for production workloads
  • High I/O wait indicates disk bottlenecks
  • Monitor load average relative to CPU core count

Disk Monitoring

  • Keep disk usage below 85% to prevent performance degradation
  • Monitor IOPS and queue size for disk saturation
  • Track weighted I/O time for disk latency issues

Memory Monitoring

  • Monitor both system and JVM memory usage
  • Ensure adequate memory for page cache (cached memory)
  • Track JVM heap usage to prevent OutOfMemory errors

GC Monitoring

  • Monitor GC frequency and duration
  • Excessive GC activity indicates memory pressure
  • Different GC algorithms (G1, ZGC, Shenandoah) have different characteristics

Network Monitoring

  • Track network throughput for replication and client traffic
  • Monitor NTP offset to ensure time synchronization
  • High network usage may indicate replication storms

File Descriptors

  • Monitor file descriptor usage to prevent "too many open files" errors
  • Kafka requires many file descriptors for log segments and network connections