Cutting Kafka Costs with Strimzi and AxonOps

So you chose self-hosted Kafka

If you read Kafka Cost Comparison 2026: Self-Hosted vs Amazon MSK vs Confluent Cloud, the next question is obvious: how do you actually self-host Kafka and end up with a better experience than MSK or Confluent Cloud?

This post is about how to make that self-hosted choice work well in practice. Getting Kafka onto Kubernetes is only part of the job; the harder part is ending up with a platform that engineers can operate comfortably once it is in production. Strimzi and AxonOps divide that work cleanly. Strimzi handles lifecycle reconciliation, including brokers, node pools, version changes, and the Kubernetes resources that define the cluster. AxonOps handles the day-2 work, including broker health, consumer lag, topics, ACLs, logs, and the workflows engineers use during routine operations and incidents.

This post walks through the AxonOps Strimzi integration published in the axonops-containers repository and shows how to bring up a Strimzi-managed Kafka cluster that reports into AxonOps.

If you want the broader operating model behind this setup, Running Kafka at Scale Without a Platform Team covers the control-plane side in more detail.

The GitHub repository is the source of truth for this integration. It contains the AxonOps-enabled Strimzi image, the startup wrapper, the example manifests, and the cloud deployment examples. If you want to try this yourself, start with:

Deployment architecture

At a high level, the deployment is cleanly split into two layers. Strimzi watches the Kafka custom resources and reconciles controller and broker node pools inside Kubernetes. The AxonOps-enabled Kafka image then starts the AxonOps agent inside those same pods, so metrics, logs, and operational metadata flow out to AxonOps without changing how Strimzi manages the cluster.

Deployment architecture showing Strimzi operator managing Kafka controller and broker node pools in Kubernetes, with AxonOps agents inside the pods sending metrics and logs to the AxonOps platform
Strimzi remains the reconciliation layer for Kafka on Kubernetes. AxonOps sits above that layer as the monitoring and management layer, with the agent embedded into the Kafka image and started inside the broker and controller pods.

There are four practical points to take from this layout:

  • Strimzi still owns the Kafka lifecycle and Kubernetes reconciliation.
  • The AxonOps agent starts inside each Kafka pod rather than as a separate external scraper.
  • Broker and controller roles are identified through the KAFKA_NODE_TYPE environment variable in each node pool.
  • Engineers work from AxonOps for monitoring and day-2 operations while leaving pod orchestration to Strimzi.

What teams usually want from a Kafka platform

Teams moved toward MSK and Confluent for understandable reasons: easier rollouts, a cleaner upgrade story, built-in monitoring, simpler security administration, and less time spent assembling Kafka tooling around the cluster.

The important point here is that a self-hosted Kafka platform no longer has to give up those qualities. With Strimzi looking after the Kubernetes and broker lifecycle, and AxonOps covering monitoring, alerting, logs, lag, topics, ACLs, and AI-assisted diagnosis, most of the practical reasons teams moved to MSK or Confluent are now covered in an open self-hosted model as well. What remains with self-hosting is the part some teams actively want: control over the infrastructure, the Kafka version, the network boundaries, and the cost profile.

| Capability teams usually expect from MSK/Confluent | MSK/Confluent | Strimzi + AxonOps |
| --- | --- | --- |
| Broker lifecycle management | Yes | Yes, via Strimzi |
| Monitoring, lag, logs, and alerting | Yes | Yes, via AxonOps |
| Topic and ACL administration | Yes | Yes, in AxonOps |
| Terraform-driven platform management | Yes | Yes, with Terraform |
| Faster diagnosis during incidents | Yes | Yes, with AI-assisted diagnosis |
| Open Kafka with infrastructure control | No, not in the same way | Yes |

What the AxonOps Strimzi image changes

The AxonOps Strimzi image starts from the official Strimzi Kafka base image and adds the AxonOps agent packages during the image build. In the current Dockerfile, the image:

  • installs the core axon-agent
  • installs the Kafka-specific AxonOps agent package
  • injects an AxonOps wrapper into the Strimzi Kafka startup scripts
  • creates a persistent AxonOps state location under Kafka data volume 0
  • starts the AxonOps agent when the broker or controller process starts

That last detail is important. The integration is not trying to replace Strimzi’s reconciliation loop. Strimzi still owns the Kafka process lifecycle. The AxonOps wrapper simply ensures the Java agent and supporting process are started in the same pod so metrics and logs are reported automatically.

The current wrapper script does three notable things:

  1. It appends the AxonOps Java agent to KAFKA_OPTS.
  2. It copies AxonOps state into /var/lib/kafka/data-0/axonops so the agent state survives normal restarts.
  3. It handles role-specific details such as controller.log symlinking for KRaft controllers.

That is why the example broker node pool contains a comment saying AxonOps requires a storage volume with index 0. The wrapper expects that layout.
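To make those three behaviours concrete, here is a simplified, illustrative sketch of the wrapper logic. The agent jar path and directory variables below are stand-ins for illustration only; the real script in the axonops-containers repository is the source of truth.

```shell
# Illustrative sketch of the wrapper behaviour (NOT the real script).
# DATA_DIR stands in for /var/lib/kafka/data-0; AGENT_JAR is a hypothetical path.
DATA_DIR="${DATA_DIR:-$(mktemp -d)}"
AGENT_JAR="/opt/axonops/axon-kafka-agent.jar"

# 1. Append the AxonOps Java agent to KAFKA_OPTS so Kafka loads it at startup.
export KAFKA_OPTS="${KAFKA_OPTS:-} -javaagent:${AGENT_JAR}"

# 2. Keep agent state under data volume 0 so it survives normal pod restarts.
mkdir -p "${DATA_DIR}/axonops"

# 3. KRaft controllers log to server.log; expose it under the name AxonOps expects.
LOG_DIR="${LOG_DIR:-${DATA_DIR}/logs}"
mkdir -p "${LOG_DIR}"
touch "${LOG_DIR}/server.log"
if [ "${KAFKA_NODE_TYPE:-}" = "kraft-controller" ]; then
  ln -sf "${LOG_DIR}/server.log" "${LOG_DIR}/controller.log"
fi

echo "KAFKA_OPTS now:${KAFKA_OPTS}"
```

The point of the sketch is the shape, not the exact paths: the wrapper only adjusts the environment and filesystem of the pod it runs in, so Strimzi's process management is untouched.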

Prerequisites

The current Strimzi integration in the AxonOps containers repository assumes:

  • Kubernetes 1.24+
  • Helm 3.x
  • kubectl
  • a running AxonOps environment and API key
  • Strimzi in KRaft mode

The example manifests in the development branch are based on:

  • Strimzi 0.50.0
  • Kafka 4.1.1
  • separate controller and broker node pools

ZooKeeper mode is not the target here. The published integration is KRaft-first.

Install the Strimzi operator

There are two sensible ways to deploy this integration. The repository examples use rendered YAML with envsubst. If your platform is already managed through Terraform, the same deployment can live in that workflow as well.

Option 1: follow the published manifests

Start by installing Strimzi itself:

helm repo add strimzi https://strimzi.io/charts/
helm install my-strimzi-kafka-operator strimzi/strimzi-kafka-operator \
  --version 0.50.0 \
  --set watchAnyNamespace=true

Then create the namespace for Kafka:

kubectl create namespace kafka

This is the quickest way to follow the published examples and it matches the rest of this post.

Option 2: manage the deployment in Terraform

If you want the deployment managed through Terraform, the clean split is:

  • helm_release for the Strimzi operator
  • kubernetes_secret_v1 for the AxonOps settings
  • kubernetes_manifest for the logging ConfigMap, controller pool, broker pool, and Kafka resource

A minimal starting point looks like this:

terraform {
  required_providers {
    helm = {
      source  = "hashicorp/helm"
      version = "~> 2.14"
    }
    kubernetes = {
      source  = "hashicorp/kubernetes"
      version = "~> 2.30"
    }
  }
}

provider "helm" {
  kubernetes {
    config_path = var.kubeconfig_path
  }
}

provider "kubernetes" {
  config_path = var.kubeconfig_path
}

resource "kubernetes_namespace" "kafka" {
  metadata {
    name = var.kafka_namespace
  }
}

resource "helm_release" "strimzi" {
  name       = "strimzi-operator"
  repository = "https://strimzi.io/charts/"
  chart      = "strimzi-kafka-operator"
  version    = "0.50.0"
  namespace  = kubernetes_namespace.kafka.metadata[0].name

  set {
    name  = "watchAnyNamespace"
    value = "true"
  }
}

resource "kubernetes_secret_v1" "axonops" {
  metadata {
    name      = "axonops-config"
    namespace = kubernetes_namespace.kafka.metadata[0].name
  }

  type = "Opaque"

  string_data = {
    AXON_AGENT_CLUSTER_NAME = var.cluster_name
    AXON_AGENT_ORG          = var.axonops_org
    AXON_AGENT_KEY          = var.axonops_key
    AXON_AGENT_SERVER_HOST  = var.axonops_host
    AXON_AGENT_SERVER_PORT  = tostring(var.axonops_port)
    AXON_AGENT_TLS_MODE     = "TLS"
  }
}

From there, render the Strimzi resources as templates and apply them with kubernetes_manifest:

resource "kubernetes_manifest" "controller_pool" {
  depends_on = [helm_release.strimzi, kubernetes_secret_v1.axonops]

  manifest = yamldecode(templatefile("${path.module}/manifests/kafka-node-pool-controller.yaml.tftpl", {
    kafka_namespace       = var.kafka_namespace
    cluster_name          = var.cluster_name
    kafka_version         = var.kafka_version
    kafka_container_image = var.kafka_container_image
    axonops_secret_name   = kubernetes_secret_v1.axonops.metadata[0].name
  }))
}

resource "kubernetes_manifest" "broker_pool" {
  depends_on = [helm_release.strimzi, kubernetes_secret_v1.axonops]

  manifest = yamldecode(templatefile("${path.module}/manifests/kafka-node-pool-brokers.yaml.tftpl", {
    kafka_namespace       = var.kafka_namespace
    cluster_name          = var.cluster_name
    kafka_version         = var.kafka_version
    kafka_container_image = var.kafka_container_image
    axonops_secret_name   = kubernetes_secret_v1.axonops.metadata[0].name
  }))
}

resource "kubernetes_manifest" "kafka_cluster" {
  depends_on = [
    helm_release.strimzi,
    kubernetes_manifest.controller_pool,
    kubernetes_manifest.broker_pool
  ]

  manifest = yamldecode(templatefile("${path.module}/manifests/kafka-cluster.yaml.tftpl", {
    kafka_namespace       = var.kafka_namespace
    cluster_name          = var.cluster_name
    kafka_version         = var.kafka_version
    kafka_container_image = var.kafka_container_image
  }))
}

Inside the broker and controller templates, reference the AxonOps values from the secret rather than writing them inline:

env:
  - name: KAFKA_NODE_TYPE
    value: kraft-broker
  - name: AXON_AGENT_CLUSTER_NAME
    valueFrom:
      secretKeyRef:
        name: ${axonops_secret_name}
        key: AXON_AGENT_CLUSTER_NAME
  - name: AXON_AGENT_ORG
    valueFrom:
      secretKeyRef:
        name: ${axonops_secret_name}
        key: AXON_AGENT_ORG
  - name: AXON_AGENT_KEY
    valueFrom:
      secretKeyRef:
        name: ${axonops_secret_name}
        key: AXON_AGENT_KEY

The rest of this post follows the published envsubst path from the repository examples, but the same values and resource ordering apply if you are using Terraform.

Prepare the Strimzi and AxonOps settings

The cloud examples in the repository use a sourced environment file and envsubst to render the manifests. The current strimzi-config.env in the example set includes values such as:

export KAFKA_NAMESPACE=kafka
export STRIMZI_CLUSTER_NAME=axonops-kafka
export KAFKA_VERSION=4.1.1
export KAFKA_CONTAINER_IMAGE=ghcr.io/axonops/strimzi/kafka:0.50.0-4.1.1-2.0.19-0.1.12

export STRIMZI_BROKER_REPLICAS=6
export STRIMZI_CONTROLLER_REPLICAS=3

export AXON_AGENT_CLUSTER_NAME=$STRIMZI_CLUSTER_NAME
export AXON_AGENT_ORG=example
export AXON_AGENT_KEY=CHANGEME
export AXON_AGENT_SERVER_HOST=agents.axonops.cloud
export AXON_AGENT_SERVER_PORT=443
export AXON_AGENT_TLS_MODE=TLS

For a first deployment, the fields you need to set carefully are:

  • STRIMZI_CLUSTER_NAME
  • KAFKA_CONTAINER_IMAGE
  • AXON_AGENT_CLUSTER_NAME
  • AXON_AGENT_ORG
  • AXON_AGENT_KEY
  • AXON_AGENT_SERVER_HOST
  • AXON_AGENT_SERVER_PORT
  • AXON_AGENT_TLS_MODE

Then load them into your shell:

source strimzi-config.env

If you are following the published examples directly, use examples/strimzi/cloud/strimzi-config.env as the starting point rather than recreating the file by hand.
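Before rendering any manifests, it is worth failing fast on missing values. The check below is a small illustrative helper, not part of the repository; it sets example values inline so the snippet is self-contained, but in practice you would run the loop after sourcing strimzi-config.env.

```shell
# Illustrative pre-flight check: verify required AxonOps settings are set
# and not left at placeholders. The example exports stand in for values
# normally loaded via `source strimzi-config.env`.
export STRIMZI_CLUSTER_NAME=axonops-kafka
export AXON_AGENT_ORG=example
export AXON_AGENT_KEY=replace-with-a-real-key
export AXON_AGENT_SERVER_HOST=agents.axonops.cloud

missing=0
for v in STRIMZI_CLUSTER_NAME AXON_AGENT_ORG AXON_AGENT_KEY AXON_AGENT_SERVER_HOST; do
  eval "val=\${$v:-}"
  if [ -z "$val" ] || [ "$val" = "CHANGEME" ]; then
    echo "ERROR: $v is unset or still a placeholder" >&2
    missing=1
  fi
done
if [ "$missing" -eq 0 ]; then
  echo "AxonOps settings look complete"
fi
```

Catching a leftover CHANGEME here is much cheaper than diagnosing a silent agent that never registers with AxonOps after deployment.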

The key Strimzi resources

The example deployment is made of four core resources:

  • one Kafka resource
  • one controller KafkaNodePool
  • one broker KafkaNodePool
  • one logging ConfigMap

The Kafka resource enables both node pools and KRaft:

metadata:
  annotations:
    strimzi.io/node-pools: enabled
    strimzi.io/kraft: enabled
spec:
  kafka:
    version: ${KAFKA_VERSION}
    image: ${KAFKA_CONTAINER_IMAGE}

That image line is the key hand-off. You are still deploying a normal Strimzi-managed Kafka cluster, but the Kafka container image now includes the AxonOps agent components.

The broker and controller node pools then tell AxonOps what each pod is supposed to be. The examples do that with pod-level environment variables:

env:
  - name: KAFKA_NODE_TYPE
    value: kraft-broker
  - name: AXON_AGENT_CLUSTER_NAME
    value: "${AXON_AGENT_CLUSTER_NAME}"
  - name: AXON_AGENT_ORG
    value: "${AXON_AGENT_ORG}"
  - name: AXON_AGENT_SERVER_HOST
    value: "${AXON_AGENT_SERVER_HOST}"
  - name: AXON_AGENT_SERVER_PORT
    value: "${AXON_AGENT_SERVER_PORT}"
  - name: AXON_AGENT_TLS_MODE
    value: "${AXON_AGENT_TLS_MODE}"
  - name: AXON_AGENT_KEY
    value: "${AXON_AGENT_KEY}"

Controllers use the same pattern, except KAFKA_NODE_TYPE is set to kraft-controller.
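For reference, the controller pool's env block differs only in the node type; this fragment mirrors the broker example above:

```yaml
env:
  - name: KAFKA_NODE_TYPE
    value: kraft-controller
  # The AXON_AGENT_* entries are identical to the broker pool.
```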

One subtle point is worth calling out. The example directory also includes axonops-config-secret.yaml, but the node-pool manifests in the current development branch still render AxonOps values directly through envsubst. If you prefer to keep credentials in Kubernetes Secrets rather than rendered manifest values, adapt those env: entries to use valueFrom.secretKeyRef.

The exact files behind this section are all in the repository:

Deploy the manifests in the right order

The example cloud README uses the following order:

envsubst < axonops-config-secret.yaml | kubectl apply -f -
envsubst < kafka-logging-cm.yaml | kubectl apply -f -
envsubst < kafka-node-pool-controller.yaml | kubectl apply -f -
envsubst < kafka-node-pool-brokers.yaml | kubectl apply -f -
envsubst < kafka-cluster.yaml | kubectl apply -f -

The important dependency is that the Kafka resource must come after the node pools, because the cluster is declared with node pools already enabled.

Once applied, watch the pods come up:

kubectl get pods -n ${KAFKA_NAMESPACE} --watch

Rack awareness and topology spread

The example manifests are already laid out for multi-zone scheduling.

The Kafka resource sets:

rack:
  topologyKey: topology.kubernetes.io/zone

The controller node pool also uses topologySpreadConstraints against the same topology key. If you are on EKS, GKE, or AKS, that label is usually correct as-is. If your environment uses a different label, inspect the node labels first and change the manifests before deployment:

kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{.metadata.labels}{"\n\n"}{end}' | grep -E "topology|zone|region"

Logging and what AxonOps expects

The example kafka-cluster.yaml writes Kafka logs to /var/log/kafka/server.log. That matters because the AxonOps integration expects Kafka logs to live in the pod filesystem where the agent can collect them.

For KRaft controllers, the wrapper also creates a controller.log symlink, because Strimzi writes controller output to server.log while AxonOps expects a separate controller.log for controller nodes.

That arrangement keeps the logging model simple:

  • Strimzi still starts Kafka normally
  • Kafka writes to /var/log/kafka
  • AxonOps reads the logs from the broker or controller pod

If you want to go deeper on Kafka monitoring and log collection once the cluster is live, the AxonOps Kafka monitoring page shows the operational surface that sits on top of this deployment.

Verifying that AxonOps is connected

The first checks are still Kubernetes checks:

kubectl get pods -n ${KAFKA_NAMESPACE}
kubectl describe pod <pod-name> -n ${KAFKA_NAMESPACE}
kubectl logs <pod-name> -n ${KAFKA_NAMESPACE}

Then inspect the AxonOps processes inside a broker pod:

kubectl exec <pod-name> -n ${KAFKA_NAMESPACE} -- ps aux | grep axon
kubectl exec <pod-name> -n ${KAFKA_NAMESPACE} -- tail -f /var/log/axonops/axon-agent.log

If the pod is healthy but nothing appears in AxonOps, the usual causes are:

  • wrong AXON_AGENT_KEY
  • wrong AXON_AGENT_SERVER_HOST
  • wrong AXON_AGENT_SERVER_PORT
  • wrong AXON_AGENT_TLS_MODE
  • blocked egress from the Kubernetes cluster to AxonOps

Optional Kafka Connect

The example set also includes a kafka-connect.yaml manifest. The development branch README describes Kafka Connect support as beta, with Connect workers reporting to AxonOps using the connect node type.

That is useful if you want AxonOps to cover more than brokers. It means you can keep:

  • brokers
  • consumer groups
  • ACLs
  • topics
  • Schema Registry
  • Connect workers

inside one operational surface rather than splitting them across separate dashboards and scripts.

If you are looking at the broader Kafka operating picture around those domains, the Kafka overview page and Schema Registry feature page are the natural next steps.

Common issues during first setup

Three problems come up repeatedly during first-time Strimzi deployments.

1. Storage class mismatch

If your broker or controller PVCs stay pending, the most common cause is that STRIMZI_BROKER_STORAGE_CLASS or STRIMZI_CONTROLLER_STORAGE_CLASS does not match a real StorageClass in the cluster.

2. Wrong topology key

If controllers or brokers do not spread the way you expect, check the node labels rather than guessing. A correct topologyKey is required for both rack awareness and pod spread behaviour.

3. Volume 0 changed or removed

The current AxonOps wrapper expects to persist agent state under Kafka data volume 0. If you redesign the broker storage layout, keep that requirement in mind.
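In Strimzi's jbod storage spec, that means the broker pool should always keep a volume with id 0, whatever else you add. A minimal sketch of the shape (sizes and claim settings are placeholders):

```yaml
storage:
  type: jbod
  volumes:
    - id: 0                 # required: the AxonOps wrapper persists agent state here
      type: persistent-claim
      size: 100Gi
      deleteClaim: false
    - id: 1                 # additional volumes are fine, as long as id 0 remains
      type: persistent-claim
      size: 100Gi
      deleteClaim: false
```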

Why this split works operationally

The real value in this setup is not that it puts another agent inside a Kafka pod. The value is that it keeps a clean separation of responsibilities.

Strimzi owns:

  • Kafka process lifecycle
  • node pools
  • upgrades
  • storage layout
  • Kubernetes-native reconciliation

AxonOps owns:

  • broker and cluster monitoring
  • logs
  • topic and ACL operations
  • alerting
  • consumer-group visibility
  • the operational surface engineers use each day

That is a stronger model than trying to force Strimzi to become an operations console, and it is also stronger than running Kafka on Kubernetes with no control plane above it.

What you get once the cluster is live

Once the cluster is reporting into AxonOps, engineers are not left staring at raw Kubernetes resources and ad hoc shell commands. They get a proper UI for brokers, consumer groups, topics, ACLs, logs, alerts, and day-2 Kafka work. That is the part that makes this deployment feel much closer to the experience teams expect from MSK or Confluent, while still keeping Kafka open and self-hosted.

AxonOps Kafka brokers view showing a GUI for inspecting and managing a live Kafka cluster
Once the Strimzi cluster is connected, AxonOps gives engineers a proper GUI for managing the cluster rather than leaving them to work entirely from Kubernetes objects and CLI commands.

Managing topics and ACLs with Terraform

If you want Kafka administration to stay in the same Git-driven workflow as the cluster deployment, the AxonOps Terraform provider supports Kafka topics and ACLs directly.

A provider block for a self-hosted AxonOps deployment looks like this:

provider "axonops" {
  org_id           = var.axonops_org_id
  api_key          = var.axonops_api_key
  axonops_host     = var.axonops_host
  axonops_protocol = "https"
  token_type       = "Bearer"
}

A Kafka topic can then be declared like this:

resource "axonops_kafka_topic" "orders" {
  cluster_name       = var.cluster_name
  name               = "orders"
  partitions         = 12
  replication_factor = 3

  config = {
    cleanup_policy      = "delete"
    retention_ms        = "604800000"
    min_insync_replicas = "2"
  }
}

Producer and consumer access can be managed in the same way:

resource "axonops_kafka_acl" "orders_producer" {
  cluster_name          = var.cluster_name
  resource_type         = "TOPIC"
  resource_name         = axonops_kafka_topic.orders.name
  resource_pattern_type = "LITERAL"
  principal             = "User:orders-producer"
  host                  = "*"
  operation             = "WRITE"
  permission_type       = "ALLOW"
}

resource "axonops_kafka_acl" "orders_consumer" {
  cluster_name          = var.cluster_name
  resource_type         = "TOPIC"
  resource_name         = axonops_kafka_topic.orders.name
  resource_pattern_type = "LITERAL"
  principal             = "User:orders-consumer"
  host                  = "*"
  operation             = "READ"
  permission_type       = "ALLOW"
}

resource "axonops_kafka_acl" "orders_consumer_group" {
  cluster_name          = var.cluster_name
  resource_type         = "GROUP"
  resource_name         = "orders-consumer-group"
  resource_pattern_type = "LITERAL"
  principal             = "User:orders-consumer"
  host                  = "*"
  operation             = "READ"
  permission_type       = "ALLOW"
}

That gives you a consistent model across cluster deployment and Kafka administration: Strimzi manages the Kafka runtime in Kubernetes, AxonOps gives engineers a usable UI for the running platform, and Terraform can manage the Kafka objects that sit on top of it.

Conclusion

Self-hosting Kafka should no longer feel like the scary option. For most companies, Strimzi plus AxonOps now covers what they actually need: reliable cluster lifecycle management, monitoring, logs, lag visibility, topic and ACL administration, Terraform-friendly workflows, and a proper interface for running Kafka day to day, all with a materially better cost profile than managed Kafka.

The combination is straightforward:

  • Strimzi manages the cluster declaratively
  • the AxonOps-enabled image injects the Kafka agent at startup
  • the node pools tell AxonOps whether each pod is a controller or broker
  • AxonOps gives engineers the visibility and controls they need once the cluster is live

If you want to build from the published integration rather than reproducing it manually, start with these repository links:

If you want help deploying this pattern in your own environment, talk to us.