Monitoring Cassandra: The Cost of Collecting Metrics

Author: Joaquin Casares

After a brief stint away from the Apache Cassandra community, I’m thrilled by the transformative advancements and innovations that have recently shaped the Apache Cassandra ecosystem. Notably, my former colleagues at The Last Pickle, a longstanding pillar within the Cassandra community, were acquired by DataStax. I was also excited to catch up on the entrepreneurial ventures of Hayato Shimizu and Johnny Miller. Together they founded two companies: first their consulting firm Digitalis.io, and more recently AxonOps, co-founded with another former DataStax employee, John Glendenning, which focuses on managing Apache Cassandra clusters.

I first started working with Apache Cassandra in 2011 at DataStax. On my third day, I was restarting servers for Netflix as a Support Engineer. During my career I’ve also been a Software Engineer-in-Test, Demo Engineer, Internal API Engineer, and Consultant, all focused on Cassandra, and across those roles I frequently worked on monitoring and metrics in the service of general cluster visibility. At The Last Pickle, I worked with Alain Rodriguez to create the first version of the official DataDog dashboards, which I presented at DataDog Summit 2017. A few months later, I presented enhanced Prometheus dashboards at Data Day Austin 2018.

The Metrics of Metric Collectors

Over the years I’ve gained a respect for dashboards that are easy to read and understand. I was the primary point of contact for 200+ organizations during my time at DataStax, and at The Last Pickle I most notably worked closely with T-Mobile when I was not diagnosing 800-node clusters spanning at least 6 geo-distributed, split-brained data centers. Across those clusters, teams, and industries, I’ve noticed that the teams with well-designed dashboards were always the quickest to resolve issues. But I’ve also seen cases where the culprit for a near-meltdown was the very metrics system that was entrusted to provide visibility during times of crisis.

Ever since, the intricacies of an efficient, non-intrusive metrics system have been in the back of my mind. I’ve encountered scenarios where the monitoring system became such a destructive hindrance that the choice came down to running blind or shutting the system down.

The more I looked into AxonOps, the more the design philosophy of their agent intrigued me. The emphasis on a lightweight, non-intrusive agent capable of seamless integration with the Cassandra process promised a refreshed approach to metric collection.

DataStax open-sourced MCAC, and I had used the JMX Exporter’s default Cassandra configuration during my time at The Last Pickle, so I figured I’d run a quick test comparing the AxonOps agent, the MCAC exporter, and the JMX Exporter.

Test Environment

Infrastructure

3 Cassandra clusters of 3 nodes each were spun up on AWS as m7i.large instance types. This way, all tests could be run in parallel for quick data collection.

A screenshot of a terminal session with 9 SSH connections to different machines, each column being a unique cluster.

This setup can be seen in early startup production environments and internal projects. From experience, I’ve seen a startup’s traffic-serving production cluster last about a year without issues on m1.large instance types before needing to upgrade to m1.xlarge instance types.

Cassandra Version

Cassandra 4.1.3 was used to ensure the agents were tested against all recent bug fixes and improvements.

Cassandra 4.1.3 was the latest release at the time this test was conducted.

Java Version

OpenJDK 11.0.2 was used for all Cassandra nodes in this test.

Metric Collection

CPU and network metrics were collected at the Linux level to ensure reporting accuracy. The following two commands were used:

# cpu metrics
dstat --time --load --cpu-adv --top-cputime --cpu-use

# network metrics (AxonOps reports to agents.axonops.cloud;
# the local Prometheus server is located at 10.0.1.6)
iftop -B -f "dst host agents.axonops.cloud"
iftop -B -f "dst net 10.0.1.6/32"

Test Configurations

AxonOps Agent Configurations

I used the AxonOps Starter plan, which supports monitoring up to 6 Cassandra nodes for free. The axon-server.hosts and axon-agent.org values were supplied by the Agent Configuration wizard, which generates a preconfigured axon-agent.yml ready to be copied and pasted.
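
For reference, the relevant portion of the pasted file looked roughly like the sketch below; the organization name is a placeholder, and everything else was left exactly as the wizard generated it.

# axon-agent.yml (rough sketch of the wizard-generated values)
axon-server:
  hosts: "agents.axonops.cloud"   # AxonOps endpoint supplied by the wizard

axon-agent:
  org: "my-org"                   # placeholder; the wizard fills in your organization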

MCAC Exporter Configurations

All MCAC configurations were left at their defaults, except that collectd.conf’s Interval and Timeout settings were lowered to 5 and metric-collector.yaml’s metric_sampling_interval_in_seconds value was changed from 30 to 5 to match AxonOps’ default scrape resolution.
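
In other words, the only settings I touched were roughly the following; everything else stayed at the values shipped with MCAC.

# collectd.conf
Interval 5
Timeout  5

# metric-collector.yaml
metric_sampling_interval_in_seconds: 5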

You can find the 1-line setup guide I followed in the README.

JMX Exporter Configurations

The JMX Exporter had a simple installation guide and an example file that was copied into the local config.yaml. This was the simplest exporter to set up!
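
For context, installation boils down to attaching the exporter to the Cassandra JVM. A minimal sketch, where the jar path and port 7070 are my own illustrative choices rather than project defaults, looks like this in cassandra-env.sh:

# cassandra-env.sh -- attach the JMX Exporter as a java agent
# (jar location and port 7070 are illustrative, not defaults)
JVM_OPTS="$JVM_OPTS -javaagent:/opt/jmx_prometheus_javaagent.jar=7070:/etc/cassandra/config.yaml"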

With this exporter, each time Prometheus scrapes the /metrics endpoint the agent collects and reports metrics on demand. This means the scraping frequency is controlled from the Prometheus configuration, and I also have to be careful about how often metrics are scraped, since too many requests could overload the JVM.

Prometheus Configurations

Within the local prometheus.yml, the default scrape_interval value of 1 minute was changed to scrape every 5 seconds to match AxonOps’ default granularity. As a result of an updated scrape_interval, scrape_timeout was changed from the default of 10 seconds to 2 seconds to ensure scrape requests wouldn’t overlap, causing undue load on Prometheus or Cassandra. All other configurations stayed the same.

While performing the baseline tests, scrape_timeout was increased to match the scrape_interval of 5 seconds to avoid dropping metrics, after a timeout of 2 seconds proved to be too short for the JMX Exporter.
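
The resulting global section of prometheus.yml looked roughly like this; only the two values discussed above differ from the defaults.

# prometheus.yml -- global scrape settings used for these tests
global:
  scrape_interval: 5s   # default is 1m; lowered to match the AxonOps granularity
  scrape_timeout: 5s    # initially 2s; raised to 5s for the JMX Exporter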

After extending the timeout, the JMX Exporter routinely took close to 5 seconds to serve its metrics on an idle cluster, increasing the risk of dropped metrics when the cluster is under load.

Prometheus takes 4.6 seconds to scrape idle nodes monitored by the JMX Exporter.

The MCAC exporter is noteworthy for having a surprisingly quick scrape time, about 20 times faster than that of the JMX Exporter, thanks to the local cache MCAC uses.

Baseline Metrics

In preparation for the upcoming tests, I wrote a quick Python script that creates one keyspace/table pair per iteration of a for loop. I figured 100 tables would generate enough metrics for a noteworthy comparison.
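
The script was little more than a loop over two CQL statements. A minimal sketch, assuming the DataStax cassandra-driver, the default cassandra/cassandra credentials, and a hypothetical contact point, is shown below; the keyspace and table names are also illustrative.

# create_tables.py -- minimal sketch; contact point and names are illustrative
from cassandra.auth import PlainTextAuthProvider
from cassandra.cluster import Cluster

auth = PlainTextAuthProvider(username="cassandra", password="cassandra")
cluster = Cluster(["10.0.1.4"], auth_provider=auth)  # hypothetical node IP
session = cluster.connect()

for i in range(100):
    # one keyspace/table pair per iteration
    session.execute(
        f"CREATE KEYSPACE IF NOT EXISTS ks_{i} WITH replication = "
        "{'class': 'SimpleStrategy', 'replication_factor': 3}"
    )
    session.execute(
        f"CREATE TABLE IF NOT EXISTS ks_{i}.tbl (id uuid PRIMARY KEY, value text)"
    )

cluster.shutdown()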

While having a single table across 100 keyspaces is not normally seen in production, having around 100 tables is commonplace. While having a large number of keyspaces adds very little overhead to Cassandra and its metrics collection, it’s typically a large number of tables, even when empty, that can cause memory pressure issues for an idle Cassandra cluster.

Data Transfer over 12 Hours

After 100 tables were created on each of the 3 clusters, I left the idle clusters on for 12 hours to collect the amount of data each metrics collector exported from each node. AxonOps shined by transferring only 5 MB in an hour and 60 MB over 12 hours. The JMX Exporter transferred 7.5 MB in an hour and 92 MB over 12 hours, consuming 53% more bandwidth than the AxonOps agent.

MCAC, however, was a data firehose, sending 19.2 GB of data over 1 hour and 236 GB over 12 hours when monitoring 100 tables. By comparison, this is about 4,000x more data than what the AxonOps agent exports. This bandwidth would ideally be used to serve public requests, not gather internal health metrics while potentially increasing bandwidth charges.

The median amount of data transferred off each cluster. Each Cassandra cluster is identical, except for the metrics collector being used.
The raw iftop output after 12 hours of monitoring 1000/100/100 tables, respectively.

After capturing network traffic with the command tcpdump -A port 9103, it quickly became apparent that the MCAC exporter’s web traffic is not compressed.

While nginx can be placed in front of MCAC’s web server, nginx is not part of the default MCAC setup. There’s no mention of enabling gzip, or any other compression, in the MCAC README, the packaged collectd.conf, or the metric-collector.yaml.
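
If you do need MCAC and bandwidth matters, one option is to put a small reverse proxy in front of the exporter and compress the scrape responses there. A hedged sketch follows; the listen port 9104 and file path are my own choices, and Prometheus would then scrape 9104 instead of 9103.

# /etc/nginx/conf.d/mcac-metrics.conf -- hypothetical gzip proxy in front of MCAC
server {
    listen 9104;                           # Prometheus scrapes this port instead of 9103

    location /metrics {
        gzip on;
        gzip_types text/plain;             # the Prometheus exposition format is plain text
        gzip_proxied any;                  # compress responses to proxied scrape requests
        proxy_pass http://127.0.0.1:9103;  # MCAC's uncompressed exporter endpoint
    }
}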

It’s also worth noting that Prometheus sends an Accept-Encoding: gzip header with its scrape requests; the JMX Exporter honors this request, while the collectd instance used by the MCAC exporter does not.
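
You can spot-check this behavior without tcpdump by comparing downloaded sizes with and without gzip negotiation; the node address below is a placeholder.

# compare downloaded bytes with and without gzip negotiation (node IP is a placeholder);
# if the second number is not much smaller, the endpoint is ignoring the gzip request
curl -s -o /dev/null -w '%{size_download}\n' http://10.0.1.4:9103/metrics
curl -s -o /dev/null -w '%{size_download}\n' -H 'Accept-Encoding: gzip' http://10.0.1.4:9103/metrics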

Users of the MCAC exporter should be aware of this large bandwidth requirement, which could cut into the bandwidth available to serve production workloads.

CPU Load vs Number of Tables

The plan for the following measurements was to increase the number of tables, wait 15 minutes, take CPU metrics, and then increase the number of tables further.
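
In practice this was just a loop around the table-creation script with a pause between steps. A rough sketch follows; the --count flag and step sizes are assumptions rather than part of the script shown earlier.

# hypothetical ramp loop: add tables, settle, then record the load averages
for count in 100 200 300 400 500 1000; do
    python3 create_tables.py --count "${count}"   # assumed flag; grow to ${count} tables
    sleep 900                                     # wait 15 minutes before sampling
    echo "tables=${count} $(uptime)" >> load-vs-tables.log
done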

However, I never took the nodes monitored by the JMX Exporter past 200 tables. It felt uncomfortable sending a test load to those nodes since the CPU usage was already far too high for an idle Cassandra cluster.

The JMX Exporter could barely monitor 100 tables, while the MCAC exporter could comfortably monitor 300 idle tables within the test environment. The AxonOps agent, on the other hand, could easily monitor more than 1000 tables while causing a 15-minute average load of around 0.33.

Most clusters have 100 tables or fewer, but clusters with 1000 tables or more are not unheard of. I have personally worked in environments with these edge cases, and being able to effectively monitor them with AxonOps is truly impressive. While this test was run on nodes with 2 CPUs, the results suggest the JMX Exporter may not be ideal for some enterprise-level configurations without proper performance testing.

It’s worth mentioning that the nodes using the MCAC exporter were experiencing CPU spikes from the collectd process, which is listed as a known issue on DataStax’s website. Since these spikes occurred outside the JVM, the Cassandra process performed better than on the nodes where the JMX Exporter runs inside the JVM.

The raw dstat outputs for the three clusters with 1000/100/100 tables, respectively.

For the rest of the tests, I lowered the number of tables on the cluster monitored by MCAC to 100, matching what the JMX Exporter was monitoring, while leaving the cluster monitored by AxonOps at 1000 tables to make load fluctuations easier to see.

Data Transfer over 24 Hours

I decided to run another test to see how much data is transmitted over 24 hours of uptime now that AxonOps is monitoring 1000 tables. This test would give me a chance to reconfirm my original measurements of MCAC’s exported data size.

AxonOps also remained remarkably lightweight on network bandwidth, using 14% of the bandwidth the JMX Exporter uses and 0.38% of the bandwidth of the MCAC exporter, all while still exporting metrics and logs for 10x more tables than the clusters being monitored by the MCAC and JMX Exporters.

The median amount of data egress 1 hour and 24 hours after tables were created on idle Cassandra nodes.

If bandwidth is a concern, I wouldn’t advise using the MCAC exporter in a production environment until compression is enabled by default, or until you’ve added your own compression layer. The sole consumer of bandwidth off any database system should always be the data, not operational visibility tools.

Client-Side Overhead

Introducing visibility into your critical infrastructure at the expense of request throughput is also not ideal. Next, we’ll run 3 workloads with the standard cassandra-stress tool to compare performance: an all-writes workload, an all-reads workload, and a mixed workload.
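
As a reference point, a single write pass with cassandra-stress looks roughly like this; the node address and thread count below are placeholders, and the exact mixed-workload commands appear later.

# 1M-write pass (node address and thread count are placeholders)
tools/bin/cassandra-stress write n=1000000 \
	-node 10.0.1.4 \
	-mode native cql3 user=cassandra password=cassandra \
	-rate threads=2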

Note that the AxonOps agent continues to be disadvantaged by having to monitor 1000 tables while the other exporters are only monitoring 100 tables.

Simple Workload: 1 Million Writes

Before we can test the runtime and throughput of 1 million reads, we must populate the database with at least 1 million writes. After a few runs to warm up the Cassandra process, the following results were captured.

Even though the AxonOps agent is monitoring 1000 tables, the cluster can process 1M writes without hitting an average CPU load of 1.
The cluster monitored by AxonOps was able to produce the highest write throughput since it was not bound by CPU like the other two clusters.

For 1 million writes, I was surprised to find the JMX Exporter dramatically lowering the write throughput for the cluster.

The cluster monitored by AxonOps completed 1 million writes about twice as fast as the cluster being monitored by the JMX Exporter.

I ran all commands at the same time from a node with 4 CPUs. Technically, it’s not a perfectly fair assessment, since whichever process finishes last could gain a slight advantage in overall metrics by consuming all remaining CPU resources without contention. In practice, this possible advantage did not seem to materialize.

Simple Workload: 1 Million Reads

Once there was around 1 GB of data on each Cassandra node, I began testing the clusters’ read throughput.

The cluster monitored by AxonOps kept its average CPU load below 1.

Usually, writing to Cassandra is limited by the CPU, while reading is limited by disk throughput. So, I thought the read speed would be about the same for the Cassandra clusters tracked by the AxonOps agent and MCAC exporter since 1 million read operations would apply disk pressure rather than CPU pressure.

However, remember that the MCAC exporter writes 5 GB of metrics to disk by default. This explains its relatively fast metric scrape times of 200 ms, but this on-disk metrics cache might also be the reason for Cassandra’s lower read throughput, since two processes are now making disk I/O requests.

The cluster monitored by AxonOps produced the highest read-throughput.

These two simple workloads highlighted that within our test environment, the JMX Exporter:

  • could not monitor 100 tables without degrading performance
  • failed to serve computed metrics

The scrape_timeout of 5 seconds ended up being too short, and all requests for metrics would time out. Even though Prometheus was no longer receiving metrics from the JMX Exporter, it kept scraping the exporter, and the Cassandra cluster kept paying the cost of producing metrics that would ultimately be lost behind a timeout.

The cluster monitored by AxonOps was able to complete 1M reads in about half the time as the cluster monitored by the JMX Exporter.

From experience, learning that your metrics are no longer being ingested during a P0 outage is highly stressful. Given that the JMX Exporter is doubling the read and write latency while also not providing any insights into the cluster, I definitely won’t be installing the JMX Exporter on my cluster.

Mixed Workload: 5M Writes and 1.1M Reads

For this final test to be more inclusive, I decided to lower the number of tables that the MCAC and JMX Exporters have to monitor to 50 tables. I left the AxonOps agent monitoring 1000 tables since it didn’t seem like it would be an issue.

I used these commands to generate the read and write load from a 4-vCPU node:

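# T = client threads, N = operation count, A = last octet of the target node's IP
# (set per target cluster; left blank here)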
T=2; N=5000000; A=
tools/bin/cassandra-stress \
	write \
	n=${N} \
	-node 10.0.1.${A} \
	-mode native cql3 \
	user=cassandra password=cassandra \
	-rate threads=${T} \
	-log file=write-${A}-$(date +%s).log \
	-graph file=write-${A}-$(date +%s).html

T=2; N=1100000; A=
tools/bin/cassandra-stress \
	read n=${N} \
	-node 10.0.1.${A} \
	-mode native cql3 \
	user=cassandra password=cassandra \
	-rate threads=${T} \
	-log file=read-${N}-$(date +%s).log \
	-graph file=read-${N}-$(date +%s).html

Each of the 4 vCPUs rarely surpassed 70% utilization, indicating zero CPU contention for the load generators.

It was surprising to see that within 7 minutes, 1 GB of data was transferred by the MCAC exporter while only 4 MB were sent by the AxonOps agent.

Within 7 minutes, the MCAC exporter had exported 1GB of data.

The mixed workload finally caused the cluster monitored by AxonOps to reach an average CPU load above 1, indicating slight CPU contention.

Slight CPU contention is now seen on the cluster monitored by AxonOps, while CPU contention is twice as high on the other two Cassandra clusters.

The cluster monitored by AxonOps continues to outshine the throughput of the other two clusters in mixed workload scenarios.

Given that the JMX Exporter made both reads and writes slower on their own, it comes as no surprise that it taxed mixed-workload throughput just as heavily.

Testing Artifacts: Data Load

Throughout the test setup process, the load tests were started and manually terminated multiple times. Since the cluster monitored by AxonOps accepted a higher write throughput, it ended up holding almost twice as much data as the cluster monitored by the JMX Exporter. So even while holding twice as much data and housing 20x as many tables, the cluster monitored by AxonOps still processed twice as many reads and writes as the clusters monitored by the MCAC and JMX Exporters.

The cluster monitored by AxonOps housed about twice as much data as the other two clusters due to its high write throughputs.

Honorable Mentions

DataDog

I also considered testing the DataDog agent. I decided not to look further into DataDog since its agent would only collect node-level data and not the table-level data that is useful for quickly spotting issues with resource-hungry tables and workloads. While I trust DataDog’s reliability and ability to offer org-wide metrics, if Cassandra lies in your critical path, you would ideally also capture Cassandra table-level metrics to be able to quickly and effectively diagnose P0s.

DataDog’s official screenshot for their Cassandra integration. Notice the lack of Keyspace/Table filters on the top-left.

Instaclustr

Instaclustr was not evaluated at this time since multiple attempts to create 100 tables failed with BufferOverflowExceptions within the Cassandra logs. When the JMX and MCAC Exporters were monitoring 100 tables, the CPU load was higher but the metrics continued reporting. When the Instaclustr exporter encountered the BufferOverflowException, metrics reporting stopped completely.

Further tests showed that the Instaclustr agent was able to monitor at least 50 tables within the test environment, but each test failed with the same Cassandra-side BufferOverflowException before 75 tables were created.

Interestingly enough, Instaclustr’s documentation has no recommendation to limit the number of tables per cluster, while their cassandra-exporter documentation claims to support 1000+ tables exporting 174,000 metrics.

The nodes consumed about 1.5 GB of the allotted 8 GB heap, the CPU consumption was low, and they were all running the same Java versions as the rest of the test clusters.

Conclusion

It should first be noted that AxonOps had a few disadvantages throughout the tests as follows:

  • The AxonOps agent was:
    • Monitoring 1000 tables, instead of 50 or 100 tables
  • The Cassandra cluster that was monitored by the AxonOps had:
    • 1000 tables, instead of 50 or 100 tables
    • 50-100% more data load, due to a higher write throughput in timed tests

The fact that the AxonOps agent still managed to outperform the MCAC exporter and the JMX Exporter is impressive.

Even though the cluster monitored by the JMX Exporter only had 50 tables, the cluster monitored by the AxonOps agent was still able to:

  • process 2x as many writes
  • perform 40% quicker reads
  • continue providing updated metrics every 5 seconds

The write throughput of the cluster monitored by the MCAC exporter looks comparable to that of the AxonOps-monitored cluster when the number of tables being monitored is kept low. However, the following issues are noteworthy when making final comparisons, since the MCAC exporter:

  • cannot monitor more than 300 keyspace/table pairs, within this test environment
  • sends 220-260x more data than the AxonOps agent
  • shows a 20% drop in read-throughput, likely due to the on-disk metrics caching enabled by default

These results speak to the lightweight, unobtrusive nature of the AxonOps agent, which has been built from the ground up with metric collection optimizations, compression, resource usage, and zero-configuration installation at the center of its design.

I’m looking forward to using AxonOps on my current Cassandra cluster, and the next. It’s nice not having to trade visibility for lower throughput and a higher bandwidth bill.

The AxonOps agent is the clear winner in this specific test.
