Apache Cassandra™ 5: The features that really count.

By Johnny Miller

The open-source distributed database Apache Cassandra is advancing rapidly, with significant enhancements arriving in versions 5.0 and 5.1. These upgrades bring faster performance, expanded functionality, and operational improvements. Here’s a preview of the key features coming to Cassandra 5.0.x and 5.1.x that are likely to make the biggest splash.

Cassandra 5.0.x

Trie Memtables and Trie SSTables

Cassandra Enhancement Proposals CEP-19 and CEP-25 replace the underlying data structures for the in-memory memtables and the on-disk SSTables, respectively, with trie-based ones. The benefits include:

Improved query performance
Higher write throughput
Better wide partition support
Reduced memory consumption
More efficient garbage collection
More efficient SSTable compaction
Less disk usage than B-Trees
Faster lookups, unions, and intersections than with B-Trees
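
As a minimal sketch of how a table might opt in to a trie memtable, the statement below uses the per-table memtable option. The keyspace and table names are hypothetical, and it assumes that cassandra.yaml already defines a memtable configuration named 'trie' backed by the TrieMemtable class. The trie SSTable format is selected separately in cassandra.yaml, not per table.

```cql
-- Hypothetical keyspace/table; assumes the keyspace "demo" already exists.
-- Assumes cassandra.yaml declares a memtable configuration called 'trie'
-- (class_name: TrieMemtable). Without it, this statement is rejected.
CREATE TABLE IF NOT EXISTS demo.sensor_readings (
    sensor_id    uuid,
    reading_time timestamp,
    value        double,
    PRIMARY KEY (sensor_id, reading_time)
) WITH memtable = 'trie';
```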

Vector Search for AI Use Cases

There is no denying that the buzz around Artificial Intelligence (AI) is here to stay. To help empower current and future workloads, Cassandra Enhancement Proposal CEP-30 brings Approximate Nearest Neighbor (ANN) Vector Search. This new functionality builds on Storage Attached Indexes and the new VECTOR CQL type. You can jump into the docs to start using Vector Search today.
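
To make this concrete, here is a small sketch of an ANN query. The keyspace, table, and column names are hypothetical, and the three-dimensional vectors are only for readability; real embeddings are far larger.

```cql
-- Hypothetical schema; assumes the keyspace "demo" already exists.
CREATE TABLE IF NOT EXISTS demo.documents (
    id        int PRIMARY KEY,
    body      text,
    embedding vector<float, 3>   -- the new VECTOR type: element type and dimension
);

-- ANN search requires a Storage Attached Index on the vector column.
CREATE INDEX IF NOT EXISTS documents_embedding_idx
    ON demo.documents (embedding) USING 'sai';

INSERT INTO demo.documents (id, body, embedding)
VALUES (1, 'hello cassandra', [0.1, 0.2, 0.3]);

-- Return the (approximately) nearest rows to the query vector.
SELECT id, body
FROM demo.documents
ORDER BY embedding ANN OF [0.1, 0.2, 0.25]
LIMIT 3;
```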

Storage Attached Indexes

CEP-7 brings us Storage Attached Indexes (SAI), which consume about 14% of the space that SSTable Attached Secondary Indexes (SASI) consumed. During an Apache Cassandra Contributor Meeting, we learned that while querying against secondary indexes is now faster, secondary indexes in Cassandra still work best when querying against a bounded number of partitions. If that’s your use case, start using SAI today!
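
A short sketch of the "bounded number of partitions" pattern: the query below restricts on the partition key and uses the SAI index only to filter within that partition. Table and index names are hypothetical.

```cql
-- Hypothetical schema; assumes the keyspace "demo" already exists.
CREATE TABLE IF NOT EXISTS demo.orders (
    customer_id uuid,
    order_id    timeuuid,
    status      text,
    total       decimal,
    PRIMARY KEY (customer_id, order_id)
);

-- Create a Storage Attached Index on a regular column.
CREATE INDEX IF NOT EXISTS orders_status_idx
    ON demo.orders (status) USING 'sai';

-- The partition key bounds the query to one partition;
-- the index narrows the rows inside it.
SELECT order_id, total
FROM demo.orders
WHERE customer_id = 123e4567-e89b-12d3-a456-426614174000
  AND status = 'SHIPPED';
```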

Unified Compaction Strategy

The Unified Compaction Strategy (UCS) has been added to Cassandra 5.0 to make operators’ lives easier! This new compaction strategy combines the tiered and leveled compaction strategies into a single algorithm whose behavior can be reconfigured at any time. It also opens the door to future improvements such as automatic tuning and smart optimizations. However, although this is a great new feature, I would advise testing it thoroughly before adopting it in production. Compaction is one of the core functions of Cassandra, and if it is misconfigured it can consume a lot of resources at the expense of queries.
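
As an illustration of the "single algorithm, reconfigurable at any time" idea, the sketch below creates a hypothetical table with tiered-style behavior and later switches it to leveled-style behavior by changing one option. The scaling values shown are examples, not recommendations.

```cql
-- Hypothetical table; assumes the keyspace "demo" already exists.
CREATE TABLE IF NOT EXISTS demo.events (
    device_id  uuid,
    event_time timestamp,
    payload    text,
    PRIMARY KEY (device_id, event_time)
) WITH compaction = {
    'class': 'UnifiedCompactionStrategy',
    'scaling_parameters': 'T4'    -- tiered-style behavior, fan factor 4
};

-- Later, move the same table to leveled-style behavior
-- without changing the compaction class.
ALTER TABLE demo.events WITH compaction = {
    'class': 'UnifiedCompactionStrategy',
    'scaling_parameters': 'L10'   -- leveled-style behavior, fan factor 10
};
```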

Added CQL Functions

Cassandra 5.0 adds five math functions (abs, exp, log, log10, and round) and functions that aggregate over collections (count, min/max, sum/avg, keys, and values). This lets us push some data processing to the Cassandra server, which reduces transport time and bandwidth usage. That in turn leaves more headroom for the queries that do need complete datasets for client-side processing, so they can enjoy lower latencies.
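
A brief sketch of both groups of functions against a hypothetical table. The collection functions use the collection_ and map_ prefixed names from the 5.0 CQL function set.

```cql
-- Hypothetical schema; assumes the keyspace "demo" already exists.
CREATE TABLE IF NOT EXISTS demo.measurements (
    id      int PRIMARY KEY,
    delta   double,
    samples list<int>,
    labels  map<text, int>
);

-- Math functions evaluated on the server.
SELECT abs(delta), round(delta), exp(delta)
FROM demo.measurements
WHERE id = 1;

-- Aggregations over the elements of a collection within each row.
SELECT collection_count(samples),
       collection_min(samples),
       collection_max(samples),
       collection_sum(samples),
       collection_avg(samples),
       map_keys(labels),
       map_values(labels)
FROM demo.measurements
WHERE id = 1;
```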

Support for Java 17

Cassandra is written in Java and runs on a Java Virtual Machine (JVM). Support for JDK 17 has been a long time coming and is a welcome change. Beyond the general improvements to Java and the long-term support (LTS) status of the release, it brings new options for JVM garbage collection (GC). It enables the use of low-latency garbage collectors such as ZGC and Shenandoah, which are designed for very short pause times (sub-millisecond in some cases) and smoother performance, and it can deliver higher throughput and lower GC overhead than older Java versions.

Data Masking

Cassandra 5.0 introduces a new dynamic data masking (DDM) capability that allows obscuring sensitive information in database columns. DDM works by defining “masks” on columns that transform the data when it is retrieved through SELECT queries. The underlying data is not changed, only the view presented to the user. Some examples of built-in masks include replacing values with a default, hashing data, or partially redacting values. An obvious advantage of DDM is that it can help prevent accidental data exposure through queries.
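
Here is a minimal sketch of column masks in a table definition. The table and role names are hypothetical, and it assumes dynamic data masking has been enabled in cassandra.yaml.

```cql
-- Hypothetical schema; assumes the keyspace "demo" already exists and
-- that dynamic_data_masking_enabled is set to true in cassandra.yaml.
CREATE TABLE IF NOT EXISTS demo.patients (
    id    timeuuid PRIMARY KEY,
    name  text MASKED WITH mask_inner(1, null),  -- keep the first character, mask the rest
    birth date MASKED WITH mask_default()        -- replace with a fixed default value
);

-- Roles without the UNMASK permission see the masked view of these columns.
SELECT name, birth FROM demo.patients;

-- Hypothetical role that is allowed to read the clear values.
GRANT UNMASK ON TABLE demo.patients TO clinician;
```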

CIDR Authorizer

The CIDR authorizer feature in 5.0 allows restricting user access to the database based on the client IP address range, specified using Classless Inter-Domain Routing (CIDR) notation. It provides a way to control access in Cassandra using network-level permissions, so that only approved IP ranges can reach specific data or operations. This is an important security enhancement for multi-tenant and public cloud deployments where network isolation is not guaranteed.
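
As a rough sketch of how a role might be tied to an IP range, the statements below restrict a hypothetical role to a named CIDR group. This assumes the CIDR authorizer is enabled in cassandra.yaml and that a CIDR group called 'office' has already been defined by an operator (for example with nodetool); both the role and the group name are made up for illustration.

```cql
-- Hypothetical role and CIDR group; the password is a placeholder.
-- Assumes the CIDR authorizer is enabled and the group 'office'
-- has already been mapped to one or more CIDR ranges.
CREATE ROLE analyst
    WITH PASSWORD = 'change_me'
    AND LOGIN = true
    AND ACCESS FROM CIDRS { 'office' };

-- Lift the restriction again for the same role.
ALTER ROLE analyst WITH ACCESS FROM ALL CIDRS;
```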

More Guardrails

The guardrails framework was introduced in Cassandra 4.1 to help avoid configuration and usage pitfalls. Guardrails let operators define soft and hard limits on certain database metrics and restrict the use of certain features. Cassandra 5.0 extends the framework with more guardrails that improve reliability, availability, and user experience, for example by helping to prevent catastrophic mistakes such as dropping production-critical keyspaces or losing data. If you’re looking after Cassandra in production, you will want to get a handle on how guardrails can help you.

Cassandra 5.1.x

General Purpose Transactions

The Apache Cassandra committers introduced lightweight transactions (LWT) in 2013 with the release of Apache Cassandra 2.0.0. A new Paxos implementation (named v2) was implemented in 2022 in Cassandra 4.1 to improve the safety and performance of LWT operations.

Cassandra 5.1 is expected to include an implementation of Accord, rather than Paxos, to provide general purpose transactions, that is, transactions that can operate over any set of keys. To learn more about the implementation, read the Cassandra Enhancement Proposal CEP-15.

```
-- Illustrative only: expected in 5.1 via General Purpose Transactions
BEGIN BATCH
  UPDATE tbl1 SET value1 = newValue1 WHERE partitionKey = k1;
  UPDATE tbl2 SET value2 = newValue2 WHERE partitionKey = k2 AND conditionValue = someCondition;
APPLY BATCH;
```
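
The batch-style snippet above conveys the idea, but the final syntax may look different. As a hedged sketch, CEP-15 proposes a dedicated transaction block along the following lines; the table and column names here are hypothetical, and the syntax may change before release.

```cql
-- Sketch of the transaction syntax proposed in CEP-15 (subject to change).
-- Hypothetical tables: demo.accounts(owner text PRIMARY KEY, balance int).
BEGIN TRANSACTION
  -- Read the rows the decision depends on.
  LET src = (SELECT balance FROM demo.accounts WHERE owner = 'alice');
  LET dst = (SELECT balance FROM demo.accounts WHERE owner = 'bob');

  -- Apply both updates atomically, and only if the condition holds.
  IF src.balance >= 30 THEN
    UPDATE demo.accounts SET balance = src.balance - 30 WHERE owner = 'alice';
    UPDATE demo.accounts SET balance = dst.balance + 30 WHERE owner = 'bob';
  END IF
COMMIT TRANSACTION
```

The point of the example is that the condition and both writes span two different partition keys, which is what lightweight transactions cannot do today.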

Transactional Cluster Metadata

If a Cassandra split-brain has never bitten you, consider yourself lucky. In some cases, where half your nodes see a column and the other half do not, you can get intermittent read or write errors, and possibly data loss, if schema resolution isn’t handled correctly. In other cases, nodes hold divergent views of the ring, again leading to potential data loss.

In 5.1, Cassandra ensures the cluster’s metadata moves forward in lockstep across all nodes. These deterministic updates will make Cassandra operators’ lives easier and give developers more confidence that their schema changes were applied correctly.

Learn more about the other exciting features and improvements coming to Apache Cassandra 5.0 and 5.1 by visiting the project’s prerelease documentation. To learn more about some of the Apache Cassandra committers’ favorite new features, check out this article. And for those who want the highest level of detail on upcoming releases, take advantage of NEWS.txt for the highlights and CHANGES.txt for the complete list of Jira tickets that make their way into each release.

More Information

AxonOps is a one-stop operations tool for Apache Cassandra. Through a single lens you can monitor, maintain, and back up your Cassandra cluster. Learn more about AxonOps and get instant access to your own demo sandbox here.
