Lightweight Transaction Errors
This page covers common issues encountered when using Cassandra lightweight transactions (LWT), including Paxos-related problems and the pitfalls of mixing LWT with non-LWT operations.
LWT and Non-LWT Mixing Issues
Symptoms
- A DELETE operation returns success, but the row still exists when queried
- An UPDATE appears to succeed, but the data remains unchanged
- Data inconsistencies when the same partition is accessed by both LWT and non-LWT operations
- Intermittent failures where operations "sometimes work" depending on timing
Root Cause
Paxos (the consensus protocol underlying LWT) uses its own hybrid-logical clock that is separate from the regular Cassandra timestamp mechanism. When you mix LWT operations (e.g., INSERT ... IF NOT EXISTS) with non-LWT operations (e.g., standard DELETE) on the same data:
- The LWT operation commits with a Paxos-managed timestamp
- The non-LWT operation uses a client-side or coordinator timestamp
- These timestamps are not coordinated, so the non-LWT operation may use a timestamp that appears "older" than the LWT commit
- Cassandra's last-write-wins semantics cause the non-LWT operation to be effectively ignored
This is not a bug—it is inherent to how Paxos ensures linearizability within LWT operations while remaining decoupled from standard writes.
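The interaction described above can be illustrated with a small simulation. This is a minimal sketch, not Cassandra's actual reconciliation code: the `Cell` class, the `reconcile` function, and the timestamp values are all hypothetical, and tie-breaking rules for equal timestamps are omitted.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Cell:
    """A stored value with its write timestamp in microseconds (hypothetical model)."""
    value: Optional[str]  # None models a tombstone, i.e. a delete
    timestamp: int


def reconcile(existing: Optional[Cell], incoming: Cell) -> Cell:
    """Last-write-wins: keep whichever cell carries the higher timestamp."""
    if existing is None or incoming.timestamp > existing.timestamp:
        return incoming
    return existing


# Assumed timestamps: the LWT commit carries a higher timestamp than the
# one assigned to the plain DELETE that is issued afterwards.
lwt_insert = Cell(value="test", timestamp=1_000_500)
plain_delete = Cell(value=None, timestamp=1_000_200)

state = reconcile(None, lwt_insert)
state = reconcile(state, plain_delete)  # arrives later, but looks older

print(state.value)  # prints "test": the row survives the delete
```

The delete is acknowledged, yet it loses the timestamp comparison, which matches the "DELETE returns success, but the row still exists" symptom.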
Diagnostics
Step 1: Identify the operation pattern
Review your application code for patterns like:
-- LWT insert (my_table is a placeholder; TABLE itself is a reserved word in CQL)
INSERT INTO my_table (pk, data) VALUES (?, ?) IF NOT EXISTS;
-- Followed by non-LWT delete
DELETE FROM my_table WHERE pk = ?;
Step 2: Verify with tracing
Enable tracing to see the actual timestamps used:
TRACING ON;
-- Execute your LWT operation
INSERT INTO users (user_id, name) VALUES (123, 'test') IF NOT EXISTS;
-- Execute your non-LWT operation
DELETE FROM users WHERE user_id = 123;
-- Query to verify
SELECT * FROM users WHERE user_id = 123;
TRACING OFF;
Examine the trace output for timestamp discrepancies between operations.
Step 3: Check for timing-dependent behavior
If adding a delay between operations changes the outcome, you have confirmed an LWT/non-LWT mixing issue:
import time

# `session` is assumed to be a connected cassandra-driver Session.
# If this pattern shows different results with and without the sleep,
# you have an LWT mixing issue.
session.execute("INSERT INTO t (pk) VALUES (1) IF NOT EXISTS")
time.sleep(1)  # Remove this to reproduce the issue
session.execute("DELETE FROM t WHERE pk = 1")
Resolution
Option 1: Use LWT consistently (RECOMMENDED)
If data is inserted with IF NOT EXISTS, delete it with IF EXISTS:
-- Insert with LWT
INSERT INTO users (user_id, name) VALUES (123, 'test') IF NOT EXISTS;
-- Delete with LWT (consistent)
DELETE FROM users WHERE user_id = 123 IF EXISTS;
This ensures both operations go through Paxos and use the same clock domain.
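From application code, a conditional delete also reports whether it took effect, which a plain delete cannot do. The helper below is a minimal sketch that assumes the DataStax Python driver; `delete_user_if_exists` is a hypothetical name, and `session` is assumed to be an already connected Session for the keyspace that holds `users`.

```python
def delete_user_if_exists(session, user_id):
    """Delete a user through Paxos and report whether a row was removed.

    `session` is assumed to be a connected cassandra-driver Session;
    the `users` table and `user_id` column come from the example above.
    """
    result = session.execute(
        "DELETE FROM users WHERE user_id = %s IF EXISTS",
        (user_id,),
    )
    # For conditional statements the driver exposes the [applied] column
    # of the response as ResultSet.was_applied.
    return result.was_applied
```

A return value of False means the row was not there when the Paxos round ran, so the caller can tell "deleted" apart from "nothing to delete".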
Option 2: Avoid LWT if not required
If you do not need the linearizability guarantees of LWT, use standard operations throughout:
-- Standard insert (no IF condition)
INSERT INTO users (user_id, name) VALUES (123, 'test');
-- Standard delete
DELETE FROM users WHERE user_id = 123;
Option 3: Accept the trade-offs (NOT RECOMMENDED)
If you understand the risks and cannot change your data model:
- Add application-level delays between LWT and non-LWT operations (fragile, timing-dependent)
- Implement verification loops that retry operations if the expected state is not achieved
- Accept that some operations may silently fail
Delays Are Not a Solution
Adding sleep() or delays between operations is not a reliable solution. The required delay varies with cluster load, network conditions, and replication factor, so what works in testing may fail in production.
Prevention
- Establish LWT boundaries: Decide at the table or partition level whether data will be managed with LWT. Document this decision.
- Use LWT for all operations on LWT-managed data: If INSERT ... IF NOT EXISTS is used, also use UPDATE ... IF and DELETE ... IF EXISTS.
- Code review for mixing: Add linting or code review checks to catch LWT/non-LWT mixing on the same tables.
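- Consider Accord: Accord transactions (CEP-15), which are being developed for a release after Cassandra 5.0, may provide better semantics for complex transactional patterns. Check the release notes for your version before relying on them.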
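A lint check for mixing can start as a simple scan over the CQL strings an application issues. The following is a rough heuristic sketch rather than a CQL parser: `find_mixed_tables` and both regular expressions are hypothetical, and the `IF` detection would also match the word inside a string literal.

```python
import re
from collections import defaultdict

# Matches the target table of INSERT, UPDATE, and DELETE statements.
_TABLE_RE = re.compile(
    r"^\s*(?:INSERT\s+INTO|UPDATE|DELETE\b.*?\bFROM)\s+([\w.]+)",
    re.IGNORECASE | re.DOTALL,
)
# A conditional write contains an IF clause (IF EXISTS, IF NOT EXISTS, IF col = ...).
_LWT_RE = re.compile(r"\bIF\b", re.IGNORECASE)


def find_mixed_tables(statements):
    """Return sorted names of tables written by both LWT and non-LWT statements."""
    kinds = defaultdict(set)
    for stmt in statements:
        match = _TABLE_RE.match(stmt)
        if match is None:
            continue  # not a write statement (for example a SELECT)
        table = match.group(1).lower()
        kinds[table].add("lwt" if _LWT_RE.search(stmt) else "plain")
    return sorted(t for t, seen in kinds.items() if seen == {"lwt", "plain"})


statements = [
    "INSERT INTO users (user_id, name) VALUES (?, ?) IF NOT EXISTS",
    "DELETE FROM users WHERE user_id = ?",
    "UPDATE accounts SET balance = ? WHERE id = ? IF balance > 0",
]
print(find_mixed_tables(statements))  # ['users']
```

Here `users` is flagged because it receives both a conditional insert and a plain delete, while `accounts` is only written conditionally.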
Paxos Contention Errors
Symptoms
- High latency on LWT operations
- CasWriteTimeoutException errors
- Increased Paxos round trips visible in tracing
- LWT throughput significantly lower than expected
Root Cause
When multiple clients attempt LWT operations on the same partition simultaneously, Paxos contention occurs. Each conflicting operation must retry with a higher ballot number, causing:
- Multiple round trips per operation
- Increased latency
- Potential timeouts under heavy contention
Diagnostics
# Check CAS metrics (shell command, run on a Cassandra node)
nodetool tpstats | grep -i paxos
-- Enable tracing to see contention (run in cqlsh)
TRACING ON;
UPDATE accounts SET balance = 100 WHERE id = 1 IF balance > 0;
TRACING OFF;
Look for multiple PREPARE/PROMISE cycles in the trace output.
Resolution
- Reduce contention: Redesign the data model to spread hot partitions
- Increase timeouts: Adjust cas_contention_timeout in cassandra.yaml (named cas_contention_timeout_in_ms before Cassandra 4.1) if timeouts are premature
- Implement backoff: Add exponential backoff in application retry logic
- Batch related operations: Use LWT batches to combine multiple operations on the same partition
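The backoff item above could look like the following sketch. It is driver-agnostic: `retry_with_backoff` and its parameters are hypothetical names, and the exception types passed as `retryable` are the caller's choice (with the Python driver, `cassandra.WriteTimeout` would be a likely candidate).

```python
import random
import time


def retry_with_backoff(operation, retryable=(Exception,), max_attempts=5,
                       base_delay=0.05, max_delay=2.0, sleep=time.sleep):
    """Call `operation` until it succeeds, backing off exponentially with jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except retryable:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Full jitter: wait a random time in [0, min(max_delay, base * 2^attempt)]
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, ceiling))
```

Randomizing the delay keeps competing clients from retrying in lockstep and colliding again on the same partition. Because a timed-out LWT may already have been applied, `operation` should itself verify state before re-issuing the write, as described under LWT Timeout Behavior.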
LWT Timeout Behavior
Symptoms
- WriteTimeoutException during LWT operations
- Uncertainty about whether the operation was applied
- Duplicate data after retry attempts
Root Cause
When an LWT times out, the operation may have partially completed (Paxos PREPARE succeeded but PROPOSE failed, or PROPOSE succeeded but acknowledgment was lost). The outcome is undefined.
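Resolution
After a timeout, you MUST read to verify state before retrying. Perform the read at SERIAL (or LOCAL_SERIAL) consistency so that any in-progress Paxos round is completed before the result is returned: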
try {
    session.execute(lwtStatement);
} catch (WriteTimeoutException e) {
    // The write may or may not have been applied; read to determine actual state
    Row current = session.execute(readStatement).one();
    if (current == null || !expectedState(current)) {
        // Expected state is not present, so it is safe to retry
        session.execute(lwtStatement);
    }
}
See Failure Semantics for detailed timeout handling guidance.
Related Documentation
- Lightweight Transactions Reference - Complete LWT documentation
- Consistency Levels - Understanding SERIAL and LOCAL_SERIAL
- Troubleshooting Overview - General troubleshooting framework