Busy Dev's Checklist for Lock-Free Data Structures in C++

Lock-free data structures are a powerful tool in system programming, but they come with a steep learning curve. This checklist is for the busy developer who needs a practical, no-nonsense guide to designing, implementing, and maintaining lock-free code in C++. We'll cover the core concepts, common pitfalls, and decision criteria to help you avoid the most expensive mistakes.

1. Where Lock-Free Shows Up in Real Work

Lock-free data structures appear in high-performance systems where contention is high and latency must be predictable. Think of network routers processing millions of packets per second, real-time trading systems, game engines, or database transaction logs. In these environments, a lock can become a bottleneck that destroys throughput or introduces unacceptable jitter.

The fundamental promise is that no thread ever blocks waiting for another. Instead, threads retry operations until they succeed. This eliminates problems like priority inversion, convoying, and deadlock that plague lock-based designs. But the trade-off is complexity: you trade simpler reasoning for intricate atomic operations and memory ordering constraints.

We've seen teams adopt lock-free queues for message passing between cores, lock-free hash tables for caches, and lock-free stacks for memory pools. A common pattern is the single-producer, single-consumer (SPSC) queue, which is relatively simple and widely used. Multi-producer, multi-consumer (MPMC) variants are harder to get right.

Before you start, ask yourself: is the contention really that high? Profiling often reveals that locks are not the bottleneck. If your critical section is short and contention is low, a simple mutex may be faster and far easier to maintain. Lock-free is a tool, not a trophy.

Real-World Scenario: Packet Processing

In a network function virtualization (NFV) project, the team replaced a mutex-protected ring buffer with a lock-free MPMC queue. The mutex version showed 30% CPU utilization in spin-wait under high load. After switching to a lock-free design, CPU dropped to 5% and throughput doubled. However, the lock-free version took three iterations to get correct—the first two had subtle bugs that only appeared under extreme load.

This illustrates the key point: lock-free can deliver huge wins, but only if you invest in careful design, testing, and review.

2. Foundations That Are Often Misunderstood

Before writing any lock-free code, you need a solid grasp of three concepts: atomic operations, memory ordering, and the ABA problem. Each is a common source of bugs.

Atomic Operations

In C++, std::atomic provides the building blocks: load, store, compare-and-exchange (CAS), fetch-add, etc. These operations are indivisible—no thread sees a partial update. But atomicity alone is not enough. You must also consider memory ordering.

Memory Ordering

C++ offers six memory ordering options: relaxed, consume, acquire, release, acq_rel, and seq_cst. The default is seq_cst, which is easiest to reason about but can be expensive on some architectures. Many lock-free algorithms use acquire and release semantics to pair operations correctly.

A common mistake is using relaxed everywhere because it's fast. This breaks happens-before relationships and leads to data races that are nearly impossible to reproduce. Always start with seq_cst during prototyping, then relax only after proving correctness with a formal model or exhaustive testing.

The ABA Problem

The ABA problem occurs when a thread reads a value A, another thread changes it to B and then back to A, and the first thread's CAS succeeds even though the structure has changed. This is a classic pitfall in lock-free linked lists and stacks.

Solutions include tagged pointers (storing a version counter alongside the pointer) or using hazard pointers to prevent reclamation of nodes in use. We'll cover both later.

3. Patterns That Usually Work

Several lock-free patterns have been proven in production. Here are the ones we recommend for most use cases.

Lock-Free Stack (Treiber Stack)

The Treiber stack uses a single CAS on the head pointer. It's simple and works well under low to moderate contention. The main challenge is ABA: you need a tagged pointer or a safe memory reclamation scheme.

Lock-Free Queue (Michael-Scott)

The Michael-Scott queue is a linked-list-based MPMC queue. It uses two CAS operations per push/pop and requires hazard pointers or epoch-based reclamation. It's well-studied and has many production implementations.

Lock-Free Hash Table (Split-Ordered Lists)

For concurrent hash tables, split-ordered lists provide a lock-free design that handles resizing. The idea is to store entries in a linked list ordered by a recursive ordering of keys. This is more complex but avoids the need for global locks during resizing.

Read-Copy-Update (RCU)

RCU is a synchronization mechanism that allows readers to proceed without locks, while writers make a copy and update atomically. It's particularly effective for read-mostly workloads. The Linux kernel uses RCU extensively. C++ doesn't have a standard RCU implementation, but libraries like folly provide one.

Hazard Pointers

Hazard pointers are a memory reclamation scheme where each thread announces which nodes it is accessing. Before freeing a node, a thread checks that no other thread has it as a hazard. This is a common companion to lock-free data structures.

4. Anti-Patterns and Why Teams Revert

Many teams try lock-free, hit bugs, and revert to locks. Here are the most common anti-patterns we've seen.

Overusing CAS Without Backoff

Spinning on CAS in a tight loop can cause cache line bouncing and degrade performance. Always use an exponential backoff or a yield after a few retries. Some implementations use a pause instruction on x86.

Ignoring Memory Reclamation

Leaving memory leaks or using garbage collection in a systems context is a recipe for disaster. Without proper reclamation, you'll exhaust memory under load. Use hazard pointers, RCU, or epoch-based reclamation from the start.

Assuming Sequential Consistency Is Free

On x86, seq_cst is relatively cheap because the hardware already provides strong ordering. On ARM or PowerPC, it can be expensive. Profile on your target architecture. If you need to relax ordering, use a formal model or a tool like CDSChecker to verify correctness.

Writing Your Own Algorithm from Scratch

Unless you're a researcher, don't invent new lock-free algorithms. Use well-known ones from the literature or proven libraries. The chance of a subtle bug in a novel algorithm is high. Even experts get it wrong.

Not Testing Under Contention

A lock-free data structure that works with two threads may fail with 64. Use stress tests with thread sanitizers and address sanitizers. Run for hours on many cores. Bugs often appear only under high contention and specific interleavings.

5. Maintenance, Drift, and Long-Term Costs

Lock-free code is not write-once and forget. It requires ongoing attention as compilers, hardware, and usage patterns change.

Compiler and Hardware Evolution

C++ memory model guarantees are stable, but compiler optimizations can expose latent bugs. For example, a compiler might reorder operations in a way that breaks your assumptions if you used relaxed ordering. Always compile with the latest toolchain and run your stress tests.

Code Reviews Are Harder

Lock-free code is difficult to review because reasoning about interleavings is mentally taxing. Establish review guidelines: check that every atomic operation has a documented memory order, that CAS loops have backoff, and that reclamation is correct. Consider using model checking tools like CDSChecker or TLA+ for critical components.

Integration with Lock-Based Code

Mixing lock-free and lock-based code can introduce subtle deadlocks or data races. For example, if a lock-free queue passes pointers to objects protected by a mutex, you need to ensure the mutex is acquired before accessing the object. Document the contract clearly.

Long-Term Costs

The initial development time for a lock-free data structure is often 2-3x longer than a lock-based equivalent. Maintenance costs are also higher because bugs can be rare and hard to reproduce. If your project has a high turnover of developers, the knowledge may be lost. Consider whether the performance gain justifies these costs.

6. When Not to Use This Approach

Lock-free is not a universal solution. Here are situations where you should avoid it.

Low Contention

If your critical section is short and contention is low, a simple mutex is faster and easier. Lock-free adds overhead from CAS retries and memory ordering. Measure before you decide.

Complex Data Structures

Lock-free trees, graphs, or priority queues are extremely complex. Few production implementations exist. If you need a concurrent tree, consider using a lock-based approach with fine-grained locking, or use a database that handles concurrency for you.

Real-Time Systems with Hard Deadlines

While lock-free avoids blocking, it can still have unbounded retries. In a hard real-time system, you need deterministic worst-case execution time. Lock-free algorithms with retry loops can exceed deadlines. Use wait-free algorithms instead, or use a priority inheritance protocol with locks.

Prototyping or Short-Lived Code

If you're building a prototype that will be thrown away, use locks. Lock-free takes too long to get right. Only invest the effort when the code will be in production for years.

When You Lack Testing Infrastructure

Without stress testing tools (ThreadSanitizer, AddressSanitizer, model checkers), you're flying blind. If your project can't run these, stick with locks.

7. Open Questions and FAQ

We often get questions about practical aspects of lock-free programming. Here are answers to the most common ones.

Should I use `std::atomic` or compiler intrinsics?

Use std::atomic. It's portable and the compiler will generate optimal code for your target. Intrinsics are only needed for operations not exposed by the standard, like double-word CAS on 32-bit platforms.

How do I choose between hazard pointers and RCU?

Hazard pointers are more general and work for any data structure, but they require each thread to maintain a list of pointers. RCU is simpler for read-mostly workloads and has lower overhead for readers, but it requires a grace period mechanism and is harder to integrate with non-RCU code. If your workload is 90% reads, RCU is likely better.

Is lock-free always faster than a mutex?

No. Under low contention, a mutex can be faster because it avoids CAS retries. Under high contention, lock-free can scale better, but the crossover point depends on the hardware and the specific algorithm. Always benchmark with your actual workload.

What about wait-free algorithms?

Wait-free algorithms guarantee that every thread completes in a bounded number of steps. They are even harder to design than lock-free. For most practical purposes, lock-free is sufficient. Wait-free is reserved for hard real-time systems.

How do I test lock-free code?

Use stress tests with many threads running for hours. Enable ThreadSanitizer and AddressSanitizer. Use model checkers like CDSChecker or TLA+ for critical algorithms. Fuzz the number of threads and operations. Test on different architectures (x86, ARM) if possible.

8. Summary and Next Steps

Lock-free data structures can deliver significant performance gains in high-contention scenarios, but they require careful design, thorough testing, and ongoing maintenance. The cost of getting it wrong is high: subtle bugs that cause crashes or data corruption under load.

Here are your next moves if you decide to proceed:

Start with a known pattern. Use a well-studied algorithm like the Michael-Scott queue or Treiber stack. Don't invent your own.
Use a library when possible. Consider boost.lockfree, folly, or concurrentqueue before writing your own.
Implement memory reclamation early. Choose hazard pointers, RCU, or epoch-based reclamation. Don't postpone it.
Write stress tests from day one. Use ThreadSanitizer and run on many cores. Automate the tests in CI.
Document the memory ordering. Every atomic operation should have a comment explaining why that order is correct.
Measure, measure, measure. Profile before and after. If the performance gain is less than 20%, consider whether the complexity is worth it.

Lock-free is a powerful technique, but it's not a silver bullet. Use it where it fits, and don't be afraid to fall back to locks when the cost outweighs the benefit. The best code is the code that works correctly and is maintainable by your team.

Busy Dev's Checklist for Lock-Free Data Structures in C++

Table of Contents

1. Where Lock-Free Shows Up in Real Work

Real-World Scenario: Packet Processing

2. Foundations That Are Often Misunderstood

Atomic Operations

Memory Ordering

The ABA Problem

3. Patterns That Usually Work

Lock-Free Stack (Treiber Stack)

Lock-Free Queue (Michael-Scott)

Lock-Free Hash Table (Split-Ordered Lists)

Read-Copy-Update (RCU)

Hazard Pointers

4. Anti-Patterns and Why Teams Revert

Overusing CAS Without Backoff

Ignoring Memory Reclamation

Assuming Sequential Consistency Is Free

Writing Your Own Algorithm from Scratch

Not Testing Under Contention

5. Maintenance, Drift, and Long-Term Costs

Compiler and Hardware Evolution

Code Reviews Are Harder

Integration with Lock-Based Code

Long-Term Costs

6. When Not to Use This Approach

Low Contention

Complex Data Structures

Real-Time Systems with Hard Deadlines

Prototyping or Short-Lived Code

When You Lack Testing Infrastructure

7. Open Questions and FAQ

Should I use `std::atomic` or compiler intrinsics?

How do I choose between hazard pointers and RCU?

Is lock-free always faster than a mutex?

What about wait-free algorithms?

How do I test lock-free code?

8. Summary and Next Steps

Comments (0)

Table of Contents

1. Where Lock-Free Shows Up in Real Work

Real-World Scenario: Packet Processing

2. Foundations That Are Often Misunderstood

Atomic Operations

Memory Ordering

The ABA Problem

3. Patterns That Usually Work

Lock-Free Stack (Treiber Stack)

Lock-Free Queue (Michael-Scott)

Lock-Free Hash Table (Split-Ordered Lists)

Read-Copy-Update (RCU)

Hazard Pointers

4. Anti-Patterns and Why Teams Revert

Overusing CAS Without Backoff

Ignoring Memory Reclamation

Assuming Sequential Consistency Is Free

Writing Your Own Algorithm from Scratch

Not Testing Under Contention

5. Maintenance, Drift, and Long-Term Costs

Compiler and Hardware Evolution

Code Reviews Are Harder

Integration with Lock-Based Code

Long-Term Costs

6. When Not to Use This Approach

Low Contention

Complex Data Structures

Real-Time Systems with Hard Deadlines

Prototyping or Short-Lived Code

When You Lack Testing Infrastructure

7. Open Questions and FAQ

Should I use std::atomic or compiler intrinsics?

How do I choose between hazard pointers and RCU?

Is lock-free always faster than a mutex?

What about wait-free algorithms?

How do I test lock-free code?

8. Summary and Next Steps

Share this article:

Comments (0)

Related Articles

System Programming Checklist: Expert Tips for Efficient Kernel Workflows

A Practical Checklist for Implementing Efficient System Calls in Modern C++

A Practical Checklist for Debugging System-Level Crashes and Core Dumps

Should I use `std::atomic` or compiler intrinsics?