Concurrency in C++ can feel like a superpower—until your program deadlocks, data races corrupt your data, or performance degrades instead of improving. This guide cuts through the noise. We'll walk through the three main tools in modern C++: std::thread, std::async, and parallel algorithms with execution policies. By the end, you'll have a decision framework for choosing the right approach, plus checklists to avoid the most common pitfalls.
Why Concurrency Matters Now
Modern CPUs don't get much faster clock-for-clock. Instead, they pack more cores. A single-threaded C++ program leaves most of that silicon idle. If you're processing sensor data, rendering frames, or handling network requests, concurrency is no longer optional—it's the path to scaling performance.
But concurrency comes at a cost: complexity. The C++ standard library gives us powerful primitives, but using them incorrectly leads to bugs that are notoriously hard to reproduce. Teams often find that a naive threading approach makes the application slower due to context switching, false sharing, or lock contention.
This guide is for C++ developers who already know the basics of the language and want to add concurrency to their toolbox. We assume you've written a class or two and are comfortable with pointers and references. We won't rehash the entire standard; we'll focus on what works in practice and what breaks.
We'll use a running example: processing a large collection of high-resolution images. Each image needs to be decoded, filtered, and saved. This is an embarrassingly parallel workload—ideal for concurrency—but it also exposes the trade-offs between thread management, async launches, and parallel algorithms.
Core Idea in Plain Language
Concurrency means making progress on multiple tasks at the same time. In C++, you have three main levers:
- std::thread: A low-level handle to an OS thread. You create it, you join or detach it. You manage synchronization yourself.
- std::async: A higher-level interface that returns a std::future. Depending on the launch policy, the task runs asynchronously on another thread or is deferred until you ask for the result (lazy evaluation).
- Parallel algorithms: C++17 added execution policies like std::execution::par and std::execution::par_unseq. You call std::for_each or std::sort with a policy, and the library splits the work across threads automatically.
Each tool has a sweet spot. std::thread gives you full control but requires careful resource management. std::async abstracts away thread creation but can surprise you with its launch policy (deferred vs. async). Parallel algorithms are the easiest to use but only fit data-parallel operations that can be expressed as a standard algorithm over a range.
Let's illustrate with our image processing scenario. Suppose we have 10,000 images. A naive approach: spawn one thread per image. That would create thousands of threads, overwhelming the OS scheduler and causing massive overhead. A better approach: use a thread pool (manually or via std::async with a reasonable limit) or use std::for_each with parallel execution policy.
Here's the key insight: concurrency is not parallelism. Concurrency is about structure (dealing with many things at once); parallelism is about speed (doing many things at once). You can have concurrency without parallelism (e.g., a single-core machine interleaving tasks). For performance, you want parallelism, which requires multiple cores and careful design to avoid bottlenecks.
How It Works Under the Hood
Thread Creation and Scheduling
When you create a std::thread, the C++ runtime calls the OS's thread creation API (e.g., pthread_create on Linux, CreateThread on Windows). Each thread gets its own stack (typically 1–8 MB). Creating a thread is not cheap—it takes microseconds and consumes kernel resources. That's why creating a thread per task is wasteful for fine-grained work.
The OS scheduler decides which thread runs on which core. A context switch (saving and restoring thread state) typically costs on the order of microseconds, and more once the cache misses that follow it are counted. If you have more runnable threads than cores, the system spends significant time switching, reducing throughput.
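To make the sizing advice concrete, here is a minimal sketch of picking a worker count. The helper name pick_thread_count and its fallback parameter are our own illustration, not a standard facility. It accounts for the fact that std::thread::hardware_concurrency() is only a hint and may return 0, and it never asks for more threads than there are tasks.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>

// Hypothetical helper: choose how many worker threads to start.
// hardware_concurrency() is a hint and may return 0 when the value
// cannot be determined, so fall back to a caller-supplied default.
std::size_t pick_thread_count(std::size_t task_count, std::size_t fallback = 2) {
    assert(fallback > 0);
    std::size_t hw = std::thread::hardware_concurrency();
    if (hw == 0) hw = fallback;
    // Never start more threads than there are tasks; always at least one.
    return std::max<std::size_t>(1, std::min(hw, task_count));
}
```

For the 10,000-image workload this would return the core count reported by the system, while a batch of three images would get at most three threads.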
std::async and std::future
std::async wraps a callable and returns a std::future. The launch policy can be std::launch::async (run as if on a new thread), std::launch::deferred (run lazily on the thread that calls .get() or .wait()), or the default std::launch::async | std::launch::deferred, which lets the implementation choose. The standard does not require a thread pool: libstdc++ and libc++ typically start a new thread for each std::launch::async task, while MSVC's implementation runs tasks on a Windows thread pool. The default policy can be dangerous: if the implementation chooses deferred, your program may become single-threaded unexpectedly.
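When you call .get() on a future, it blocks until the result is ready. If the task was deferred, it runs synchronously on the calling thread at that point. This can deadlock if the deferred task waits for something the caller would only do after .get() returns, and a loop that polls wait_for() on a deferred future never finishes, because the status stays std::future_status::deferred.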
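The difference between the two explicit policies can be observed directly. The sketch below uses a hypothetical helper, runs_on_calling_thread, which compares the thread id seen inside the task with the id of the thread that calls .get().

```cpp
#include <cassert>
#include <future>
#include <thread>

// Returns true if a task launched with `policy` ran on the thread that
// called get(), i.e. it was executed synchronously rather than in parallel.
bool runs_on_calling_thread(std::launch policy) {
    const std::thread::id caller = std::this_thread::get_id();
    std::future<std::thread::id> f =
        std::async(policy, [] { return std::this_thread::get_id(); });
    return f.get() == caller;
}

// Sanity checks for the two explicit policies.
void check_launch_policies() {
    assert(runs_on_calling_thread(std::launch::deferred));  // runs inside get()
    assert(!runs_on_calling_thread(std::launch::async));    // runs on another thread
}
```

With the default policy either outcome is allowed, which is why the result of the same check would be implementation-dependent.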
Parallel Algorithms and Execution Policies
C++17's parallel algorithms are typically implemented on top of an internal thread pool, with SIMD instructions where the policy permits. The policy std::execution::par allows the element operations to run on multiple threads, so avoiding data races is your responsibility: each operation must either be independent of the others or synchronize its access to shared state. std::execution::par_unseq additionally allows vectorization, which means several invocations of your function may be interleaved on a single thread. This is stricter: the function must not acquire mutexes or perform other blocking synchronization, because an interleaved invocation could try to take a lock that the same thread already holds.
Internally, the library divides the range into chunks and dispatches them to worker threads. The chunk size is typically chosen to balance load: too small, and overhead dominates; too large, and some threads starve. Some implementations use work-stealing to improve load balance.
For our image processing example, using std::for_each with std::execution::par is the simplest: we write a lambda that processes one image, and the library handles threading. But we must ensure the lambda is thread-safe—no shared mutable state without synchronization.
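As a small stand-in for per-image work, the sketch below runs a parallel reduction over a vector of integers. The function name sum_of_squares is our own; the point is that the lambda touches only its own element, so no synchronization is needed. Depending on the toolchain, parallel policies may need extra setup (with GCC's libstdc++, for instance, you may have to link against Intel TBB).

```cpp
#include <execution>
#include <functional>
#include <numeric>
#include <vector>

// Sum of squares using a parallel reduction. Each element is transformed
// independently and the partial results are combined by the library, so
// the lambda touches no shared mutable state.
long long sum_of_squares(const std::vector<int>& values) {
    return std::transform_reduce(
        std::execution::par, values.begin(), values.end(),
        0LL,            // initial value; also fixes the result type
        std::plus<>{},  // how partial sums are combined
        [](int v) { return static_cast<long long>(v) * v; });
}
```

Letting the algorithm combine partial results is usually preferable to accumulating into a shared variable from inside std::for_each, which would need an atomic or a mutex.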
Worked Example: Processing Images in Parallel
Scenario
We have 10,000 JPEG images (each ~5 MB) stored in a vector of file paths. For each image, we need to decode it (using a library like libjpeg), apply a filter (e.g., grayscale), and save the result to disk. This is I/O-bound (reading/writing) and CPU-bound (decoding/filtering).
Approach 1: std::thread with a Pool
We create a fixed number of threads (e.g., std::thread::hardware_concurrency()) and distribute the work through a shared atomic index that each thread advances to claim the next image. This gives us control over resource usage. Here's a skeleton:
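#include <atomic>
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

void process_one(const std::string& path);  // decode, filter, save (defined elsewhere)

void process_images(const std::vector<std::string>& paths) {
    // hardware_concurrency() may return 0 if the value is unknown.
    std::size_t num_threads = std::thread::hardware_concurrency();
    if (num_threads == 0) num_threads = 2;
    std::vector<std::thread> threads;
    std::atomic<std::size_t> index{0};  // next unclaimed image
    for (std::size_t i = 0; i < num_threads; ++i) {
        threads.emplace_back([&] {
            while (true) {
                // Claim the next index; relaxed is enough for a work counter.
                std::size_t idx = index.fetch_add(1, std::memory_order_relaxed);
                if (idx >= paths.size()) break;
                process_one(paths[idx]);
            }
        });
    }
    for (auto& t : threads) t.join();
}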
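This works, but we must ensure process_one is thread-safe (no unsynchronized shared mutable state) and that it does not let an exception escape, which would terminate the program. The atomic index avoids a mutex for task distribution, but if process_one writes to a shared log, we need a mutex.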
Approach 2: std::async with an Explicit Launch Policy
We can launch each image processing as an async task:
std::vector<std::future<void>> futures;
for (const auto& path : paths) {
    futures.push_back(std::async(std::launch::async, process_one, path));
}
for (auto& f : futures) f.get();
This is simple, but with std::launch::async every task runs as if on its own new thread. libstdc++ and libc++ typically start one thread per call, so this loop can create thousands of threads; even where the implementation uses a thread pool (as MSVC's does), the number of queued tasks is unbounded, leading to memory pressure. A better variant is to limit concurrency by batching or using a semaphore.
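One way to batch is sketched below. The helper run_in_batches and its max_in_flight parameter are hypothetical names for illustration. It launches at most max_in_flight tasks, waits for all of them, and then moves on to the next batch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <future>
#include <vector>

// Hypothetical helper: run fn(item) for every item, but keep at most
// `max_in_flight` tasks running at once by working in batches.
template <typename T, typename Fn>
void run_in_batches(const std::vector<T>& items, std::size_t max_in_flight, Fn fn) {
    assert(max_in_flight > 0);
    for (std::size_t start = 0; start < items.size(); start += max_in_flight) {
        const std::size_t end = std::min(items.size(), start + max_in_flight);
        std::vector<std::future<void>> batch;
        batch.reserve(end - start);
        for (std::size_t i = start; i < end; ++i) {
            // Capture by reference: `items` and `fn` outlive every task,
            // because each batch is fully waited on before we continue.
            batch.push_back(std::async(std::launch::async,
                                       [&fn, &item = items[i]] { fn(item); }));
        }
        // Wait for the whole batch; get() rethrows any exception from fn.
        for (auto& f : batch) f.get();
    }
}
```

Batching is simple but leaves cores idle while the slowest task in each batch finishes; a semaphore or a work queue would keep them busier at the cost of more code.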
Approach 3: Parallel Algorithm
std::for_each(std::execution::par, paths.begin(), paths.end(), process_one);
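This is the cleanest. The library manages the threads and chunking. However, process_one must be safe for parallel execution: any shared mutable state must be synchronized (a mutex is allowed under std::execution::par, though not under std::execution::par_unseq), and it should not rely on thread-local state, because you do not control which thread handles which element. It must also not let an exception escape: if an element function throws under a parallel policy, the program calls std::terminate.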
Trade-offs
In practice, the parallel algorithm approach is often the best for pure data-parallel workloads. It's concise, less error-prone, and performs well. The thread pool approach gives you more control (e.g., priority, cancellation), but you have to write and debug the infrastructure. std::async is a middle ground, but the default policy can be unpredictable—always specify std::launch::async if you want guaranteed parallelism.
Edge Cases and Exceptions
Data Races and Undefined Behavior
The most common concurrency bug is the data race: two threads access the same memory location without synchronization, and at least one writes. In C++, this is undefined behavior—the program can crash, produce wrong results, or appear to work until a different compiler or hardware exposes it. Use std::mutex, std::atomic, or higher-level constructs like std::future to protect shared data.
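A subtle example: incrementing a shared counter. A plain int needs a mutex, while std::atomic<int> makes the increment safe without one. The default memory ordering (std::memory_order_seq_cst) is the easiest to reason about but can cost more than necessary on some architectures. For a counter whose value is only read after the threads have been joined, relaxed ordering suffices, but you must understand the guarantees you are giving up. When in doubt, use sequential consistency.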
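A minimal sketch of the counter case, with a hypothetical function count_in_parallel: several threads increment one atomic with relaxed ordering, and the total is read only after every thread has been joined.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Count events from several threads. Relaxed ordering is enough here
// because only the final total matters and join() synchronizes with
// the end of each thread before we read it.
int count_in_parallel(int num_threads, int increments_per_thread) {
    assert(num_threads >= 0 && increments_per_thread >= 0);
    std::atomic<int> counter{0};
    std::vector<std::thread> threads;
    threads.reserve(static_cast<std::size_t>(num_threads));
    for (int t = 0; t < num_threads; ++t) {
        threads.emplace_back([&counter, increments_per_thread] {
            for (int i = 0; i < increments_per_thread; ++i) {
                counter.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& th : threads) th.join();
    return counter.load(std::memory_order_relaxed);
}
```

Relaxed ordering would not be enough if the counter were used to publish other data, for example as a "ready" flag guarding a buffer.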
Deadlocks
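Deadlocks occur when two threads each hold a lock the other needs. The classic prevention: always lock mutexes in the same order. When you need several locks at once, use std::lock, which acquires them with a deadlock-avoidance algorithm regardless of argument order (in C++17, std::scoped_lock wraps the same idea in a single RAII object). For example: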
std::lock(mutex1, mutex2);
std::lock_guard<std::mutex> lock1(mutex1, std::adopt_lock);
std::lock_guard<std::mutex> lock2(mutex2, std::adopt_lock);
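The same pattern with C++17's std::scoped_lock, as a sketch. The Account type and the functions transfer and run_opposing_transfers are hypothetical and exist only to show two threads locking the same pair of mutexes in opposite argument order.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

struct Account {
    std::mutex m;
    long balance = 0;
};

// Move `amount` between two accounts. std::scoped_lock (C++17) locks both
// mutexes with a deadlock-avoidance algorithm, as std::lock does, and
// releases them when it goes out of scope.
void transfer(Account& from, Account& to, long amount) {
    assert(&from != &to);  // locking the same mutex twice would be undefined
    std::scoped_lock lock(from.m, to.m);
    from.balance -= amount;
    to.balance += amount;
}

// Two threads transfer in opposite directions. If each call simply locked
// its first argument and then its second, this could deadlock.
long run_opposing_transfers(int rounds) {
    Account a, b;
    a.balance = 1000;
    b.balance = 1000;
    std::thread t1([&] { for (int i = 0; i < rounds; ++i) transfer(a, b, 1); });
    std::thread t2([&] { for (int i = 0; i < rounds; ++i) transfer(b, a, 1); });
    t1.join();
    t2.join();
    return a.balance + b.balance;
}
```

The combined balance is unchanged after the run, since every transfer subtracts and adds the same amount under both locks.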
Exception Safety
If a std::thread's function throws an exception, the program calls std::terminate unless you catch it inside the thread. Always wrap thread functions in a try-catch and propagate the exception via a std::promise or std::future. With std::async, exceptions are stored in the future and rethrown on .get()—much safer.
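For plain threads, one lightweight way to carry an exception back to the caller is std::exception_ptr, shown here as an alternative to a promise. The names run_and_rethrow and demo_message are hypothetical, and the "decode failed" error is made up for the example.

```cpp
#include <exception>
#include <stdexcept>
#include <string>
#include <thread>

// Run `task` on its own thread and rethrow, on the calling thread, any
// exception it raised. Without the try/catch, an escaping exception
// would call std::terminate.
template <typename Fn>
void run_and_rethrow(Fn task) {
    std::exception_ptr error;  // written by the worker, read after join()
    std::thread worker([&] {
        try {
            task();
        } catch (...) {
            error = std::current_exception();
        }
    });
    worker.join();  // join() makes the worker's write to `error` visible
    if (error) std::rethrow_exception(error);
}

// Usage: returns the message of the exception thrown inside the thread.
std::string demo_message() {
    try {
        run_and_rethrow([] { throw std::runtime_error("decode failed"); });
    } catch (const std::runtime_error& e) {
        return e.what();
    }
    return "no error";
}
```

With several workers you would keep one std::exception_ptr per thread, or protect a shared one with a mutex, and decide after joining which error to report.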
Thread Sanitizer
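Use ThreadSanitizer (TSan), available in Clang and GCC, during development. It detects data races and lock-order inversions that can lead to deadlock, but only on code paths that actually execute, so run it against tests that exercise your concurrent code. Add -fsanitize=thread -g -O1 to your build. TSan will report races even if they don't cause visible bugs. Fix them all, because they are undefined behavior.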
Limits of the Approach
Amdahl's Law and Overhead
Even with perfect parallelism, the serial portion of your program limits speedup. For example, if 10% of the work is serial (e.g., reading the file list), the maximum speedup is 10x, no matter how many cores you throw at it. Measure your serial fraction using a profiler.
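The limit comes from Amdahl's law: speedup = 1 / (s + (1 - s) / n), where s is the serial fraction and n the number of cores. A small sketch, with a function name of our own choosing, lets you plug in numbers for your workload.

```cpp
#include <cassert>

// Amdahl's law: speedup = 1 / (s + (1 - s) / n), where s is the serial
// fraction of the work and n is the number of cores.
double amdahl_speedup(double serial_fraction, unsigned cores) {
    assert(serial_fraction >= 0.0 && serial_fraction <= 1.0 && cores > 0);
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores);
}
```

With a 10% serial fraction, 8 cores give a speedup of about 4.7x, and the value approaches but never reaches 10x as the core count grows.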
Parallelism itself has overhead: thread creation, synchronization, and cache coherence traffic. For very small tasks (e.g., adding two integers), the overhead dwarfs the benefit. A rule of thumb: only parallelize tasks that take at least a few microseconds.
Memory Bandwidth
Many cores can saturate the memory bus, and the same applies to storage: in our image processing example, if all threads try to read from disk simultaneously, the I/O subsystem becomes the bottleneck. Consider separating the stages, for example a small number of threads dedicated to blocking I/O (std::async with std::launch::async is enough for a first version) that feed a separate thread pool doing the CPU work.
False Sharing
When two threads modify variables that happen to be on the same cache line, the cache coherence protocol forces expensive invalidations. In tight loops this can slow the code down by several times or more. Avoid false sharing by padding data structures to cache line boundaries (commonly 64 bytes). Use alignas(64) or std::hardware_destructive_interference_size (C++17, declared in <new>, though not every standard library provides it).
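A minimal sketch of the padding technique, assuming a 64-byte cache line. The constant kCacheLine and the type PaddedCounter are illustrative names; check the line size for your target hardware.

```cpp
#include <atomic>
#include <cstddef>

// Assumed cache line size; 64 bytes is common on x86-64 but not universal.
constexpr std::size_t kCacheLine = 64;

// One counter per thread, each aligned to its own cache line so that
// updates from different threads do not invalidate each other's lines.
struct alignas(kCacheLine) PaddedCounter {
    std::atomic<long> value{0};
};

static_assert(alignof(PaddedCounter) == kCacheLine, "counter must be line-aligned");
static_assert(sizeof(PaddedCounter) % kCacheLine == 0, "counter must fill whole lines");
```

An array such as PaddedCounter counters[4] then places each element on a separate line, at the cost of 64 bytes per counter instead of 8.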
Debugging Difficulty
Concurrency bugs are non-deterministic: they may only appear under specific load or on certain hardware. Use tools like TSan and Helgrind (Valgrind). libstdc++'s debug mode (_GLIBCXX_DEBUG) adds checks for container and iterator misuse, which helps rule out ordinary memory bugs, but it does not detect data races. Write unit tests that run many iterations with different thread counts.
Reader FAQ
When should I use std::thread vs std::async?
Use std::thread when you need to manage a thread's lifetime yourself, when you need platform-specific settings reachable through native_handle() (such as scheduling priority), or when you need a long-running background thread (e.g., a network listener). Note that std::thread has no portable way to set attributes such as stack size. Use std::async for most short-lived tasks, especially when you want exception propagation and a future-based interface. Always specify std::launch::async to avoid surprises.
How do I choose the number of threads?
For CPU-bound tasks, start from std::thread::hardware_concurrency(), keeping in mind that it is only a hint and may return 0 when the value cannot be determined. For I/O-bound tasks, you may need more threads to keep the CPU busy while waiting for I/O. Experiment: start with the hardware concurrency count and increase until performance plateaus or degrades.
Can I use std::async with a thread pool?
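It depends on the implementation. MSVC's std::async runs tasks on a Windows thread pool, while libstdc++ and libc++ typically start a new thread for each std::launch::async task. If you need a pool with guaranteed behavior (e.g., a bounded size or priority queues), you'll have to build one or use a library like Intel TBB or Boost.Asio. The standard library does not expose a thread pool interface directly.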
Are parallel algorithms always faster?
No. For small ranges, the overhead of spawning threads and splitting work outweighs the benefit. Benchmark with your actual data size. Also, parallel algorithms may not be faster if your operation is memory-bound—the extra threads just contend for bandwidth.
What about std::jthread (C++20)?
std::jthread is a joining thread that automatically joins on destruction and supports cooperative cancellation via a stop token. It's a safer alternative to std::thread because you can't forget to join. Use it for long-running threads that need to be interrupted.
How do I handle exceptions in std::thread?
Catch all exceptions inside the thread function and store them in a std::promise or a shared queue. Alternatively, use std::async which propagates exceptions via the future. Never let an exception escape a thread—it will call std::terminate.
What is the best way to synchronize access to a shared resource?
Start with a std::mutex and std::lock_guard. If the resource is simple (e.g., a counter), use std::atomic. For read-heavy workloads, consider std::shared_mutex (C++17) to allow concurrent readers. Avoid rolling your own lock-free data structures unless you have deep expertise—they are notoriously hard to get right.
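For the read-heavy case, a sketch with std::shared_mutex might look like the following. SizeCache is a hypothetical class that maps file paths to sizes; readers take a shared lock and writers take an exclusive one.

```cpp
#include <map>
#include <mutex>
#include <optional>
#include <shared_mutex>
#include <string>

// Hypothetical read-mostly cache: many readers may hold the shared lock
// at once, while a writer takes the exclusive lock.
class SizeCache {
public:
    std::optional<long> find(const std::string& path) const {
        std::shared_lock lock(mutex_);  // concurrent readers allowed
        auto it = sizes_.find(path);
        if (it == sizes_.end()) return std::nullopt;
        return it->second;
    }

    void store(const std::string& path, long bytes) {
        std::unique_lock lock(mutex_);  // exclusive access for writes
        sizes_[path] = bytes;
    }

private:
    mutable std::shared_mutex mutex_;
    std::map<std::string, long> sizes_;
};
```

Whether this beats a plain std::mutex depends on how long the read sections are and how often writes occur, so it is worth measuring before committing to it.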
Can I mix std::async and std::thread in the same program?
Yes, but be careful about thread counts. If you create many std::thread objects and also launch async tasks, you may oversubscribe the CPU. Monitor the total number of active threads and consider using a single thread pool for all concurrency needs.
Now that you have a solid understanding of the tools and trade-offs, here are your next steps: (1) Profile your current single-threaded application to identify the bottleneck. (2) Pick one concurrency method—prefer parallel algorithms for data-parallel loops, std::async for task parallelism, and std::jthread for long-running threads. (3) Enable ThreadSanitizer in your debug build and fix every race. (4) Benchmark with different thread counts and chunk sizes. (5) Document your concurrency design: which data is shared, which mutex protects it, and the locking order. This last step saves hours of debugging later.