False sharing happens when two CPU cores modify different variables that happen to live on the same cache line. The hardware treats the entire line as contested and forces expensive synchronization on every write, even though the cores are logically working on independent data. It is silent, produces no errors, and can reduce parallel speedup by 10x or more.
Pithy Cyborg | AI FAQs – The Details
Question: What is false sharing in CPU cache and how does it silently destroy parallel program performance?
Asked by: Claude Sonnet 4.6
Answered by: Mike D (MrComputerScience)
From Pithy Cyborg | AI News Made Simple
And Pithy Security | Cybersecurity News
Why CPU Cache Lines Are the Root of the Problem
CPUs do not load individual bytes from memory. They load cache lines, typically 64 bytes on x86 and most ARM cores (some designs, such as Apple Silicon, use 128-byte lines). Every read and write operates on the full cache line containing the target address, whether you asked for 1 byte or 64.
This matters enormously for multi-core systems because cache coherence protocols like MESI (Modified, Exclusive, Shared, Invalid) track ownership at the cache line level, not the variable level. When core 0 writes to a cache line, the protocol invalidates every other core’s copy of that entire line. Core 1 must then fetch the updated line from core 0’s cache or from main memory before it can read or write anything on that line, even a completely unrelated variable.
False sharing is what happens when two cores are writing to different variables that share a cache line. Neither core is actually accessing the other’s data. But the hardware cannot tell the difference. It sees two cores hammering the same cache line and enforces full coherence protocol overhead on every write. The result looks like a data race in terms of performance, without any of the logical data dependency that would justify the cost.
On a modern 32-core server, a tight loop with false sharing can perform worse than the single-threaded version. You added 32 cores and made it slower. That is false sharing in action.
How to Identify False Sharing in a Real Codebase
False sharing is insidious precisely because it produces no incorrect output and no compiler warnings. Your program produces the right answer, just very slowly. Standard profilers show high CPU utilization and low throughput, which looks identical to legitimate compute-bound work.
The tell-tale hardware signal is a high rate of LLC (last-level cache) misses combined with high rates of cache line invalidations. On Linux, perf stat -e cache-misses,cache-references,LLC-load-misses will surface the cache miss rate, and perf c2c (cache-to-cache) is a dedicated mode for detecting cache line contention directly. Intel VTune's memory access analysis can pinpoint specific cache lines experiencing false sharing. On ARM, equivalent counters are accessible through the PMU (Performance Monitoring Unit).
The structural code pattern to look for is arrays of small independently-modified values accessed in parallel. A classic example: a thread pool where each thread writes its result into results[thread_id], and the results array is a contiguous array of integers. If each integer is 4 bytes and a cache line is 64 bytes, sixteen thread results share each cache line. Sixteen threads hammering sixteen adjacent integers produce maximum false sharing.
Counter-intuitively, the cache-timing side channels used by Spectre attacks (such as Flush+Reload) exploit the same cache line granularity that makes false sharing possible. Both are consequences of hardware that optimizes for throughput by operating on 64-byte chunks rather than individual bytes.
How to Fix False Sharing With Padding and Alignment
The standard fix is padding. You force each independently-written variable onto its own cache line by adding enough unused bytes to push the next variable past the 64-byte boundary.
In C and C++, this looks like:
struct alignas(64) ThreadCounter {
    int64_t value;        // 8 bytes of real data
    char padding[56];     // 8 + 56 = 64 bytes total, one full cache line
};

ThreadCounter counters[NUM_THREADS];
Each ThreadCounter now occupies exactly one cache line. Thread N writes counters[N].value and touches only its own cache line. No false sharing.
C++17 introduced std::hardware_destructive_interference_size, a compile-time constant in the <new> header that gives the minimum offset needed to keep two objects from destructively interfering, which in practice is the cache line size. It makes the padding portable without hardcoding 64. Use it instead of magic numbers where your toolchain provides it; some standard libraries shipped it years after the language standard did.
For Java and other managed runtimes, the @Contended annotation (available since Java 8, and requiring the -XX:-RestrictContended flag for application code) instructs the JVM to pad the annotated field onto its own cache line. The LMAX Disruptor, a high-performance inter-thread messaging library, relies heavily on manual cache-line padding and demonstrates throughput in the tens of millions of messages per second that would be impossible with naive shared-array designs.
The tradeoff is memory. Padding wastes 56 bytes per thread counter to buy cache line isolation. For 32 threads that is 1,792 bytes of waste, which is almost always worth it. At 10,000 threads (roughly 560 KB of padding) it requires more thought.
What This Means For You
- Profile with hardware counters first — high LLC miss rates in parallel code with low actual data sharing are the fingerprint of false sharing, not a memory bottleneck.
- Audit any struct or array where multiple threads write adjacent fields — this is the most common false sharing pattern and it appears in thread pools, metrics collectors, and lock-free queues.
- Use alignas(64) or std::hardware_destructive_interference_size to pad hot per-thread data structures in C and C++, and @Contended in Java.
- Never benchmark parallel code on a single core — false sharing only manifests under genuine multi-core contention and will be invisible in single-threaded tests.
- Check your metrics and telemetry infrastructure — per-thread counters that are aggregated at reporting time are a common false sharing source in production services, and the performance cost shows up as degraded throughput rather than high CPU usage.
Pithy Cyborg | AI News Made Simple
Subscribe (Free): https://pithycyborg.substack.com/subscribe
Read archives (Free): https://pithycyborg.substack.com/archive
Pithy Security | Cybersecurity News
Subscribe (Free): https://pithysecurity.substack.com/subscribe
Read archives (Free): https://pithysecurity.substack.com/archive
