False sharing occurs when two threads on different CPU cores modify different variables that happen to sit on the same cache line. The hardware sees writes to the same cache line from two cores and forces cache coherence traffic, serializing what should be parallel work. Threads appear to share data when they don’t, and performance collapses to worse than single-threaded.
Analysis Briefing
- Topic: False sharing, cache coherence, and NUMA in parallel systems
- Analyst: Mike D (@MrComputerScience)
- Context: A research sprint initiated by Claude Sonnet 4.6
- Source: Pithy Cyborg | AI News Made Simple
- Key Question: How can two threads writing to completely different variables still serialize each other?
Cache Lines and the MESI Protocol
CPUs do not transfer individual bytes between cache and memory. They transfer cache lines, typically 64 bytes. When a core reads a variable, it loads the entire 64-byte line containing that variable into its L1 cache.
The MESI protocol (Modified, Exclusive, Shared, Invalid) governs cache coherence across cores. When Core 0 writes to a cache line, it transitions the line to Modified state and broadcasts an invalidation to all other cores holding copies. Core 1 must then fetch the updated line from Core 0’s cache before it can proceed with its own operation.
This is correct and necessary for actual shared data. The problem is granularity. MESI operates on cache lines, not variables. If Core 0’s counter and Core 1’s counter both fit in the same 64-byte line, every write by either core invalidates the other’s copy, even though the counters are logically independent.
A Concrete Example and How to Detect It
Consider a thread pool where each thread increments its own per-thread counter:
// This structure causes false sharing on most hardware
struct ThreadData {
    long counter; // 8 bytes
};

ThreadData threads[8]; // all 8 counters fit in 64 bytes = one cache line

// Thread i increments threads[i].counter in a tight loop.
// Result: catastrophic false sharing, near-zero scaling.
Measuring the performance hit requires hardware performance counters. On Linux, perf stat -e cache-misses,cache-references ./your_binary reports the overall cache miss rate; a program with false sharing shows an anomalously high miss rate despite a small working set. To pinpoint false sharing specifically, perf c2c records cache-line contention (HITM events, where a load hits a line modified in another core's cache) and reports the exact contended lines.
Intel VTune and AMD uProf provide cache line contention analysis that identifies exactly which variables share cache lines and which threads are contending.
The fix is padding each hot variable to its own cache line:
#define CACHE_LINE_SIZE 64

struct alignas(CACHE_LINE_SIZE) ThreadData {
    long counter;
    char padding[CACHE_LINE_SIZE - sizeof(long)]; // explicit padding; alignas alone
                                                  // would also round sizeof up to 64
};

ThreadData threads[8]; // each counter now occupies its own cache line
In C++17, std::hardware_destructive_interference_size provides a portable cache line size constant. In Java, @Contended (JDK 8+) adds padding automatically, but requires the -XX:-RestrictContended JVM flag for application classes outside the JDK.
Real-World Cases Where False Sharing Appears
Thread pool task counters. Each worker thread maintains counters for tasks completed, errors encountered, and bytes processed. If counters for all workers pack into adjacent memory (as they do in a naive array of structs), every thread’s counter update invalidates every other thread’s counters. Throughput does not scale past 2 to 4 cores.
Lock-free ring buffers. Many ring buffer implementations store head and tail indices as adjacent fields in a struct. The producer updates tail and the consumer updates head. These are different variables, but they share a cache line. Every enqueue operation invalidates the consumer’s cache copy and vice versa, creating artificial contention on what is designed as a lock-free structure.
Histogram accumulators. Parallel histogram computation assigns each thread a slice of the input. Naive implementations write to a shared output array, where adjacent histogram buckets sit on the same cache lines. Threads responsible for nearby value ranges contend on the same cache lines despite writing to different buckets.
The fix in all cases is the same: ensure that data modified by different threads either sits on different cache lines through padding or is accumulated in thread-local storage and merged at the end.
What This Means For You
- Profile before padding, because adding cache line padding to every struct wastes memory and L1 capacity, and false sharing is only worth fixing on structures that are actually hot in parallel execution.
- Apply alignas(64) to per-thread data structures in C++ as a first fix when you see unexpectedly poor parallel scaling, because it costs one line of code and a little memory and eliminates false sharing outright.
- Measure with hardware counters, not wall time alone, because false sharing produces the confusing result that adding more threads makes performance worse, and wall-time comparisons without counter data make this look like a scheduling or algorithmic problem when it is a cache problem.
- Prefer thread-local accumulation over shared arrays in parallel reduction patterns, because merging per-thread results at the end produces better cache behavior than having threads contend on shared output during computation.
Enjoyed this deep dive? Join my inner circle:
- Pithy Cyborg | AI News Made Simple → AI news made simple without hype.
