Modern CPUs reorder instructions at runtime to keep execution units busy while waiting on memory or slow operations. Your code runs in program order logically, but physically the CPU may execute instruction 10 before instruction 3 if instruction 3 is stalled waiting on a cache miss. This is invisible in single-threaded code and catastrophic in multi-threaded code without the right memory barriers.
Pithy Cyborg | AI FAQs – The Details
Question: Why do modern CPUs execute instructions out of order, and how does that affect your multi-threaded code?
Asked by: Claude Sonnet 4.6
Answered by: Mike D (MrComputerScience)
From Pithy Cyborg | AI News Made Simple
And Pithy Security | Cybersecurity News
Why Out-of-Order Execution Exists and How It Works
A modern CPU core runs at 3 to 5 GHz but main memory latency is 60 to 100 nanoseconds, roughly 200 to 400 clock cycles. If a CPU stalled every time it waited on a memory load, it would spend the majority of its time doing nothing. Out-of-order execution (OOO) exists to solve this.
The CPU’s instruction window, called the reorder buffer (ROB), holds dozens to hundreds of in-flight instructions simultaneously. An Intel Raptor Lake core has a ROB of 512 entries. The CPU scans ahead in the instruction stream, identifies which instructions have all their inputs ready, and dispatches those first regardless of their original program order. Instructions that are stalled waiting on a cache miss sit in the ROB while independent instructions behind them execute and retire.
The hardware maintains the illusion of in-order execution through a retirement stage: results are committed to registers and memory only in original program order, even if they were computed out of order. From a single thread’s perspective, this is completely transparent. Your program behaves exactly as if instructions ran sequentially. The CPU handles all the bookkeeping invisibly.
This is why OOO CPUs can sustain instruction-level parallelism (ILP) of several instructions per cycle (often 2 to 4 on favorable workloads) despite nominally sequential instruction streams. The hardware is extracting parallelism you never explicitly wrote.
How Out-of-Order Execution Breaks Multi-Threaded Code
The single-thread transparency guarantee evaporates the moment a second core enters the picture. Out-of-order execution combined with store buffers and cache coherence protocols means that writes made by one core can become visible to other cores in a different order than the writing core performed them.
The classic example is double-checked locking, a pattern used to lazily initialize a singleton:
if (instance == NULL) {
    lock();
    if (instance == NULL) {
        instance = new Singleton(); // DANGEROUS
    }
    unlock();
}
The line instance = new Singleton() compiles to roughly: allocate memory, initialize fields, write pointer to instance. Out-of-order execution and compiler reordering can publish the pointer to instance before the initialization writes are visible to other cores. A second thread sees a non-NULL instance pointer and proceeds to use an uninitialized object. The program crashes or silently corrupts data with no apparent race condition in the code.
This is not a theoretical edge case. It is a real bug class that has appeared in production Java, C++, and Go codebases. Spectre is a consequence of the same speculative execution machinery that enables OOO: it weaponizes the branch predictor at the microarchitectural level, exploiting the same fundamental fact that the CPU does things in a different order than the programmer wrote.
What Memory Models and Barriers Actually Do
Every CPU architecture defines a memory model that specifies exactly which reorderings the hardware is allowed to perform. x86 has a relatively strong memory model (Total Store Order): stores are not reordered with other stores, loads are not reordered with other loads, but stores can be reordered with subsequent loads. ARM and RISC-V have weaker models that allow more reorderings and require more explicit barriers.
Memory barriers (also called fences) are instructions that constrain reordering. A full barrier prevents any load or store from crossing it in either direction. A store barrier ensures all preceding stores are visible before any subsequent store. A load barrier ensures all preceding loads complete before any subsequent load.
In practice, you rarely write barrier instructions directly. You use higher-level primitives:
In C++, std::atomic with appropriate memory_order parameters (memory_order_acquire, memory_order_release, memory_order_seq_cst) compiles to the correct barriers for your target architecture. In Java, volatile variables insert the necessary barriers. In Go, sync/atomic functions and channel operations handle this.
The rule of thumb: any variable shared between threads and written by at least one of them requires either atomic operations, a mutex, or explicit barriers. Anything less is a data race. Data races are undefined behavior in C++; in Go they can corrupt memory; and even in Java, where the memory model bounds the damage to memory-safe outcomes, they produce unpredictable values.
What This Means For You
- Never share mutable state between threads without synchronization primitives — data races caused by OOO reordering produce bugs that are intermittent, hardware-dependent, and nearly impossible to reproduce under a debugger.
- Use std::atomic in C++, volatile in Java, and sync/atomic in Go for shared flags and counters — these insert the correct barriers for your target architecture automatically.
- Avoid double-checked locking without atomic operations — the pattern is broken without proper barriers and has been the source of production bugs across multiple languages.
- Read your architecture’s memory model documentation if you are writing lock-free code — x86, ARM, and RISC-V have meaningfully different guarantees and code correct on x86 can fail silently on ARM.
- Use ThreadSanitizer during development and CI — it instruments your binary to detect data races at runtime with low false positive rates and is the most reliable way to catch OOO-induced race conditions before production.
Pithy Cyborg | AI News Made Simple
Subscribe (Free): https://pithycyborg.substack.com/subscribe
Read archives (Free): https://pithycyborg.substack.com/archive
Pithy Security | Cybersecurity News
Subscribe (Free): https://pithysecurity.substack.com/subscribe
Read archives (Free): https://pithysecurity.substack.com/archive
