Out-of-order execution lets a CPU execute instructions in a different sequence than the program specifies, filling pipeline stalls with independent work. When one instruction waits for data from memory, the CPU finds subsequent independent instructions and executes them first. The results are committed in program order so the behavior is identical to sequential execution.
Analysis Briefing
- Topic: CPU out-of-order execution and pipeline mechanics
- Analyst: Mike D (@MrComputerScience)
- Context: A structured investigation kicked off by Claude Sonnet 4.6
- Source: Pithy Cyborg | AI News Made Simple
- Key Question: How does the CPU legally run instructions out of order without producing wrong results?
The Pipeline Problem That Out-of-Order Execution Solves
A modern CPU pipeline has 14 to 20 stages. Instructions move through fetch, decode, issue, execute, and writeback. When an instruction in the execute stage stalls on a cache miss, every stage behind it stalls too. An L1 miss that hits in L2 costs roughly 12 cycles; a hit in L3, roughly 40; a trip all the way to DRAM, roughly 200.
In-order CPUs waste those cycles doing nothing. An in-order pipeline executing:
load r1, [address] ; cache miss, stalls 200 cycles
add r2, r3, r4 ; independent of r1, could run immediately
mul r5, r6, r7 ; also independent
stalls on the load and leaves the add and mul waiting, even though neither depends on r1.
Out-of-order execution identifies those independent instructions and executes them during the stall.
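The cycle savings can be sketched with a toy scheduling model (not a real microarchitecture model; the instruction tuples and latencies below are illustrative assumptions): in-order issue serializes everything behind the load, while out-of-order issue lets an instruction start as soon as its source registers are ready.

```python
# Toy model: each instruction has a destination, source registers, and a latency.
# The load misses the cache (200 cycles); add and mul are independent of it.
instrs = [
    ("load", "r1", [], 200),          # load r1, [address] -- cache miss
    ("add",  "r2", ["r3", "r4"], 1),
    ("mul",  "r5", ["r6", "r7"], 1),
]

def in_order(instrs):
    """Crude in-order model: each instruction starts after the previous finishes."""
    t = 0
    for _, _, _, lat in instrs:
        t += lat
    return t

def out_of_order(instrs):
    """An instruction starts as soon as its source registers are ready."""
    ready = {}   # register -> cycle its value becomes available
    finish = 0
    for _, dst, srcs, lat in instrs:
        start = max((ready.get(r, 0) for r in srcs), default=0)
        ready[dst] = start + lat
        finish = max(finish, start + lat)
    return finish

print(in_order(instrs))      # 202 cycles: add and mul wait behind the load
print(out_of_order(instrs))  # 200 cycles: add and mul complete during the miss
```

The in-order model is deliberately pessimistic (real in-order pipelines overlap stages), but the gap it shows is the work an out-of-order core reclaims during the miss.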
The Reorder Buffer: How the CPU Tracks Out-of-Order Work
The key hardware structure is the Reorder Buffer (ROB). When instructions are decoded, they are entered into the ROB in program order and assigned to reservation stations. Reservation stations hold instructions until their input operands are ready.
When an instruction’s operands become available, it issues to an execution unit immediately, regardless of where it sits in program order. The execution unit completes the instruction and writes its result to a physical register rather than directly to the architectural register file. The instruction stays in the ROB, marked as completed but not yet committed.
Commit (writeback) happens only from the head of the ROB, in program order. This is the mechanism that preserves correctness. No matter how wildly out-of-order execution runs, the architectural register file sees results in program order.
Modern high-performance CPUs have ROBs with several hundred entries (Intel Raptor Lake: ~512, AMD Zen 4: ~320). Larger ROBs expose more instruction-level parallelism, especially when memory latency is high.
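The commit discipline can be shown with a minimal ROB sketch (instruction names are placeholders, and the structure is simplified down to a done flag per entry): execution finishes out of order, but nothing leaves the buffer until the head entry is done.

```python
from collections import deque

class ROBEntry:
    def __init__(self, name):
        self.name = name
        self.done = False

# Entries are allocated in program order at decode time.
rob = deque(ROBEntry(n) for n in ["load r1", "add r2", "mul r5"])
committed = []

def mark_done(name):
    for e in rob:
        if e.name == name:
            e.done = True

def try_commit():
    # Commit only from the head of the ROB, in program order.
    while rob and rob[0].done:
        committed.append(rob.popleft().name)

# Execution finishes out of order: add and mul before the slow load.
mark_done("add r2"); try_commit()   # nothing commits: head (load) not done
mark_done("mul r5"); try_commit()   # still nothing
mark_done("load r1"); try_commit()  # now all three commit at once

print(committed)  # ['load r1', 'add r2', 'mul r5'] -- program order restored
```

However wildly the mark_done calls are reordered, committed always comes out in program order, which is exactly the correctness guarantee the ROB provides.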
Speculative Execution and the Security Implications
Branch prediction extends out-of-order execution further. When the CPU encounters a conditional branch, it predicts the outcome and speculatively executes down the predicted path before the condition is resolved. Modern branch predictors achieve 95 to 99% accuracy, so the CPU is almost always doing useful work instead of waiting.
When a branch is mispredicted, the CPU flushes the ROB entries for speculative instructions and restarts from the correct path. The architectural state is clean because speculative results were never committed. The performance cost of a misprediction is the pipeline depth, typically 14 to 20 cycles.
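Real predictors are far more sophisticated than this, but the classic 2-bit saturating counter is enough to show why regular branches (like a loop back-edge) predict well and data-dependent random branches do not. This is a textbook sketch, not a model of any shipping predictor:

```python
import random

def predict_accuracy(outcomes):
    """2-bit saturating counter: states 0-3, predict taken when state >= 2."""
    state, correct = 2, 0
    for taken in outcomes:
        if (state >= 2) == taken:
            correct += 1
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return correct / len(outcomes)

loop = [True] * 99 + [False]   # loop back-edge: taken 99 times, then the exit
rng = random.Random(0)
coin = [rng.random() < 0.5 for _ in range(10_000)]  # unpredictable branch

print(predict_accuracy(loop))  # 0.99: only the loop exit mispredicts
print(predict_accuracy(coin))  # ~0.5: every other branch pays the flush cost
```

At ~50% accuracy on random data, a tight loop pays the 14 to 20 cycle flush on roughly every second iteration, which is the effect behind the binary-search example below.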
Spectre and Meltdown, publicly disclosed in 2018, demonstrated that speculative execution leaves traces in the cache even when the ROB flush removes all architectural effects. An attacker can measure cache timing to infer what data the speculative execution accessed, including data the attacker is not supposed to read. The fundamental tension is that the cache is a side channel that persists through ROB flushes.
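The principle (not an exploit, just a simulation of the state machines involved) can be sketched with a fake cache that survives a speculative rollback; every name and value here is illustrative:

```python
# Toy model of the Spectre principle: microarchitectural state (the cache)
# persists even though architectural results are rolled back.
cache = set()    # set of cached line indices
secret = 42      # a value the attacker is not allowed to read directly

def speculative_access():
    # The CPU speculatively uses the secret as an index, touching that line.
    cache.add(secret)
    # Misprediction detected: the ROB flush discards the architectural
    # results -- but the cache line it brought in stays resident.

def probe():
    # The attacker times accesses to all 256 lines; the resident one is fast.
    return [i for i in range(256) if i in cache]

speculative_access()
print(probe())  # [42] -- the secret leaks through the cache side channel
```

The rollback is architecturally perfect and still leaks, which is why the fix has to constrain speculation itself rather than the commit logic.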
Mitigations (retpoline, IBRS, STIBP) trade some speculative execution performance for security. Systems handling sensitive data in shared execution environments (cloud VMs, browsers executing untrusted JavaScript) require these mitigations and pay the performance cost.
What This Means For You
- Write code with independent operations grouped together when performance-critical, because the CPU can only exploit instruction-level parallelism when nearby instructions within its scheduling window do not depend on each other’s results.
- Understand that branch misprediction costs 14 to 20 cycles, which means tight loops with unpredictable branches (like binary search on random data) can be slower than loops with more work but predictable branches.
- Enable Spectre/Meltdown mitigations on multi-tenant systems regardless of the performance cost, because the security risk of a shared execution environment without mitigations is not theoretical and has been demonstrated with practical exploits.
- Use performance counters to measure IPC (instructions per cycle) rather than guessing where pipeline stalls occur, because a CPU running at 1.2 IPC on code that should achieve 3.0 IPC is telling you that dependency chains or cache misses are eating your throughput.
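The dependency-chain point from the first bullet can be made concrete with a standard loop transformation. Python itself won’t show the hardware speedup (interpreter overhead dominates), but the same rewrite in compiled code hands the out-of-order core four independent chains instead of one; the function names and chain count are choices for this sketch:

```python
data = list(range(1, 1001))

def single_chain(xs):
    # One accumulator: every add depends on the result of the previous add,
    # so the adds must execute serially no matter how wide the CPU is.
    total = 0
    for x in xs:
        total += x
    return total

def split_chains(xs):
    # Four independent accumulators: compiled, this gives the out-of-order
    # core four dependency chains it can execute in parallel.
    a = b = c = d = 0
    n = len(xs) - len(xs) % 4
    for i in range(0, n, 4):
        a += xs[i]; b += xs[i + 1]; c += xs[i + 2]; d += xs[i + 3]
    for x in xs[n:]:   # leftover elements when len(xs) % 4 != 0
        a += x
    return a + b + c + d

print(single_chain(data) == split_chains(data))  # True: same sum, more ILP
```

Optimizing compilers often apply this transformation automatically for simple reductions, but only when language rules permit reassociation, which is one reason measuring IPC beats guessing.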
Enjoyed this deep dive? Join my inner circle:
- Pithy Cyborg | AI News Made Simple → AI news made simple without hype.
