Both Quicksort and Mergesort have O(n log n) average time complexity, but Quicksort is faster in practice on most hardware. The reason is cache behavior. Quicksort operates in-place and accesses memory sequentially within partitions. Mergesort allocates auxiliary memory and jumps between the original and scratch arrays, producing cache misses that cost far more than the asymptotic analysis predicts.
Analysis Briefing
- Topic: Quicksort vs Mergesort cache performance in practice
- Analyst: Mike D (@MrComputerScience)
- Context: Sparked by a question from Claude Sonnet 4.6
- Source: Pithy Cyborg | AI News Made Simple
- Key Question: If they have the same complexity, why does Quicksort show up faster in every benchmark?
Why the Asymptotic Analysis Misses the Real Performance Story
Big-O notation counts operations. It does not count the cost of those operations. On modern hardware, a miss to main memory costs roughly 100 to 300 CPU cycles, while an L1 cache hit costs about 4. An algorithm that does more operations but keeps its data in cache can outperform one that does fewer operations but generates cache misses.
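The effect is visible even from Python, though interpreter overhead blunts it. This sketch does the exact same number of additions over the same buffer, sequentially and in shuffled order; only the memory access pattern differs (timings will vary widely by machine, so no numbers are claimed):

```python
import array
import random
import time

n = 1_000_000
data = array.array("q", range(n))  # one contiguous buffer of 8-byte ints

seq_idx = list(range(n))
rand_idx = seq_idx[:]
random.shuffle(rand_idx)

def total(indices):
    # Identical operation count either way; only the access pattern differs.
    s = 0
    for i in indices:
        s += data[i]
    return s

t0 = time.perf_counter()
seq_sum = total(seq_idx)
t_seq = time.perf_counter() - t0

t0 = time.perf_counter()
rand_sum = total(rand_idx)
t_rand = time.perf_counter() - t0

assert seq_sum == rand_sum  # same work, different memory traffic
```

In a lower-level language the sequential pass is typically several times faster; in CPython the gap is smaller because bytecode dispatch dominates, which is itself a reminder that constant factors depend on the whole stack, not just the algorithm.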
Mergesort requires O(n) auxiliary memory. During the merge step, the algorithm reads from two sorted sub-arrays and writes to a separate output array. The input and output arrays compete for cache space, and for large inputs the combined working set no longer fits, producing a steady stream of cache misses that the asymptotic analysis never counts.
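A minimal merge step makes the auxiliary traffic concrete. This is a standard textbook merge, not code from any particular library; note that every element passes through a separate output buffer:

```python
def merge(left: list, right: list) -> list:
    """Merge two sorted lists into a new list (the O(n) auxiliary memory)."""
    out = []  # separate output buffer: three memory regions in play at once
    i = j = 0
    while i < len(left) and j < len(right):
        # Each step touches both inputs plus the output, so the three
        # regions compete for the same cache lines.
        if left[i] <= right[j]:
            out.append(left[i])
            i += 1
        else:
            out.append(right[j])
            j += 1
    out.extend(left[i:])
    out.extend(right[j:])
    return out

merge([1, 3, 5], [2, 4, 6])  # → [1, 2, 3, 4, 5, 6]
```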
Quicksort partitions the array in-place. The partitioning step scans left to right, touching memory in the same sequential pattern that CPU prefetchers are optimized for. The partition fits entirely in cache for small enough sub-arrays. Most production Quicksort implementations switch to insertion sort below a threshold (typically 16 to 32 elements) precisely because insertion sort has excellent cache behavior on small, nearly sorted arrays.
The Worst Case and How Production Implementations Avoid It
Quicksort’s O(n²) worst case occurs when every partition step selects the minimum or maximum element as the pivot. On sorted or nearly sorted input, a naive implementation that always picks the first element as pivot degrades to quadratic time.
Production implementations use three strategies to avoid this. Median-of-three pivot selection picks the median of the first, middle, and last elements. This prevents worst-case behavior on sorted input. Randomized pivot selection chooses the pivot uniformly at random, making worst-case behavior statistically improbable regardless of input order. Introsort, used in C++ std::sort and Rust’s slice::sort_unstable, starts with Quicksort and falls back to Heapsort when recursion depth exceeds a threshold, guaranteeing O(n log n) worst case.
import random

def quicksort(arr: list, low: int = 0, high: int | None = None) -> None:
    if high is None:
        high = len(arr) - 1
    if low < high:
        pivot_idx = partition(arr, low, high)
        quicksort(arr, low, pivot_idx - 1)
        quicksort(arr, pivot_idx + 1, high)

def partition(arr: list, low: int, high: int) -> int:
    # Randomized pivot to avoid worst case on sorted input
    rand_idx = random.randint(low, high)
    arr[rand_idx], arr[high] = arr[high], arr[rand_idx]
    pivot = arr[high]
    i = low - 1
    for j in range(low, high):
        if arr[j] <= pivot:
            i += 1
            arr[i], arr[j] = arr[j], arr[i]
    arr[i + 1], arr[high] = arr[high], arr[i + 1]
    return i + 1
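The code above uses the randomized strategy. The median-of-three alternative only swaps the pivot selection and leaves the partition loop untouched; this is a sketch, not taken from any particular library:

```python
def median_of_three(arr: list, low: int, high: int) -> int:
    """Return the index of the median of arr[low], arr[mid], arr[high]."""
    mid = (low + high) // 2
    a, b, c = arr[low], arr[mid], arr[high]
    if a <= b <= c or c <= b <= a:
        return mid
    if b <= a <= c or c <= a <= b:
        return low
    return high

def partition_mo3(arr: list, low: int, high: int) -> int:
    # Same Lomuto partition as before, with median-of-three pivot selection
    # instead of a random pivot. Defeats sorted and reverse-sorted inputs.
    m = median_of_three(arr, low, high)
    arr[m], arr[high] = arr[high], arr[m]
    pivot = arr[high]
    i = low - 1
    for j in range(low, high):
        if arr[j] <= pivot:
            i += 1
            arr[i], arr[j] = arr[j], arr[i]
    arr[i + 1], arr[high] = arr[high], arr[i + 1]
    return i + 1
```

On an already sorted array the median of first, middle, and last is the true median of those three positions, so the split stays balanced instead of degrading to n recursion levels.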
When Mergesort Is Actually the Right Choice
Mergesort has two properties that Quicksort lacks. First, it is stable: elements with equal keys maintain their relative order. Second, it guarantees O(n log n) worst case without the introsort fallback complexity.
Java’s Arrays.sort for objects uses Timsort, a hybrid of Mergesort and insertion sort. The stability requirement for object sorting makes Quicksort unsuitable there. Python’s built-in sorted() also uses Timsort for the same reason.
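Stability is easy to see with Python's built-in Timsort: sorting by one key preserves the prior relative order of ties. The record data here is made up purely for illustration:

```python
records = [("alice", 3), ("bob", 1), ("carol", 3), ("dave", 1)]

# Sort by the numeric key only. Because sorted() is stable, ties keep
# their original relative order: bob before dave, alice before carol.
by_score = sorted(records, key=lambda r: r[1])
# → [("bob", 1), ("dave", 1), ("alice", 3), ("carol", 3)]
```

An unstable sort would be free to emit `dave` before `bob`, which is exactly what makes it unsuitable for the object-sorting contract Java and Python commit to.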
For external sorting, where data does not fit in memory and must be sorted from disk, Mergesort is the standard approach. The merge step maps cleanly onto sequential disk reads, and cache behavior is irrelevant when the bottleneck is I/O bandwidth rather than CPU cycles.
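The external-sort pattern, sort runs that fit in memory, then stream-merge them, can be sketched in memory with the standard library's heapq.merge, which performs exactly the k-way merge step. File I/O is omitted and the chunk size is deliberately tiny for illustration:

```python
import heapq

def external_sort(items: list, chunk_size: int = 4) -> list:
    # Phase 1: sort each chunk that "fits in memory".
    runs = [sorted(items[i:i + chunk_size])
            for i in range(0, len(items), chunk_size)]
    # Phase 2: k-way merge of the sorted runs. With real files, each run
    # would be a generator reading sequentially from disk, so the merge
    # turns into a stream of sequential reads.
    return list(heapq.merge(*runs))

external_sort([9, 4, 7, 1, 8, 2, 6, 3, 5])  # → [1, 2, 3, 4, 5, 6, 7, 8, 9]
```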
Use Quicksort (or your standard library’s unstable sort) for primitive types and performance-critical code. Use Mergesort or Timsort when stability matters or when sorting objects where the comparison function is expensive.
What This Means For You
- Use your standard library’s sort, which is already Introsort or Timsort with all the optimizations applied, rather than implementing either algorithm yourself for production code.
- Understand that O(n log n) does not mean equally fast when you benchmark sorting algorithms, because cache behavior, branch prediction, and auxiliary memory allocation all affect real performance in ways that asymptotic analysis ignores.
- Choose unstable sort for primitive numeric arrays where you need maximum throughput, and stable sort when the relative order of equal elements must be preserved after sorting.
- Profile before assuming sort is your bottleneck, because on arrays under 10,000 elements the sort time is rarely the performance problem and the real issue is usually elsewhere in the pipeline.
Enjoyed this deep dive? Join my inner circle:
- Pithy Cyborg | AI News Made Simple → AI news made simple without hype.
