Transformer attention scales quadratically because every token attends to every other token. Double the sequence length and attention computation quadruples. This is not a bug or an oversight in the original design. It is a direct consequence of how attention works, and it is the primary reason extending context windows beyond 100K tokens requires architectural changes rather than just more compute.
Pithy Cyborg | AI FAQs – The Details
Question: Why does transformer attention scale quadratically with sequence length — and what are researchers doing about it?
Asked by: Claude Sonnet 4.6
Answered by: Mike D (MrComputerScience)
From Pithy Cyborg | AI News Made Simple
And Pithy Security | Cybersecurity News
Where the Quadratic Cost Actually Comes From
The attention mechanism computes a similarity score between every pair of tokens in the sequence. For a sequence of length n, that means n × n comparisons: token 1 against tokens 1 through n, token 2 against tokens 1 through n, and so on. The result is an n × n attention matrix.
This matrix must be computed, stored, and used to weight value vectors for every attention head in every layer. With 32 attention heads and 32 layers, a single forward pass produces 1,024 attention matrices, each one n × n floats. At sequence length 1,000 that is manageable. At sequence length 100,000 each attention matrix holds 10 billion entries, roughly 40 GB in fp32. That does not fit in GPU memory at any reasonable batch size.
The time complexity is O(n²d) where d is the model’s hidden dimension, and the memory complexity is O(n²) just for the attention matrices. Both scale quadratically with n. This means processing a 1 million token document requires roughly a million times more memory and compute than processing a 1,000 token document, not the 1,000 times that linear scaling would imply.
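The arithmetic above is easy to check directly. Here is a minimal NumPy sketch of standard scaled dot-product attention that materializes the full n × n score matrix, followed by the memory math for a single fp32 matrix (function and variable names are illustrative, not from any particular framework):

```python
import numpy as np

def naive_attention(Q, K, V):
    """Standard scaled dot-product attention, materializing the full n x n score matrix."""
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                  # n x n similarity scores
    scores -= scores.max(axis=1, keepdims=True)    # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)  # softmax over each row
    return weights @ V                             # n x d output

# Memory for a single fp32 attention matrix: n * n * 4 bytes.
# n = 1,000 -> 0.004 GB; n = 100,000 -> 40 GB.
for n in (1_000, 100_000):
    print(f"n = {n:>7,}: {n * n * 4 / 1e9:g} GB per attention matrix")
```

The `scores` array is the quadratic bottleneck: everything else in the function is O(nd) or O(nd²).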
Flash Attention, introduced by Dao et al. in 2022 and now the standard implementation in every major framework, addresses the memory problem without changing the mathematical complexity. It reorders the attention computation to avoid materializing the full n × n matrix in GPU high-bandwidth memory, using tiled computation in SRAM instead. This reduces memory from O(n²) to O(n) in practice and improves throughput by 2 to 4x. The algorithmic complexity remains O(n²d), but Flash Attention makes the constants small enough to handle sequences up to around 100K to 200K tokens on current hardware.
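The core trick in Flash Attention is an online softmax: keys and values are processed in tiles, and a running row maximum and running denominator let each tile's partial results be rescaled and accumulated without ever holding the full n × n matrix. The toy NumPy sketch below shows only that tiling idea; the real kernel fuses this loop into GPU SRAM and includes a backward pass:

```python
import numpy as np

def tiled_attention(Q, K, V, tile=64):
    """Flash-Attention-style tiled computation with an online softmax.
    Educational sketch only: never materializes the n x n score matrix."""
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros((n, d))
    row_max = np.full(n, -np.inf)   # running max of scores per query row
    row_sum = np.zeros(n)           # running softmax denominator per row
    for start in range(0, n, tile):
        Kt, Vt = K[start:start + tile], V[start:start + tile]
        s = Q @ Kt.T * scale                        # n x tile block of scores
        new_max = np.maximum(row_max, s.max(axis=1))
        correction = np.exp(row_max - new_max)      # rescale previously accumulated partials
        p = np.exp(s - new_max[:, None])            # unnormalized weights for this tile
        row_sum = row_sum * correction + p.sum(axis=1)
        out = out * correction[:, None] + p @ Vt
        row_max = new_max
    return out / row_sum[:, None]                   # normalize once at the end
```

The peak intermediate here is n × tile rather than n × n, which is exactly the O(n²) → O(n) memory reduction described above; the total number of multiply-adds is unchanged.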
The Main Research Approaches to Sub-Quadratic Attention
Eliminating the quadratic bottleneck entirely requires changing what attention computes, and researchers have taken several distinct approaches with different tradeoffs.
Sparse attention restricts each token to attending only to a subset of other tokens rather than all n. Longformer uses a sliding window of local attention plus global attention for special tokens. BigBird adds random attention on top of local and global patterns. Both achieve O(n) complexity but sacrifice the ability to capture arbitrary long-range dependencies in a single layer. The model must propagate information across multiple layers to connect distant tokens, which works for many tasks but degrades on tasks requiring direct long-range comparison.
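The sparsity patterns are easiest to see as a boolean mask over the n × n grid. Below is a rough sketch of a Longformer-style pattern (local sliding window plus a few global tokens); the window size and choice of global tokens are illustrative parameters, not values from the paper:

```python
import numpy as np

def sliding_window_mask(n, window=4, global_tokens=(0,)):
    """Longformer-style sparsity pattern: each token attends to a local window,
    plus designated global tokens that attend (and are attended to) everywhere.
    Returns a boolean n x n mask; True means attention is allowed."""
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        mask[i, lo:hi] = True        # local band around the diagonal
    for g in global_tokens:
        mask[g, :] = True            # global token sees every position
        mask[:, g] = True            # every position sees the global token
    return mask

m = sliding_window_mask(16, window=2)
print(m.sum(), "allowed pairs out of", 16 * 16)
```

Because the band has fixed width, allowed pairs grow as O(n · window) rather than O(n²); the cost is that two distant non-global tokens can only exchange information across multiple layers.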
Linear attention replaces the softmax operation with a kernel function, which allows the computation to be reassociated so that the intermediate product is a d × d matrix rather than n × n. This achieves O(n) complexity but changes the expressiveness of what attention can represent. Models like Performer and cosFormer use this approach. Empirically, linear attention models underperform standard attention on language modeling at scale, which is why no frontier language model uses pure linear attention in 2026.
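The reassociation is worth seeing concretely: without softmax, (φ(Q)φ(K)ᵀ)V can be computed as φ(Q)(φ(K)ᵀV), and the parenthesized term is d × d regardless of sequence length. The feature map φ below is a generic positive map for illustration, not the kernel from Performer or any specific paper:

```python
import numpy as np

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0) + 1e-6):
    """Kernelized attention: compute phi(Q) @ (phi(K).T @ V) so the
    intermediate is d x d instead of n x n. phi is an illustrative
    positive feature map standing in for a paper-specific kernel."""
    Qp, Kp = phi(Q), phi(K)
    kv = Kp.T @ V                        # d x d, independent of sequence length n
    normalizer = Qp @ Kp.sum(axis=0)     # per-row normalization, shape (n,)
    return (Qp @ kv) / normalizer[:, None]
```

This is algebraically identical to forming the full n × n kernel weight matrix and normalizing its rows, which is why the complexity drops to O(nd²); what changes is expressiveness, since φ(q)·φ(k) cannot reproduce the softmax's sharp, exponential weighting.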
State space models (SSMs) take a different approach entirely by replacing attention with a recurrent formulation that processes sequences in O(n) time and O(1) memory per token during inference. Mamba, introduced in late 2023, demonstrated competitive language modeling performance with significantly better long-sequence efficiency. A companion FAQ, “What role sparse attention plays in scaling transformer models beyond 1M token contexts,” explores these tradeoffs in detail.
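The O(1)-memory-per-token property comes from the recurrence itself: the model carries a fixed-size hidden state forward instead of attending back over all previous tokens. This stripped-down sketch shows a plain linear SSM over a scalar input sequence; Mamba's actual contribution, making A, B, and C input-dependent (selective), is deliberately omitted here:

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Minimal linear state space recurrence:
        h_t = A h_{t-1} + B x_t,   y_t = C h_t.
    O(n) time and O(1) state per step -- the efficiency property SSMs
    exploit. Mamba additionally makes A, B, C depend on the input."""
    h = np.zeros(A.shape[0])        # fixed-size hidden state
    y = np.empty(len(x))
    for t, xt in enumerate(x):
        h = A @ h + B * xt          # constant-size state update
        y[t] = C @ h                # readout
    return y
```

Contrast this with attention: at step t the SSM touches only `h`, while attention would touch all t previous keys and values, which is where the quadratic total cost comes from.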
Where the Field Actually Stands in 2026
No single approach has displaced standard attention for frontier language models, but the landscape has shifted significantly.
Hybrid architectures combining attention layers with SSM or linear attention layers are now common in open-weight models. Jamba (AI21 Labs) interleaves Mamba SSM blocks with transformer attention blocks. Mistral’s Codestral uses sliding window attention combined with full attention on specific layers. These hybrids capture the efficiency of sub-quadratic methods on most tokens while preserving full attention for the positions where it matters most.
For context windows beyond 1M tokens, the practical solutions in production are not sub-quadratic attention but engineering workarounds: retrieval-augmented generation (RAG) to avoid processing the full context in one pass, hierarchical summarization to compress long documents before attention, and speculative decoding to amortize the cost across generated tokens.
Anthropic’s Claude, Google’s Gemini 1.5, and OpenAI’s GPT-4o all support context windows of 128K to 1M tokens using Flash Attention variants combined with position encoding tricks like RoPE (Rotary Position Embeddings) and YaRN (Yet another RoPE extensioN). These are engineering solutions that push the quadratic wall further out rather than removing it. The cost per token still scales quadratically. The hardware has simply gotten fast enough to make 128K tokens tractable at inference time.
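RoPE is the piece of that stack most amenable to a short sketch: it rotates each pair of query/key dimensions by an angle proportional to the token's position, so the dot product between a rotated query and key depends only on their relative distance. Extension tricks like YaRN work by rescaling these rotation frequencies; the sketch below shows only the basic rotation, with the `rotate-half` pairing convention chosen for simplicity:

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Rotary Position Embeddings (basic form): rotate each dimension pair
    (x_i, x_{i+d/2}) by an angle position * freq_i. Illustrative sketch;
    real implementations fold this into the attention kernel."""
    n, d = x.shape
    half = d // 2
    freqs = base ** (-np.arange(half) / half)   # one frequency per dimension pair
    angles = np.outer(positions, freqs)         # n x d/2 position-dependent angles
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # standard 2D rotation applied pairwise
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=1)
```

The property that makes this useful for context extension is relative encoding: rotating q at position m and k at position n yields a score that depends only on n − m, so shifting both positions by the same offset leaves attention scores unchanged.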
What This Means For You
- Use RAG instead of stuffing entire documents into context for tasks that require querying large corpora. The quadratic cost of 500K token contexts is real and expensive even when the API makes it look simple.
- Understand Flash Attention’s role. It reduces memory overhead without changing algorithmic complexity, so it helps with throughput but does not make unbounded context windows free.
- Watch hybrid SSM-attention architectures for long-document tasks. Mamba-based models process long sequences at a fraction of the cost of pure transformer models and the quality gap is narrowing in 2026.
- Profile attention layer costs specifically when optimizing transformer inference. Attention dominates compute at long sequence lengths while feedforward layers dominate at short ones, which changes your optimization priorities.
- Do not conflate context window size with context window quality. Models lose coherence on information in the middle of very long contexts even when the architecture supports the length, a problem known as the lost-in-the-middle effect that quadratic scaling does not address.
Pithy Cyborg | AI News Made Simple
Subscribe (Free): https://pithycyborg.substack.com/subscribe
Read archives (Free): https://pithycyborg.substack.com/archive
Pithy Security | Cybersecurity News
Subscribe (Free): https://pithysecurity.substack.com/subscribe
Read archives (Free): https://pithysecurity.substack.com/archive
