When a large language model predicts the next token, it runs your entire input through dozens of transformer layers, each one computing attention scores that determine which previous tokens influence the prediction. The result is a probability distribution over the entire vocabulary. The model samples from that distribution and outputs one token at a time.
Analysis Briefing
- Topic: LLM next-token prediction and attention mechanics
- Analyst: Mike D (@MrComputerScience)
- Context: A structured investigation kicked off by Claude Sonnet 4.6
- Source: Pithy Cyborg | AI News Made Simple
- Key Question: What is the transformer actually computing when it decides what word comes next?
From Tokens to Vectors: The Input Representation
Before attention can run, the model converts your text into tokens using a vocabulary of roughly 50,000 to 100,000 subword units. Each token gets mapped to a high-dimensional vector called an embedding, typically 4,096 to 12,288 dimensions in a frontier model. These embeddings are learned during training and encode semantic meaning. The word “king” lives near “queen” and “monarch” in that space.
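As a minimal sketch of this lookup step (toy vocabulary and dimensions here; real models use learned BPE vocabularies of roughly 50,000 to 100,000 entries and thousands of dimensions, and the names below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary and embedding table (real models: ~50k-100k tokens, 4,096+ dims).
vocab = {"the": 0, "king": 1, "queen": 2, "sat": 3}
d_model = 8                                  # embedding dimension (toy size)
embedding_table = rng.normal(size=(len(vocab), d_model))

def embed(tokens):
    """Map tokens to their learned embedding vectors via table lookup."""
    ids = [vocab[t] for t in tokens]
    return embedding_table[ids]              # shape: (seq_len, d_model)

x = embed(["the", "king", "sat"])            # (3, 8) matrix of token vectors
```

In a trained model the rows of `embedding_table` are learned parameters, which is why "king" and "queen" end up nearby; here they are random placeholders.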
Positional information is injected next. Unlike an RNN, a transformer has no inherent sense of word order, so each embedding is combined with a positional encoding: the original architecture added fixed sinusoidal vectors directly to the embeddings, while many modern models instead apply rotary encodings (RoPE) inside the attention computation. Either way, the model knows that token 5 came before token 6.
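The original fixed sinusoidal scheme can be sketched in a few lines (purely illustrative; as noted, many modern models use rotary or learned encodings instead):

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Fixed sinusoidal positional encodings from the original transformer design."""
    pos = np.arange(seq_len)[:, None]            # (seq_len, 1) position indices
    i = np.arange(d_model // 2)[None, :]         # (1, d_model/2) dimension indices
    freqs = pos / (10000 ** (2 * i / d_model))   # each dim pair gets its own frequency
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(freqs)                  # even dims: sine
    pe[:, 1::2] = np.cos(freqs)                  # odd dims: cosine
    return pe

pe = sinusoidal_positions(3, 8)
# Adding pe to the token embeddings injects order information:
# x = embed(tokens) + pe
```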
The stack of token embeddings becomes the input matrix to the first transformer layer. Every layer sees the full sequence every time.
How Attention Decides Which Tokens Matter
Each transformer layer runs multi-head self-attention. For every token, the layer computes three vectors: a Query, a Key, and a Value. The Query for a given token asks "what am I looking for?" The Key for every token in the sequence, including the current one, answers "what do I contain?" The dot product of a Query with all Keys produces attention scores; after scaling by the square root of the Key dimension and softmax normalization, these scores become weights over the Value vectors.
The output for each token is a weighted sum of all Value vectors, weighted by those attention scores. High attention weight means that token’s context strongly influenced the output. A token like “it” in the sentence “The trophy didn’t fit in the suitcase because it was too big” needs to attend strongly to “trophy” or “suitcase” depending on meaning. The attention mechanism learns to resolve that during training.
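A minimal single-head version of this computation, assuming the causal mask that decoder-only LLMs use (each token may only attend to itself and earlier tokens; all weights below are random placeholders, not learned parameters):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(X, W_q, W_k, W_v, causal=True):
    """Single-head self-attention over a (seq_len, d_model) input."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # (seq_len, seq_len) attention scores
    if causal:
        # Mask out future positions so each token sees only its past.
        mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(mask, -1e9, scores)
    weights = softmax(scores)                 # each row sums to 1
    return weights @ V, weights               # weighted sum of Value vectors

rng = np.random.default_rng(0)
d_model, d_k = 8, 4
X = rng.normal(size=(3, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out, w = scaled_dot_product_attention(X, W_q, W_k, W_v)
```

Row `w[i]` is exactly the set of attention weights described above: how much each earlier token's Value contributes to token `i`'s output.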
Multiple attention heads run in parallel, each learning different relationship patterns. One head might track syntactic dependencies. Another might track coreference. Their outputs are concatenated and projected back to the model dimension.
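The concatenate-and-project wiring can be sketched as a toy loop (random placeholder weights, causal masking omitted for brevity; real implementations fuse the heads into batched matrix multiplies):

```python
import numpy as np

rng = np.random.default_rng(1)
seq_len, d_model, n_heads = 3, 8, 2
d_head = d_model // n_heads                   # each head works in a smaller subspace
X = rng.normal(size=(seq_len, d_model))

head_outputs = []
for h in range(n_heads):
    # Each head has its own Query/Key/Value projections.
    W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(d_head)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    head_outputs.append(weights @ V)          # (seq_len, d_head)

# Concatenate the heads and project back to the model dimension.
concat = np.concatenate(head_outputs, axis=-1)   # (seq_len, d_model)
W_o = rng.normal(size=(d_model, d_model))
out = concat @ W_o                                # (seq_len, d_model)
```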
The Probability Distribution and Sampling
After passing through all transformer layers (typically 32 to 96 in a frontier model), the final hidden state for the last token position gets projected by a linear layer to the vocabulary size. A softmax turns those raw scores into probabilities. The model now has a probability for every token in its vocabulary for what comes next.
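Sketched with toy sizes (real vocabularies run to ~100,000 entries, and the weight names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
d_model, vocab_size = 8, 10            # toy sizes for illustration
h_last = rng.normal(size=(d_model,))   # final hidden state at the last token position
W_unembed = rng.normal(size=(d_model, vocab_size))

logits = h_last @ W_unembed            # one raw score per vocabulary token
probs = np.exp(logits - logits.max())  # softmax (max subtracted for stability)
probs /= probs.sum()                   # a probability distribution over the vocab
```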
Sampling behavior is controlled by temperature and top-p. Temperature below 1.0 sharpens the distribution, making the model more deterministic. Temperature above 1.0 flattens it, making outputs more varied. Top-p (nucleus sampling) restricts sampling to the smallest set of tokens whose cumulative probability exceeds p. A top-p of 0.9 means the model only samples from tokens that together account for 90% of the probability mass.
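Both knobs can be sketched together; the function name and tie-breaking details below are illustrative, but the math follows the description above:

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_p=0.9, rng=None):
    """Temperature scaling followed by nucleus (top-p) sampling."""
    rng = rng or np.random.default_rng()
    if temperature == 0:                       # greedy decoding: always the argmax
        return int(np.argmax(logits))
    z = logits / temperature                   # <1 sharpens, >1 flattens
    probs = np.exp(z - z.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]            # most probable tokens first
    cumulative = np.cumsum(probs[order])
    # Smallest prefix whose cumulative probability exceeds top_p.
    cutoff = int(np.searchsorted(cumulative, top_p)) + 1
    keep = order[:cutoff]
    kept = probs[keep] / probs[keep].sum()     # renormalize within the nucleus
    return int(rng.choice(keep, p=kept))

logits = np.array([2.0, 1.0, 0.5, -1.0])
greedy = sample_next_token(logits, temperature=0)   # picks index 0, the argmax
```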
The sampled token is appended to the input and the model runs another forward pass to produce the next token. This is why inference is sequential and why generating 500 tokens takes 500 forward passes. Naively, each pass would reprocess the entire sequence; KV caching stores the Key and Value vectors for previously seen tokens so each new step only computes attention for the newest one. That caching is the primary optimization that makes inference tractable.
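A toy single-head decode step shows why the cache helps: only the newest token's projections are computed, while past Keys and Values are reused (real implementations store per-layer, per-head tensors on the accelerator; everything here is a simplified sketch):

```python
import numpy as np

def attend_with_cache(x_new, W_q, W_k, W_v, cache):
    """One decode step: project only the newest token, reuse cached K/V."""
    q = x_new @ W_q
    cache["K"].append(x_new @ W_k)           # past Keys/Values are never recomputed
    cache["V"].append(x_new @ W_v)
    K = np.stack(cache["K"])                 # (steps_so_far, d_head)
    V = np.stack(cache["V"])
    scores = K @ q / np.sqrt(len(q))         # attend over the full cached history
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V                             # output for the newest position

rng = np.random.default_rng(3)
d_model, d_head = 8, 4
W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
cache = {"K": [], "V": []}
# One decode step per generated token; the cache grows by one entry each step.
for step in range(5):
    x_new = rng.normal(size=(d_model,))
    out = attend_with_cache(x_new, W_q, W_k, W_v, cache)
```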
What This Means For You
- Set temperature to 0 (greedy decoding) for maximally consistent outputs in production pipelines where repeatability matters, because temperature controls randomness and the default is rarely what you want for structured tasks. Note that some serving stacks remain slightly non-deterministic even at temperature 0.
- Understand that longer prompts increase inference cost quadratically, because attention computes pairwise relationships between all tokens and the computation scales with sequence length squared.
- Use KV caching when making multiple calls with the same system prompt, because caching the Key and Value matrices for the static prefix eliminates the most expensive part of repeated inference.
- Treat model outputs as samples from a distribution, not deterministic answers, because the same prompt at temperature 0.7 will produce meaningfully different outputs across runs.
Enjoyed this deep dive? Join my inner circle:
- Pithy Cyborg | AI News Made Simple → AI news made simple without hype.
