Attention mechanism
Attention is the operation in transformer models where each token computes a weighted relevance score to every other token in the sequence. The mechanism that lets a model 'pay attention to' the right parts of context.
What it means
Self-attention takes three projections of each input vector: query (Q), key (K), and value (V). For each token, the model computes the dot product of its query with every token's key, scales the scores by √d_k, softmaxes them, and uses the resulting weights to mix the V vectors. The result is a context-aware representation in which each token has 'looked at' relevant context. Multi-head attention runs several attention operations in parallel, letting the model attend to different relationships at once. The operation is quadratic in sequence length, which is why context windows above 2M tokens are computationally expensive.
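The steps above can be sketched in a few lines of NumPy. This is a minimal, single-head illustration (toy sizes, random weights), not production code:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V -- one context-mixed vector per token."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # (n, n) relevance scores
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)            # row-wise softmax
    return weights @ V

# toy example: 4 tokens, 8-dimensional embeddings, random learned projections
rng = np.random.default_rng(0)
n, d = 4, 8
x = rng.standard_normal((n, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
out = scaled_dot_product_attention(x @ Wq, x @ Wk, x @ Wv)
print(out.shape)  # (4, 8)
```

Note the (n, n) `scores` matrix: that intermediate is where the quadratic cost lives.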
Why it matters
Attention is the reason LLMs can handle long-range dependencies that RNNs couldn't. It's also the source of context-window limits: attention's O(n²) compute and memory cost is what makes 10M-token contexts hard. Optimizations like FlashAttention, sparse attention, and KV-cache management directly target this bottleneck.
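A back-of-envelope calculation makes the O(n²) cost concrete. Assuming fp16 (2 bytes per entry) and materializing one full n×n score matrix per head per layer, which is what fused kernels like FlashAttention avoid:

```python
def score_matrix_gib(n, bytes_per_entry=2):
    """fp16 storage for one n x n attention-score matrix, in GiB."""
    return bytes_per_entry * n * n / 2**30

for n in (4_096, 128_000, 2_000_000):
    print(f"n = {n:>9,}: {score_matrix_gib(n):,.1f} GiB per head per layer")
```

At 4K tokens the matrix is a rounding error; at 2M tokens a single head's scores alone run to thousands of GiB, which is why long-context methods never materialize the full matrix.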
Frequently asked questions
Quadratic — is that bad?
It's the bottleneck. Most context-window improvements (FlashAttention, sliding window, sparse attention) try to dodge full O(n²) without hurting quality.
Multi-head attention?
Multiple parallel attention computations with different learned projections. Lets the model attend to syntactic relations in one head, semantic in another, etc.
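As a sketch, multi-head attention splits the model dimension into per-head subspaces, attends in each, then concatenates and projects back. This toy NumPy version (random weights, hypothetical helper names) shows the reshaping, not a real model:

```python
import numpy as np

def multi_head_attention(x, Wq, Wk, Wv, Wo, num_heads):
    """Attend separately in num_heads subspaces, concatenate, project with Wo."""
    n, d_model = x.shape
    d_head = d_model // num_heads

    def split(W):
        # project, then reshape to (heads, n, d_head)
        return (x @ W).reshape(n, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split(Wq), split(Wk), split(Wv)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # (heads, n, n)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)                          # softmax per head
    heads = w @ V                                          # (heads, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)  # re-join the heads
    return concat @ Wo

rng = np.random.default_rng(1)
n, d_model = 5, 16
x = rng.standard_normal((n, d_model))
Ws = [rng.standard_normal((d_model, d_model)) for _ in range(4)]
out = multi_head_attention(x, *Ws, num_heads=4)
print(out.shape)  # (5, 16)
```

Each head sees only a d_model/num_heads slice of the representation, which is what lets different heads specialize in different relationships.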
Related terms
- Transformer (AI architecture): the neural network architecture introduced in 2017 ('Attention Is All You Need', Vaswani et al.) that powers all modern large language models, including GPT, Claude, Gemini, and Llama. Built on self-attention, not recurrence.
- Context window: the maximum amount of text (in tokens) an AI model can process in a single request, combining your system prompt, conversation history, and output. Past the limit, the model can't 'see' earlier content.