Attention mechanism

Updated May 2026 · 4 min read

Definition

Attention is the operation in transformer models where each token computes a weighted relevance score against every other token in the sequence. It is the mechanism that lets a model 'pay attention to' the right parts of its context.

What it means

Self-attention takes three learned projections of each input vector: query (Q), key (K), and value (V). For each token, the model computes the dot product of its query with every other token's key, scales the scores by the square root of the key dimension, softmaxes them, and uses the resulting weights to average the value vectors. The result is a context-aware representation in which each token has 'looked at' the relevant context. Multi-head attention runs several of these operations in parallel, letting the model attend to different relationships at once. The score matrix is quadratic in sequence length, which is why context windows above 2M tokens are computationally expensive.
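
A minimal NumPy sketch of that computation (the shapes, variable names, and toy dimensions are illustrative, not from any particular library):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention.

    X: (n, d_model) token vectors; Wq/Wk/Wv: (d_model, d_k) learned projections.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv          # project inputs into Q, K, V
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (n, n): every token scored against every other
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V                        # weighted mix of value vectors per token

# Toy usage: 4 tokens, d_model = d_k = 8
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)   # (4, 8)
```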

Why it matters

Attention is the reason LLMs handle long-range dependencies that RNNs couldn't. It is also the source of context-window limits: attention's O(n²) compute and memory cost is what makes 10M-token contexts hard. Optimizations like FlashAttention, sparse attention, and KV-cache management directly target this bottleneck.
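
To make the quadratic cost concrete, a back-of-the-envelope calculation of the memory needed to materialize one full (n, n) score matrix in fp16, per head per layer (kernels like FlashAttention exist precisely to avoid allocating this):

```python
# Rough fp16 memory (2 bytes per entry) for a single full (n, n)
# attention score matrix, per head per layer. Illustrative arithmetic only.
for n in (4_096, 131_072, 2_000_000):
    gib = n * n * 2 / 2**30
    print(f"n = {n:>9,}: {gib:>10,.1f} GiB")
# n =     4,096:        0.0 GiB   (32 MiB)
# n =   131,072:       32.0 GiB
# n = 2,000,000:    7,450.6 GiB
```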

Frequently asked questions

Quadratic — is that bad?

It's the bottleneck. Most context-window improvements (FlashAttention, sliding-window attention, sparse attention) try to avoid paying the full O(n²) cost without hurting quality.
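
A sketch of one such trick, a causal sliding-window mask (the function name and shapes here are made up for illustration):

```python
import numpy as np

def sliding_window_mask(n, window):
    """Boolean (n, n) mask: token i may attend only to tokens j in (i - window, i].

    Scoring only the unmasked pairs costs O(n * window) instead of O(n^2).
    Real implementations fuse the mask into the attention kernel.
    """
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & (j > i - window)   # causal and local

print(sliding_window_mask(6, 3).astype(int))
```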

Multi-head attention?

Multiple parallel attention computations, each with its own learned projections. This lets the model attend to syntactic relations in one head, semantic ones in another, and so on.
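
A sketch of the usual reshape-based implementation, in the same illustrative NumPy style as the single-head example above:

```python
import numpy as np

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads):
    """Multi-head attention via reshape. X: (n, d_model); weight matrices:
    (d_model, d_model); d_model must be divisible by n_heads."""
    n, d_model = X.shape
    d_k = d_model // n_heads

    def split(M):                                    # (n, d_model) -> (n_heads, n, d_k)
        return M.reshape(n, n_heads, d_k).transpose(1, 0, 2)

    Q, K, V = split(X @ Wq), split(X @ Wk), split(X @ Wv)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)       # (n_heads, n, n)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)                          # softmax within each head
    heads = w @ V                                          # (n_heads, n, d_k)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)  # concatenate head outputs
    return concat @ Wo                                     # final output projection
```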
