Attention mechanism

Updated May 2026 · 4 min read

Definition

Attention is the operation in transformer models where each token computes a weighted relevance score against every other token in the sequence. It is the mechanism that lets a model 'pay attention to' the right parts of its context.

What it means

Self-attention takes three learned projections of each input vector: query (Q), key (K), and value (V). For each token, the model computes the dot product of its query with every other token's key, scales the scores by the square root of the key dimension, softmaxes them, and uses the resulting weights to average the value vectors. The result is a context-aware representation in which each token has 'looked at' the relevant context. Multi-head attention runs several of these operations in parallel, letting the model attend to different relationships at once. The score matrix is quadratic in sequence length, which is why context windows above 2M tokens are computationally expensive.
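
A minimal NumPy sketch of that computation (the shapes, variable names, and toy dimensions are illustrative, not from any particular library):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention.

    X: (n, d_model) token vectors; Wq/Wk/Wv: (d_model, d_k) learned projections.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv          # project inputs into Q, K, V
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (n, n): every token scored against every other
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V                        # weighted mix of value vectors per token

# Toy usage: 4 tokens, d_model = d_k = 8
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)   # (4, 8)
```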

Why it matters

Attention is the reason LLMs handle long-range dependencies that RNNs couldn't. It is also the source of context-window limits: attention's O(n²) compute and memory cost is what makes 10M-token contexts hard. Optimizations like FlashAttention, sparse attention, and KV-cache management directly target this bottleneck.
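
To make the quadratic cost concrete, a back-of-the-envelope calculation of the memory needed to materialize one full (n, n) score matrix in fp16, per head per layer (kernels like FlashAttention exist precisely to avoid allocating this):

```python
# Rough fp16 memory (2 bytes per entry) for a single full (n, n)
# attention score matrix, per head per layer. Illustrative arithmetic only.
for n in (4_096, 131_072, 2_000_000):
    gib = n * n * 2 / 2**30
    print(f"n = {n:>9,}: {gib:>10,.1f} GiB")
# n =     4,096:        0.0 GiB   (32 MiB)
# n =   131,072:       32.0 GiB
# n = 2,000,000:    7,450.6 GiB
```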

Frequently asked questions

Quadratic — is that bad?

It's the bottleneck. Most context-window improvements (FlashAttention, sliding-window attention, sparse attention) try to avoid paying the full O(n²) cost without hurting quality.
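
A sketch of one such trick, a causal sliding-window mask (the function name and shapes here are made up for illustration):

```python
import numpy as np

def sliding_window_mask(n, window):
    """Boolean (n, n) mask: token i may attend only to tokens j in (i - window, i].

    Scoring only the unmasked pairs costs O(n * window) instead of O(n^2).
    Real implementations fuse the mask into the attention kernel.
    """
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & (j > i - window)   # causal and local

print(sliding_window_mask(6, 3).astype(int))
```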

Multi-head attention?

Multiple parallel attention computations, each with its own learned projections. This lets the model attend to syntactic relations in one head, semantic ones in another, and so on.
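
A sketch of the usual reshape-based implementation, in the same illustrative NumPy style as the single-head example above:

```python
import numpy as np

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads):
    """Multi-head attention via reshape. X: (n, d_model); weight matrices:
    (d_model, d_model); d_model must be divisible by n_heads."""
    n, d_model = X.shape
    d_k = d_model // n_heads

    def split(M):                                    # (n, d_model) -> (n_heads, n, d_k)
        return M.reshape(n, n_heads, d_k).transpose(1, 0, 2)

    Q, K, V = split(X @ Wq), split(X @ Wk), split(X @ Wv)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)       # (n_heads, n, n)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)                          # softmax within each head
    heads = w @ V                                          # (n_heads, n, d_k)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)  # concatenate head outputs
    return concat @ Wo                                     # final output projection
```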
