Transformer (AI architecture)
Definition
The transformer is the neural network architecture introduced in 2017 in 'Attention Is All You Need' (Vaswani et al.) and is the basis of essentially every modern large language model, including GPT, Claude, Gemini, and Llama. It is built on self-attention rather than recurrence.
What it means
Pre-transformer NLP relied on RNNs (LSTMs, GRUs), which processed text one token at a time. Transformers process all tokens in parallel via self-attention: each token computes a weighted relevance score against every other token in the sequence. That parallelism made training on huge datasets feasible. Modern frontier LLMs are 'decoder-only' transformers (GPT-style); older translation models used encoder-decoder architectures. Variants such as mixture-of-experts and sparse attention optimize the base architecture, but the transformer remains the foundation.
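To make the attention step concrete, here is a minimal single-head, scaled dot-product self-attention sketch in NumPy. It omits masking, multi-head splitting, and learned parameters; the projection matrices and toy dimensions are arbitrary illustrations, not values from any real model.

```python
import numpy as np

def self_attention(x, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over one token sequence.

    x: (seq_len, d_model) token embeddings
    Wq, Wk, Wv: (d_model, d_k) projection matrices (random here, learned in a real model)
    """
    Q = x @ Wq                                      # queries: what each token is looking for
    K = x @ Wk                                      # keys: what each token offers
    V = x @ Wv                                      # values: the content that gets mixed
    scores = Q @ K.T / np.sqrt(K.shape[-1])         # every token scored against every other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: relevance weights per token
    return weights @ V                              # each output row is a weighted mix of all values

# Toy example: 4 tokens, 8-dimensional embeddings and head.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(x, Wq, Wk, Wv)
print(out.shape)  # (4, 8): one updated representation per token
```

Because the whole computation is a handful of matrix multiplications, every token's updated representation is produced at once, which is the parallelism that made large-scale training practical.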
Why it matters
The transformer is to AI what the relational model was to databases — the architectural breakthrough that defined the era. Understanding it (at least the attention concept) helps you reason about why LLMs are good at certain tasks (long-range pattern recognition) and bad at others (true compositional reasoning).
Frequently asked questions
Will transformers be replaced?
Possibly, eventually. State-space models such as Mamba showed promise in 2024-2025, but transformers remain dominant as of 2026. Don't bet on a near-term replacement.
Encoder vs decoder?
Decoder-only is the modern standard for LLMs (GPT, Claude, Gemini). Encoder-only models (BERT) are used for embeddings and classification; encoder-decoder models (T5) are used for translation and other sequence-to-sequence tasks.
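As a rough sketch of what 'decoder-only' changes in practice, the difference shows up as a causal mask on the attention scores: each token may attend only to itself and earlier tokens. The dimensions and scores below are made up for illustration.

```python
import numpy as np

seq_len = 5

# Encoder-style (bidirectional): every token may attend to every other token.
encoder_mask = np.ones((seq_len, seq_len), dtype=bool)

# Decoder-style (causal): token i may attend only to tokens 0..i,
# which is what lets GPT-style models generate text left to right.
decoder_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))

# Blocked positions are set to -inf so they receive zero weight after softmax.
scores = np.random.default_rng(1).normal(size=(seq_len, seq_len))
masked_scores = np.where(decoder_mask, scores, -np.inf)
print(masked_scores)
```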
Related terms
- LLM (Large Language Model): a transformer-based neural network trained on huge text datasets to predict the next token. ChatGPT, Claude, Gemini, DeepSeek are all LLMs.
- Context window: the maximum amount of text (in tokens) an AI model can process in a single request, combining your system prompt, conversation history, and output. Past the limit, the model can't 'see' earlier content (a rough token-budget sketch follows this list).
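As a back-of-the-envelope illustration of the context-window budget, the request has to fit the system prompt, the conversation history, and the output you reserve inside one limit. All numbers below, including the 128,000-token limit, are made up for the sketch and are not any particular model's figures.

```python
CONTEXT_WINDOW = 128_000        # hypothetical model limit, in tokens
system_prompt_tokens = 1_200
history_tokens = 125_000
reserved_output_tokens = 4_000

used = system_prompt_tokens + history_tokens + reserved_output_tokens
if used > CONTEXT_WINDOW:
    # Anything past the limit is simply not visible to the model,
    # so the oldest history usually has to be trimmed or summarized.
    print(f"Over budget by {used - CONTEXT_WINDOW} tokens; trim the history.")
else:
    print(f"{CONTEXT_WINDOW - used} tokens of headroom remain.")
```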