Transformer (AI architecture)
Definition
The transformer is the neural network architecture introduced in 2017 in 'Attention Is All You Need' (Vaswani et al.) and is the basis of essentially every modern large language model, including GPT, Claude, Gemini, and Llama. It is built on self-attention rather than recurrence.
What it means
Pre-transformer NLP relied on RNNs (LSTMs, GRUs), which processed text one token at a time. Transformers process all tokens in parallel via self-attention: each token computes a weighted relevance score against every other token in the sequence. That parallelism made training on huge datasets feasible. Modern frontier LLMs are 'decoder-only' transformers (GPT-style); older translation models used encoder-decoder architectures. Variants such as mixture-of-experts and sparse attention optimize the base architecture, but the transformer remains the foundation.
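To make the attention step concrete, here is a minimal single-head, scaled dot-product self-attention sketch in NumPy. It omits masking, multi-head splitting, and learned parameters; the projection matrices and toy dimensions are arbitrary illustrations, not values from any real model.

```python
import numpy as np

def self_attention(x, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over one token sequence.

    x: (seq_len, d_model) token embeddings
    Wq, Wk, Wv: (d_model, d_k) projection matrices (random here, learned in a real model)
    """
    Q = x @ Wq                                      # queries: what each token is looking for
    K = x @ Wk                                      # keys: what each token offers
    V = x @ Wv                                      # values: the content that gets mixed
    scores = Q @ K.T / np.sqrt(K.shape[-1])         # every token scored against every other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: relevance weights per token
    return weights @ V                              # each output row is a weighted mix of all values

# Toy example: 4 tokens, 8-dimensional embeddings and head.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(x, Wq, Wk, Wv)
print(out.shape)  # (4, 8): one updated representation per token
```

Because the whole computation is a handful of matrix multiplications, every token's updated representation is produced at once, which is the parallelism that made large-scale training practical.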
Why it matters
The transformer is to AI what the relational model was to databases — the architectural breakthrough that defined the era. Understanding it (at least the attention concept) helps you reason about why LLMs are good at certain tasks (long-range pattern recognition) and bad at others (true compositional reasoning).
Frequently asked questions
Will transformers be replaced?
Possibly, eventually. State-space models such as Mamba showed promise in 2024-2025, but transformers remain dominant as of 2026. Don't bet on a near-term replacement.
Encoder vs decoder?
Decoder-only is the modern standard for LLMs (GPT, Claude, Gemini). Encoder-only models (BERT) are used for embeddings and classification; encoder-decoder models (T5) are used for translation and other sequence-to-sequence tasks.
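As a rough sketch of what 'decoder-only' changes in practice, the difference shows up as a causal mask on the attention scores: each token may attend only to itself and earlier tokens. The dimensions and scores below are made up for illustration.

```python
import numpy as np

seq_len = 5

# Encoder-style (bidirectional): every token may attend to every other token.
encoder_mask = np.ones((seq_len, seq_len), dtype=bool)

# Decoder-style (causal): token i may attend only to tokens 0..i,
# which is what lets GPT-style models generate text left to right.
decoder_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))

# Blocked positions are set to -inf so they receive zero weight after softmax.
scores = np.random.default_rng(1).normal(size=(seq_len, seq_len))
masked_scores = np.where(decoder_mask, scores, -np.inf)
print(masked_scores)
```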
Related terms
- LLM (Large Language Model): a transformer-based neural network trained on huge text datasets to predict the next token. ChatGPT, Claude, Gemini, DeepSeek are all LLMs.
- Context window: the maximum amount of text (in tokens) an AI model can process in a single request, combining your system prompt, conversation history, and output. Past the limit, the model can't 'see' earlier content (a rough token-budget sketch follows this list).
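As a back-of-the-envelope illustration of the context-window budget, the request has to fit the system prompt, the conversation history, and the output you reserve inside one limit. All numbers below, including the 128,000-token limit, are made up for the sketch and are not any particular model's figures.

```python
CONTEXT_WINDOW = 128_000        # hypothetical model limit, in tokens
system_prompt_tokens = 1_200
history_tokens = 125_000
reserved_output_tokens = 4_000

used = system_prompt_tokens + history_tokens + reserved_output_tokens
if used > CONTEXT_WINDOW:
    # Anything past the limit is simply not visible to the model,
    # so the oldest history usually has to be trimmed or summarized.
    print(f"Over budget by {used - CONTEXT_WINDOW} tokens; trim the history.")
else:
    print(f"{CONTEXT_WINDOW - used} tokens of headroom remain.")
```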