Glossary · Definition
AI alignment
AI alignment is the technical field of building AI systems that pursue the goals their designers actually intended — not what the designers technically programmed. Includes both 'don't kill us all' research and practical 'don't lie / refuse to help / be useful' work.
Definition
AI alignment is the technical field of building AI systems that pursue the goals their designers actually intended — not what the designers technically programmed. Includes both 'don't kill us all' research and practical 'don't lie / refuse to help / be useful' work.
What it means
Two related but distinct concerns: (1) Inner alignment — does the model actually optimize for the training objective? (2) Outer alignment — does the training objective match what we actually want? In 2026 production AI, alignment work mostly looks like: RLHF / RLAIF, Constitutional AI, red-teaming, evaluation harnesses for refusal behavior, content policy enforcement. Existential-risk alignment is more research-flavored — interpretability, scalable oversight, faithful reasoning.
Advertisement
Why it matters
Alignment is why your assistant doesn't help you make weapons but does help you debug code. The boring practical alignment (refusing harmful requests, faithful citations, calibrated uncertainty) compounds into product trust. The ambitious alignment (don't lose control of superintelligent agents) is still active research.
Related free tools
Frequently asked questions
Is alignment 'solved' for current models?
No, just better than 5 years ago. Models still have failure modes — sycophancy, deceptive reasoning under pressure, jailbreaks. Continued research.
Best entry resource?
Anthropic's papers (CAI, scaling oversight, mechanistic interp). For broader: AI Safety Fundamentals course (free).
Related terms
- DefinitionRLHFRLHF (Reinforcement Learning from Human Feedback) is a post-training method where humans rank model outputs and the model is fine-tuned to prefer the highest-ranked outputs. The reason ChatGPT was useful at launch.
- DefinitionConstitutional AIConstitutional AI (CAI) is Anthropic's alignment technique that uses AI feedback against a written 'constitution' of principles instead of human feedback ranking. The training method behind Claude.