Constitutional AI
Definition
Constitutional AI (CAI) is Anthropic's alignment technique in which a model is trained with AI feedback judged against a written "constitution" of principles, rather than with human preference rankings. It is the training method behind Claude.
What it means
Introduced by Bai et al. (2022). The process:

1. Train a base model with supervised fine-tuning (SFT).
2. Have the model critique its own outputs against a constitution (principles such as "be helpful, harmless, and honest").
3. Use those critiques to revise the outputs, and fine-tune on the revised responses.
4. Train a preference model on AI-ranked response pairs (RLAIF) and optimize the policy against it with reinforcement learning.

The result: Claude's tendency to refuse harmful requests, hedge uncertain claims, and explain its reasoning is shaped largely by CAI rather than by direct human feedback.
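The critique-revision loop in steps 2-3 can be sketched in a few lines. This is a toy illustration, not Anthropic's implementation: `generate`, `critique`, and `revise` are stand-in stubs for what would be calls to a language model, and the principles shown are invented examples.

```python
# A minimal, runnable sketch of the constitutional critique-revision loop.
# The three helpers below are toy stubs standing in for model calls.

CONSTITUTION = [
    "Choose the response that is most helpful, honest, and harmless.",
    "Avoid responses that could assist with dangerous activities.",
]

def generate(prompt: str) -> str:
    # Stub: a real system samples an initial draft from the SFT model.
    return f"Draft answer to: {prompt}"

def critique(response: str, principle: str) -> str:
    # Stub: a real system asks the model to critique `response`
    # against `principle` and returns the critique text.
    return f"Critique under principle: {principle}"

def revise(response: str, critique_text: str) -> str:
    # Stub: a real system asks the model to rewrite the response
    # so that it addresses the critique.
    return response + " [revised]"

def constitutional_revision(prompt: str, constitution=CONSTITUTION) -> str:
    """Run one critique-revision pass per principle (steps 2-3).
    The revised transcripts become fine-tuning data in the real pipeline."""
    response = generate(prompt)
    for principle in constitution:
        c = critique(response, principle)
        response = revise(response, c)
    return response

print(constitutional_revision("How do vaccines work?"))
```

Each pass through the loop applies one principle; the final revised responses, not the drafts, are what the supervised phase trains on.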
Why it matters
CAI scales better than RLHF because it doesn't require thousands of human raters for each model iteration. It is also more transparent: anyone can read the constitution and see the principles behind the model's behavior. Anthropic's Claude family is the highest-profile CAI implementation, and other labs increasingly adopt similar approaches.
Frequently asked questions
What's IN the constitution?
Anthropic publishes its constitutions. They combine broad principles (be helpful, harmless, and honest), references to the UN Universal Declaration of Human Rights, and specific behavioral rules about what the model should not do.
CAI vs RLHF?
RLHF uses human raters to rank candidate responses; CAI uses an AI model to rank them against a written constitution. CAI scales better, while RLHF has historically produced higher-quality rankings on subtle cases.
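The difference in the labeling step can be sketched as follows. This is purely illustrative: `ai_judge` is a toy heuristic standing in for an AI labeler prompted with the constitution, and all names here are invented for the example.

```python
# Toy sketch of the RLAIF labeling step used by CAI: an AI "judge"
# picks the preferred response against a constitution, producing
# (chosen, rejected) pairs for preference-model training.
# In RLHF, a human rater performs this same step instead.

def ai_judge(prompt, response_a, response_b, constitution):
    """Return the index (0 or 1) of the preferred response.
    Toy heuristic: prefer the response that mentions more of the
    constitutional virtues -- a real judge is a prompted LLM."""
    def score(r):
        return sum(w in r.lower() for w in ("helpful", "honest", "harmless"))
    return 0 if score(response_a) >= score(response_b) else 1

def build_preference_pair(prompt, responses, constitution):
    """Produce one (chosen, rejected) training pair from two candidates."""
    winner = ai_judge(prompt, responses[0], responses[1], constitution)
    return responses[winner], responses[1 - winner]

chosen, rejected = build_preference_pair(
    "Explain phishing.",
    ["A helpful, honest overview of phishing and how to avoid it.",
     "Step-by-step instructions for running a phishing scam."],
    ["Choose the more helpful, honest, and harmless response."],
)
print(chosen)
```

Because the judge is a model rather than a person, this step can be run over millions of pairs per iteration, which is the scaling advantage noted above.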