Constitutional AI
Definition
Constitutional AI (CAI) is Anthropic's alignment technique in which a model is trained with AI feedback judged against a written "constitution" of principles, rather than with human preference rankings. It is the training method behind Claude.
What it means
Introduced by Bai et al. (2022). The process:

1. Train a base model with supervised fine-tuning (SFT).
2. Have the model critique its own outputs against a constitution (principles such as "be helpful, harmless, and honest").
3. Use those critiques to revise the outputs, and fine-tune on the revised responses.
4. Train a preference model on AI-ranked response pairs (RLAIF) and optimize the policy against it with reinforcement learning.

The result: Claude's tendency to refuse harmful requests, hedge uncertain claims, and explain its reasoning is shaped largely by CAI rather than by direct human feedback.
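The critique-revision loop in steps 2-3 can be sketched in a few lines. This is a toy illustration, not Anthropic's implementation: `generate`, `critique`, and `revise` are stand-in stubs for what would be calls to a language model, and the principles shown are invented examples.

```python
# A minimal, runnable sketch of the constitutional critique-revision loop.
# The three helpers below are toy stubs standing in for model calls.

CONSTITUTION = [
    "Choose the response that is most helpful, honest, and harmless.",
    "Avoid responses that could assist with dangerous activities.",
]

def generate(prompt: str) -> str:
    # Stub: a real system samples an initial draft from the SFT model.
    return f"Draft answer to: {prompt}"

def critique(response: str, principle: str) -> str:
    # Stub: a real system asks the model to critique `response`
    # against `principle` and returns the critique text.
    return f"Critique under principle: {principle}"

def revise(response: str, critique_text: str) -> str:
    # Stub: a real system asks the model to rewrite the response
    # so that it addresses the critique.
    return response + " [revised]"

def constitutional_revision(prompt: str, constitution=CONSTITUTION) -> str:
    """Run one critique-revision pass per principle (steps 2-3).
    The revised transcripts become fine-tuning data in the real pipeline."""
    response = generate(prompt)
    for principle in constitution:
        c = critique(response, principle)
        response = revise(response, c)
    return response

print(constitutional_revision("How do vaccines work?"))
```

Each pass through the loop applies one principle; the final revised responses, not the drafts, are what the supervised phase trains on.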
Why it matters
CAI scales better than RLHF because it doesn't require thousands of human raters for each model iteration. It is also more transparent: anyone can read the constitution and see the principles behind the model's behavior. Anthropic's Claude family is the highest-profile CAI implementation, and other labs increasingly adopt similar approaches.
Frequently asked questions
What's IN the constitution?
Anthropic publishes its constitutions. They combine broad principles (be helpful, harmless, and honest), references to the UN Universal Declaration of Human Rights, and specific behavioral rules about what the model should not do.
CAI vs RLHF?
RLHF uses human raters to rank candidate responses; CAI uses an AI model to rank them against a written constitution. CAI scales better, while RLHF has historically produced higher-quality rankings on subtle cases.
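The difference in the labeling step can be sketched as follows. This is purely illustrative: `ai_judge` is a toy heuristic standing in for an AI labeler prompted with the constitution, and all names here are invented for the example.

```python
# Toy sketch of the RLAIF labeling step used by CAI: an AI "judge"
# picks the preferred response against a constitution, producing
# (chosen, rejected) pairs for preference-model training.
# In RLHF, a human rater performs this same step instead.

def ai_judge(prompt, response_a, response_b, constitution):
    """Return the index (0 or 1) of the preferred response.
    Toy heuristic: prefer the response that mentions more of the
    constitutional virtues -- a real judge is a prompted LLM."""
    def score(r):
        return sum(w in r.lower() for w in ("helpful", "honest", "harmless"))
    return 0 if score(response_a) >= score(response_b) else 1

def build_preference_pair(prompt, responses, constitution):
    """Produce one (chosen, rejected) training pair from two candidates."""
    winner = ai_judge(prompt, responses[0], responses[1], constitution)
    return responses[winner], responses[1 - winner]

chosen, rejected = build_preference_pair(
    "Explain phishing.",
    ["A helpful, honest overview of phishing and how to avoid it.",
     "Step-by-step instructions for running a phishing scam."],
    ["Choose the more helpful, honest, and harmless response."],
)
print(chosen)
```

Because the judge is a model rather than a person, this step can be run over millions of pairs per iteration, which is the scaling advantage noted above.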