RLHF
Definition
RLHF (Reinforcement Learning from Human Feedback) is a post-training method in which humans rank candidate model outputs, a reward model learns those preferences, and the LLM is fine-tuned to produce outputs the reward model scores highly. A key reason ChatGPT was useful at launch.
What it means
Pretrained LLMs are great at predicting tokens but not at being helpful. RLHF closes the gap: collect prompts with candidate responses, have humans rank the responses, train a reward model to predict those rankings, then fine-tune the LLM with reinforcement learning (typically PPO) to maximize the reward model's score. DPO (Direct Preference Optimization), a 2023 simplification, achieves similar results without the explicit RL phase by optimizing on the preference pairs directly. RLAIF replaces human ranking with AI ranking for scale. Both objectives are sketched below.
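A minimal PyTorch sketch of the two objectives, assuming scalar rewards and per-response log-probabilities have already been computed elsewhere; the function names, the beta value, and the toy numbers are illustrative, not taken from any particular implementation.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen, r_rejected):
    """Bradley-Terry pairwise loss for reward-model training.

    Pushes the scalar reward of the human-preferred ("chosen") response
    above the rejected one: -log sigmoid(r_chosen - r_rejected).
    """
    return -F.logsigmoid(r_chosen - r_rejected).mean()

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss: the same preference objective, expressed directly via the
    policy's and a frozen reference model's log-probabilities, so no
    separate reward model or RL loop is needed."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Toy batch of 3 preference pairs (values made up for illustration).
print(reward_model_loss(torch.tensor([1.2, 0.4, 2.0]),
                        torch.tensor([0.3, 0.9, 1.1])))
print(dpo_loss(torch.tensor([-12.0, -15.0, -9.0]),   # policy log p(chosen)
               torch.tensor([-14.0, -13.0, -11.0]),  # policy log p(rejected)
               torch.tensor([-13.0, -14.0, -10.0]),  # reference log p(chosen)
               torch.tensor([-13.5, -13.5, -10.5]))) # reference log p(rejected)
```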
Why it matters
RLHF is why post-2022 LLMs feel useful instead of weird. Pre-RLHF GPT-3 was technically capable but bad at instruction-following; post-RLHF, the same base capability becomes a usable assistant. All major labs now invest heavily in this post-training phase, and the perceived helpfulness of assistants from labs such as Anthropic and OpenAI is often credited to the strength of their preference-tuning pipelines.
Frequently asked questions
RLHF vs RLAIF?
RLHF uses humans to rank responses; RLAIF uses an AI model to rank them. RLAIF scales better and is cheaper per label, but it requires a capable AI judge whose preferences you trust.
Can I do RLHF myself?
DPO (the simpler alternative) is feasible on consumer hardware for smaller models with libraries like Unsloth + TRL; see the sketch below. Full RLHF, with a separate reward model and an RL loop, is expensive but possible.
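As a rough sketch of the consumer-hardware path, here is what DPO with TRL can look like. The model and dataset names are examples from the TRL documentation (the dataset needs prompt/chosen/rejected columns), and the DPOTrainer keyword for the tokenizer has changed across TRL versions (tokenizer in older releases, processing_class in newer ones), so treat this as a starting point rather than a canonical recipe.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# A small instruct model keeps memory needs within consumer-GPU range.
model_name = "Qwen/Qwen2-0.5B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Preference dataset with "prompt", "chosen", "rejected" columns.
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

args = DPOConfig(output_dir="qwen2-dpo", per_device_train_batch_size=2)
trainer = DPOTrainer(model=model, args=args,
                     train_dataset=dataset, processing_class=tokenizer)
trainer.train()
```

Unsloth wraps the same Hugging Face training stack with memory optimizations, which is what makes single-GPU runs practical for models in this size range.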