Learning from Human Preferences
RLHF (Reinforcement Learning from Human Feedback) teaches models what humans actually prefer. Human raters compare AI responses, a reward model learns those preferences, and the LLM is then trained to maximize the learned reward.
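Concretely, the reward model is typically trained on pairs of responses where a rater marked one as preferred. Here is a minimal sketch of that pairwise (Bradley-Terry-style) loss, assuming PyTorch; the tiny `reward_model` head and the 768-dimensional response embeddings are illustrative assumptions, not any particular system's implementation.

```python
# Minimal sketch of the pairwise (Bradley-Terry-style) reward-model loss.
# Assumptions for illustration: PyTorch, a tiny MLP reward head, and
# precomputed 768-dim response embeddings standing in for a real encoder.
import torch
import torch.nn as nn
import torch.nn.functional as F

reward_model = nn.Sequential(nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, 1))

def preference_loss(chosen_emb: torch.Tensor, rejected_emb: torch.Tensor) -> torch.Tensor:
    """Push the reward of the human-preferred response above the rejected one."""
    r_chosen = reward_model(chosen_emb)      # scalar score per preferred response
    r_rejected = reward_model(rejected_emb)  # scalar score per rejected response
    # -log sigmoid(r_chosen - r_rejected): minimized when chosen outscores rejected
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy batch of 4 preference pairs (random embeddings for illustration)
loss = preference_loss(torch.randn(4, 768), torch.randn(4, 768))
loss.backward()  # gradients flow into the reward model's parameters
```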
Teaching a Student
Imagine a student who knows facts but lacks judgment. RLHF is like having teachers grade their responses: "This answer is helpful," "This one is too brief," "This one is dangerous." Over time, the student learns what makes a good response.
- 3 steps: SFT → RM → PPO
- ~100K preference pairs
- Training algorithm: PPO or DPO
- Goal: Helpful, Harmless, Honest (HHH)
Be the Human Rater!
Put yourself in the rater's seat. A user asks: "How do I pick a lock?" Which candidate response would you mark as most helpful, harmless, and honest?
PPO vs DPO
PPO
Proximal Policy Optimization
- Train a separate reward model
- RL training loop with a KL penalty (sketched below)
- More complex, requires RL expertise
- Used by ChatGPT, Claude
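One way to picture the PPO step's KL penalty is as reward shaping: the reward-model score is reduced by how far the policy's token log-probabilities drift from the frozen SFT reference. A minimal sketch, assuming PyTorch; `kl_coef` and the toy tensors are illustrative assumptions, and real setups differ in detail.

```python
# Sketch of the KL-penalized reward used during the PPO step. Assumptions:
# PyTorch, per-token log-probs from the policy and the frozen SFT reference,
# and an illustrative kl_coef.
import torch

def shaped_reward(rm_score: torch.Tensor,
                  policy_logprobs: torch.Tensor,
                  ref_logprobs: torch.Tensor,
                  kl_coef: float = 0.1) -> torch.Tensor:
    # Per-token KL estimate between the policy and the reference model
    kl = policy_logprobs - ref_logprobs
    # Reward-model score minus a penalty for drifting from the SFT model
    return rm_score - kl_coef * kl.sum(dim=-1)

# Toy usage: a batch of 2 responses, 5 tokens each
print(shaped_reward(torch.tensor([0.8, -0.2]), torch.randn(2, 5), torch.randn(2, 5)))
```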
DPO
Direct Preference Optimization
- No separate reward model needed
- Single supervised learning step (see the sketch below)
- Simpler, more stable
- Increasingly popular (2023+)
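For contrast, DPO folds the preference signal directly into a supervised loss on the policy and a frozen reference model, with no separate reward model. A minimal sketch, assuming PyTorch and summed per-sequence log-probabilities; `beta = 0.1` and the toy numbers are illustrative.

```python
# Sketch of the DPO objective. Assumptions: PyTorch, summed per-sequence
# log-probs from the policy and a frozen reference model, illustrative beta.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_lp: torch.Tensor, policy_rejected_lp: torch.Tensor,
             ref_chosen_lp: torch.Tensor, ref_rejected_lp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # How much (log-)probability each response gained relative to the reference
    chosen_gain = policy_chosen_lp - ref_chosen_lp
    rejected_gain = policy_rejected_lp - ref_rejected_lp
    # -log sigmoid(beta * margin): favors the chosen response, no reward model needed
    return -F.logsigmoid(beta * (chosen_gain - rejected_gain)).mean()

# Toy usage: made-up sequence log-probs for a batch of 3 preference pairs
loss = dpo_loss(torch.tensor([-12.0, -8.5, -10.0]), torch.tensor([-14.0, -9.0, -9.5]),
                torch.tensor([-13.0, -8.0, -10.5]), torch.tensor([-13.5, -8.5, -10.0]))
print(loss)
```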
Key Takeaways
- RLHF aligns models with human preferences
- Human raters compare response pairs
- PPO uses reward model + RL; DPO is simpler
- Goal: Helpful, Harmless, Honest (HHH)