Phase 5: Alignment · ~10 min · Advanced

🏆 RLHF & DPO

Learning from Human Preferences

Reward modeling, PPO, DPO, Constitutional AI, and aligning models with human values.

Reward Models · PPO · DPO · Constitutional AI

Learning from Human Preferences

RLHF (Reinforcement Learning from Human Feedback) teaches models what humans actually prefer. Humans rate AI responses, a reward model learns these preferences, and the LLM is trained to maximize that reward.
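To make the reward-modeling step concrete, the snippet below is a minimal sketch (not any specific library's implementation) of the pairwise loss commonly used to train reward models: the score of the preferred response is pushed above the rejected one. Here `reward_model` is assumed to be any module that maps a prompt/response pair to a scalar score.

```python
import torch.nn.functional as F

def reward_model_loss(reward_model, prompt, chosen, rejected):
    """Pairwise (Bradley-Terry style) loss for reward-model training:
    minimized when the human-preferred response scores higher."""
    r_chosen = reward_model(prompt, chosen)      # scalar score per example
    r_rejected = reward_model(prompt, rejected)  # scalar score per example
    # -log sigmoid(r_chosen - r_rejected) shrinks as the margin grows
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```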

Teaching a Student

Imagine a student who knows facts but lacks judgment. RLHF is like having teachers grade the student's responses: "This answer is helpful," "This one is too brief," "This one is dangerous." Over time, the student learns what makes a good response.
  • 3 steps: SFT → RM → PPO
  • ~100K preference pairs
  • Training algorithm: PPO or DPO
  • Goal: Helpful, Harmless, Honest (HHH)

Be the Human Rater!

Interactive exercise: a user asks "How do I pick a lock?" and you, as the human rater, choose which of the model's responses you prefer.

PPO vs DPO

PPO

Proximal Policy Optimization

  • Train a separate reward model
  • RL training loop with a KL penalty (sketched below)
  • More complex, requires RL expertise
  • Used by ChatGPT, Claude
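The KL penalty above is commonly applied by subtracting a scaled log-probability gap between the current policy and the frozen reference (SFT) model from the per-token reward, so the policy cannot drift far from where the reward model is reliable. The function below is a rough sketch under that per-token approximation; the name and tensor shapes are illustrative.

```python
def kl_shaped_rewards(rm_scores, policy_logprobs, ref_logprobs, beta=0.1):
    """Shape per-token rewards for the PPO step (rough sketch).
    rm_scores: (batch,) reward-model score for each full response.
    *_logprobs: (batch, seq_len) per-token log-probs under each model."""
    # Penalize every token where the policy diverges from the reference model.
    rewards = -beta * (policy_logprobs - ref_logprobs)
    # Credit the reward model's sequence-level score at the final token.
    rewards[:, -1] = rewards[:, -1] + rm_scores
    return rewards
```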
DPO

Direct Preference Optimization

  • No separate reward model needed
  • Single supervised learning step on preference pairs (loss sketched below)
  • Simpler, more stable
  • Increasingly popular since 2023
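DPO removes the explicit reward model by treating the scaled log-probability ratio between the policy and the frozen reference model as an implicit reward, then applying the same pairwise objective directly to the preference pairs. A minimal sketch, assuming the summed log-probabilities of each full response have already been computed:

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Direct Preference Optimization loss (minimal sketch).
    Each argument: (batch,) summed log-probability of a full response."""
    # Implicit rewards: how much the policy up-weights each response
    # relative to the frozen reference model.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Same pairwise objective as reward-model training, applied to the policy.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```

Because this is just a supervised loss over preference pairs, training needs no sampling loop or value function, which is where the "simpler, more stable" claim comes from.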

Key Takeaways
  • RLHF aligns models with human preferences
  • Human raters compare response pairs
  • PPO uses reward model + RL; DPO is simpler
  • Goal: Helpful, Harmless, Honest (HHH)