Phase 6: Evaluation · ~8 min · Intermediate

📊 Benchmarking & Evals


MMLU, HumanEval, MT-Bench, safety evaluations, and comprehensive model assessment.

MMLU · HumanEval · MT-Bench · Red Teaming

Measuring Intelligence

How do you know if your model is actually good? Benchmarks provide standardized tests to compare models objectively, covering knowledge, reasoning, coding, math, and safety.

Standardized Testing for AI

Just as students take standardized exams like the SAT and GRE, AI models take standardized tests. MMLU is a comprehensive exam covering 57 subjects; HumanEval tests coding ability. These scores let us compare models on a common footing.
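Under the hood, most knowledge benchmarks reduce to multiple-choice accuracy. Here is a minimal sketch of that scoring loop, assuming a hypothetical ask_model callable and question format rather than any official evaluation harness:

```python
# Minimal sketch of MMLU-style scoring: the model picks one lettered option
# per question and the benchmark score is plain accuracy. `ask_model` and the
# question format are illustrative assumptions, not the official harness.
from typing import Callable

def multiple_choice_accuracy(questions: list[dict], ask_model: Callable[[str], str]) -> float:
    """Each question dict holds 'prompt' (stem plus lettered choices) and 'answer' ('A'-'D')."""
    correct = 0
    for q in questions:
        prediction = ask_model(q["prompt"]).strip().upper()[:1]  # keep the first letter only
        if prediction == q["answer"]:
            correct += 1
    return correct / len(questions)

# Toy usage with a dummy model that always answers "B":
toy_set = [{"prompt": "2 + 2 = ?\nA. 3\nB. 4\nC. 5\nD. 22\nAnswer:", "answer": "B"}]
print(multiple_choice_accuracy(toy_set, ask_model=lambda p: "B"))  # 1.0
```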

Key Benchmarks

Scores below are example results for a single strong model, shown for scale.

  • MMLU (Knowledge), 86.4%: 57 subjects from STEM to humanities
  • HumanEval (Coding), 67%: Python function completion, reported as pass@k (see the sketch after this list)
  • GSM8K (Math), 92%: grade school math word problems
  • HellaSwag (Reasoning), 95.3%: commonsense inference
  • TruthfulQA (Safety), 59%: truthfulness on tricky questions
  • MT-Bench (Chat), 9/10: multi-turn conversation quality, scored 1-10 by an LLM judge
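HumanEval-style coding scores are usually reported as pass@k: generate n candidate solutions per problem, run the unit tests, and estimate the probability that at least one of k sampled candidates passes. A small sketch of the standard unbiased estimator (the formula comes from the HumanEval paper; everything around it, such as how the samples are produced, is assumed):

```python
# Unbiased pass@k estimator: out of n generated candidates for a problem,
# c passed the unit tests; estimate P(at least one of k sampled candidates passes).
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:  # every size-k sample must contain a passing candidate
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples per problem, 20 of which pass the tests.
print(pass_at_k(200, 20, 1))              # 0.10
print(round(pass_at_k(200, 20, 10), 3))   # chance that a batch of 10 contains a pass
```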

Safety Evaluations

🔴 Red Teaming

Human experts try to elicit harmful outputs through adversarial prompts, jailbreaks, and edge cases.

🛡️ Automated Attacks

Tools like GCG and AutoDAN generate adversarial prompts at scale to test robustness.

⚖️ Bias Testing

Measure disparities in responses across demographics, topics, and perspectives (a minimal sketch follows these items).

🔍 Capability Elicitation

Probe for dangerous capabilities: bio/cyber weapons, deception, self-preservation.
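Bias testing often starts with simple template swaps: hold the prompt fixed, vary only the demographic term, and compare a response statistic across groups. A minimal sketch, assuming a hypothetical ask_model callable and a crude keyword heuristic for refusals (real evaluations use trained classifiers and far larger prompt sets):

```python
# Minimal sketch of template-based bias testing. `ask_model` and the refusal
# heuristic are illustrative assumptions, not a specific tool's API.
from typing import Callable

REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm sorry")

def refusal_rate(prompts: list[str], ask_model: Callable[[str], str]) -> float:
    """Fraction of prompts whose responses look like refusals."""
    responses = [ask_model(p).lower() for p in prompts]
    return sum(any(m in r for m in REFUSAL_MARKERS) for r in responses) / len(responses)

def demographic_disparity(template: str, groups: list[str],
                          ask_model: Callable[[str], str]) -> dict[str, float]:
    """Per-group refusal rate on the same template; large gaps flag potential bias."""
    return {g: refusal_rate([template.format(group=g)], ask_model) for g in groups}

# Example: the only thing that changes between calls is the demographic term.
template = "Give career advice to a {group} student interested in engineering."
print(demographic_disparity(template, ["first-generation", "international"],
                            ask_model=lambda p: "Sure, here are some options..."))
```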

Key Takeaways
  • Benchmarks enable objective model comparison
  • MMLU, HumanEval, and MT-Bench are among the most widely cited
  • Safety evals require both automated and human testing
  • Benchmark contamination is a growing concern (see the overlap-check sketch below)
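A basic contamination check looks for verbatim n-gram overlap between benchmark items and the training corpus; several model reports describe 13-gram variants of this idea. A toy sketch (whitespace tokenization and brute-force scanning are simplifications for illustration):

```python
# Toy contamination check: flag a benchmark item if any of its word n-grams
# appears verbatim in a training document.

def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(benchmark_item: str, training_docs: list[str], n: int = 13) -> bool:
    """True if an n-gram of the benchmark item appears verbatim in any training doc."""
    item_grams = ngrams(benchmark_item, n)
    return any(item_grams & ngrams(doc, n) for doc in training_docs)

# Example with a short 5-gram window for illustration:
doc = "the quick brown fox jumps over the lazy dog"
item = "which animal jumps over the lazy dog in the proverb"
print(is_contaminated(item, [doc], n=5))  # True: "jumps over the lazy dog" overlaps
```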