Measuring Intelligence
How do you know if your model is actually good? Benchmarks provide standardized tests for comparing models objectively, covering knowledge, reasoning, coding, math, and safety.
Standardized Testing for AI
Key Benchmarks
- MMLU: 57 subjects from STEM to the humanities
- HumanEval: Python function completion, scored with pass@k (see the sketch after this list)
- GSM8K: grade school math word problems
- HellaSwag: commonsense inference
- TruthfulQA: truthfulness on tricky questions
- MT-Bench: multi-turn conversation quality
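Benchmark scores reduce to simple, reproducible metrics. HumanEval's pass@k is the probability that at least one of k sampled completions passes a problem's unit tests; the unbiased estimator below is the one given in the original HumanEval paper (Chen et al., 2021).

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper:
    1 - C(n - c, k) / C(n, k), where n completions were sampled
    per problem and c of them passed the unit tests."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a passing sample
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Example: 200 samples on one problem, 42 passing -> estimate pass@10.
print(round(pass_at_k(n=200, c=42, k=10), 3))
```

Averaging this estimate over all 164 HumanEval problems gives the reported score.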
Safety Evaluations
🔴 Red Teaming
Human experts try to elicit harmful outputs through adversarial prompts, jailbreaks, and edge cases.
🛡️ Automated Attacks
Tools like GCG and AutoDAN generate adversarial prompts at scale to test robustness.
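For intuition, here is a toy suffix-search loop in Python. This is not GCG itself, which guides each mutation with token gradients; the sketch mutates blindly and keeps whatever improves a hypothetical `attack_score` (for example, the model's log-probability of opening its reply with an affirmative phrase).

```python
import random
import string

def hill_climb_suffix(prompt: str, attack_score, n_steps: int = 500,
                      suffix_len: int = 20) -> str:
    """Toy hill-climbing adversarial suffix search.
    attack_score(text) -> float is a stand-in for a scalar measuring
    attack success. Real tools such as GCG replace the blind mutation
    below with gradient-guided token swaps."""
    charset = string.ascii_letters + string.digits + string.punctuation
    suffix = list(random.choices(charset, k=suffix_len))
    best = attack_score(prompt + " " + "".join(suffix))
    for _ in range(n_steps):
        candidate = suffix.copy()
        candidate[random.randrange(suffix_len)] = random.choice(charset)
        score = attack_score(prompt + " " + "".join(candidate))
        if score > best:  # keep mutations that strengthen the attack
            suffix, best = candidate, score
    return "".join(suffix)
```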
⚖️ Bias Testing
Measure disparities in responses across demographics, topics, and perspectives.
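A common pattern is counterfactual probing: hold the prompt fixed, swap only a demographic marker, and compare a response statistic across groups. A minimal sketch, assuming hypothetical `query_model` and `sentiment` helpers:

```python
from collections import defaultdict

# Template-based counterfactual probe; only the name varies.
TEMPLATE = "Write a short performance review for {name}, a software engineer."
GROUPS = {"group_a": ["Emily", "Greg"], "group_b": ["Lakisha", "Jamal"]}

def mean_sentiment_by_group(query_model, sentiment) -> dict[str, float]:
    """query_model(prompt) -> str and sentiment(text) -> float are
    stand-ins for a model API and a sentiment scorer. A large gap
    between group means flags potential bias for manual review."""
    scores = defaultdict(list)
    for group, names in GROUPS.items():
        for name in names:
            response = query_model(TEMPLATE.format(name=name))
            scores[group].append(sentiment(response))
    return {g: sum(v) / len(v) for g, v in scores.items()}
```

The name pairs follow the classic resume-audit design; real test suites use many templates and several statistics (refusal rate, toxicity, sentiment) rather than a single probe.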
🔍 Capability Elicitation
Probe for dangerous capabilities: biological or cyber weapons uplift, deception, self-preservation behavior.
Key Takeaways
- Benchmarks enable objective model comparison
- MMLU, HumanEval, and MT-Bench are among the most widely cited
- Safety evals require both automated and human testing
- Benchmark contamination is a growing concern (a simple screening check is sketched below)
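One simple screen for contamination is n-gram overlap between a benchmark's test set and the training corpus; GPT-3's contamination analysis, for instance, used 13-grams. A rough sketch:

```python
def ngram_overlap(train_text: str, test_text: str, n: int = 13) -> float:
    """Fraction of the test set's n-grams that also appear in the
    training corpus. Treat this as a screening heuristic, not proof
    of contamination."""
    def ngrams(text: str) -> set[tuple[str, ...]]:
        toks = text.lower().split()
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}
    test_grams = ngrams(test_text)
    if not test_grams:
        return 0.0
    return len(test_grams & ngrams(train_text)) / len(test_grams)
```

High overlap does not prove a model memorized answers, but it flags benchmarks whose scores deserve extra skepticism.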