The Science of Getting Bigger
One of the most surprising discoveries in AI is that language model performance follows predictable scaling laws: as compute, parameters, and training data grow, test loss falls along a smooth power-law curve. Because the improvement from doubling compute can be estimated in advance, researchers can plan massive training runs with confidence.
Like Moore's Law for AI
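As a rough illustration, the sketch below evaluates a generic power-law loss curve of the form L(C) = l_inf + a * C^(-alpha). The functional form mirrors what scaling-law studies report, but the coefficients and the reference compute budget here are made-up values, chosen only to show how each doubling of compute produces a predictable drop in loss.

```python
# Illustrative power-law scaling curve: pre-training loss as a function of
# total training compute. The functional form matches what scaling-law papers
# report; every coefficient below is a made-up value for illustration only.

def predicted_loss(compute_flops: float,
                   l_inf: float = 1.7,   # assumed irreducible loss
                   a: float = 11.0,      # assumed scale coefficient
                   alpha: float = 0.05   # assumed compute exponent
                   ) -> float:
    """Predict loss as L(C) = l_inf + a * C**(-alpha)."""
    return l_inf + a * compute_flops ** (-alpha)

if __name__ == "__main__":
    base_budget = 1e21  # arbitrary reference budget in FLOPs
    for multiplier in (1, 2, 4, 8):
        c = base_budget * multiplier
        print(f"{multiplier}x compute -> predicted loss {predicted_loss(c):.3f}")
```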
The Chinchilla Insight
The 2022 Chinchilla paper showed that many large models were undertrained: they had too many parameters for the amount of data they saw. For a fixed compute budget, optimal performance comes from scaling parameters AND training data together, with a rule of thumb of roughly 20 training tokens per parameter.
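A back-of-the-envelope sketch of that rule of thumb: the 6 * N * D estimate of training FLOPs is a standard approximation, and the example model sizes are arbitrary, but note that the 70B-parameter row lands near Chinchilla's own 1.4T-token budget.

```python
# Back-of-the-envelope check of the ~20 tokens-per-parameter rule of thumb.
# Training compute is estimated with the common ~6 FLOPs per parameter per
# token approximation; the model sizes below are example values.

TOKENS_PER_PARAM = 20  # Chinchilla-style rule of thumb

def chinchilla_optimal_tokens(n_params: float) -> float:
    """Training tokens suggested by the ~20 tokens/parameter heuristic."""
    return TOKENS_PER_PARAM * n_params

def approx_train_flops(n_params: float, n_tokens: float) -> float:
    """Rough training compute: ~6 FLOPs per parameter per token."""
    return 6 * n_params * n_tokens

if __name__ == "__main__":
    for n_params in (1e9, 10e9, 70e9):  # 1B, 10B, 70B parameters
        tokens = chinchilla_optimal_tokens(n_params)
        flops = approx_train_flops(n_params, tokens)
        print(f"{n_params / 1e9:>5.0f}B params -> {tokens / 1e12:.2f}T tokens, "
              f"~{flops:.2e} FLOPs")
```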
Emergent Abilities
Beyond certain scale thresholds, models appear to gain new capabilities that were not present at smaller scales. These "emergent abilities" include multi-step reasoning, chain-of-thought prompting, and in-context learning.
- Scaling laws allow prediction of model performance
- Chinchilla: Use ~20 tokens per parameter for optimal training
- Emergent abilities appear at scale thresholds
- Compute-optimal training balances parameters and data (see the sketch below)
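The same two approximations can be combined to split a fixed compute budget: substituting D ≈ 20 * N into C ≈ 6 * N * D gives N ≈ sqrt(C / 120) and D = 20 * N. The budgets in the sketch below are arbitrary example values.

```python
import math

# Sketch: split a fixed compute budget C between parameters (N) and training
# tokens (D), assuming C ~= 6 * N * D FLOPs and the ~20 tokens-per-parameter
# rule of thumb (D ~= 20 * N). Solving the two together gives N ~= sqrt(C / 120).

TOKENS_PER_PARAM = 20

def compute_optimal_split(compute_flops: float) -> tuple[float, float]:
    """Return (n_params, n_tokens) that roughly exhaust the compute budget."""
    n_params = math.sqrt(compute_flops / (6 * TOKENS_PER_PARAM))
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens

if __name__ == "__main__":
    for budget in (1e21, 1e22, 1e23):  # example compute budgets in FLOPs
        n, d = compute_optimal_split(budget)
        print(f"{budget:.0e} FLOPs -> ~{n / 1e9:.1f}B params, ~{d / 1e12:.2f}T tokens")
```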