Phase 3: Architecture · ~6 min · Advanced

📈 Scaling Laws


Chinchilla scaling laws, compute-optimal training, and understanding the power of scale.

Chinchilla Laws · Compute Optimal · Emergent Abilities · Model Sizing

The Science of Getting Bigger

One of the most surprising discoveries in AI is that language model performance follows predictable scaling laws: double the compute, and loss improves by a predictable amount. This lets researchers plan massive training runs with confidence, because results at small scale can be extrapolated to much larger models.
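
To make "predictable" concrete, here is a minimal sketch of the parametric loss fit used in the Chinchilla paper, L(N, D) = E + A/N^alpha + B/D^beta, where N is parameter count and D is training tokens. The constants are the approximate fitted values reported by Hoffmann et al. (2022); treat the exact numbers as illustrative, since what matters is the shape of the curve.

# Approximate Chinchilla parametric loss: L(N, D) = E + A / N**alpha + B / D**beta
# Constants are roughly the fitted values reported in Hoffmann et al. (2022); illustrative only.
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28

def chinchilla_loss(n_params: float, n_tokens: float) -> float:
    """Predicted training loss for a model with n_params parameters trained on n_tokens tokens."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA

# Doubling compute (roughly, scaling up both N and D) lowers the predicted loss by a
# predictable amount -- this is the sense in which scaling laws let you plan ahead.
print(chinchilla_loss(70e9, 1.4e12))   # ~70B params, ~1.4T tokens
print(chinchilla_loss(140e9, 2.8e12))  # double both: a smaller, predictable loss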

Like Moore's Law for AI

Just as Moore's Law predicted transistor density doubling every 2 years, scaling laws predict how model quality improves with more data, parameters, and compute. It's remarkably consistent across orders of magnitude.

The Chinchilla Insight

The 2022 Chinchilla paper (Hoffmann et al.) showed that many large models were undertrained: for a fixed compute budget, you should scale parameters and training data together. The rule of thumb is roughly 20 training tokens per parameter.

Optimal Tokens ≈ 20 × Parameters
  • 7B parameters → ~140B tokens
  • 70B parameters → ~1.4T tokens
  • 175B parameters → ~3.5T tokens
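
A minimal sketch of how this rule of thumb turns a compute budget into a model size and token count. It assumes the standard rough estimate that training compute C ≈ 6 × N × D FLOPs together with the ~20 tokens-per-parameter ratio; the function names and the example budget are illustrative.

# Compute-optimal sizing under two assumptions:
#   1. Training compute C ~= 6 * N * D FLOPs (standard rough estimate)
#   2. Chinchilla rule of thumb: D ~= 20 * N tokens
TOKENS_PER_PARAM = 20

def optimal_tokens(n_params: float) -> float:
    """Tokens needed to train a model of a given size compute-optimally."""
    return TOKENS_PER_PARAM * n_params

def compute_optimal(c_flops: float) -> tuple[float, float]:
    """Return (parameters, tokens) that spend c_flops at the ~20:1 token/parameter ratio."""
    n_params = (c_flops / (6 * TOKENS_PER_PARAM)) ** 0.5
    return n_params, TOKENS_PER_PARAM * n_params

# Reproduce the table above.
for n in (7e9, 70e9, 175e9):
    print(f"{n/1e9:.0f}B params -> {optimal_tokens(n)/1e9:.0f}B tokens")

# Or start from a compute budget (illustrative value) and back out both N and D.
n, d = compute_optimal(6e23)
print(f"6e23 FLOPs -> {n/1e9:.0f}B params, {d/1e12:.1f}T tokens")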

Emergent Abilities

At certain scale thresholds, models suddenly gain new capabilities that weren't present at smaller scales. These "emergent abilities" include multi-step reasoning, chain-of-thought, and in-context learning.

  • ~1B: Basic language understanding, simple Q&A
  • ~10B: Few-shot learning, basic reasoning
  • ~100B: Chain-of-thought, complex reasoning, code generation
  • ~1T: Multi-modal, agentic capabilities, expert-level tasks
✅ Key Takeaways
  • Scaling laws allow prediction of model performance
  • Chinchilla: Use ~20 tokens per parameter for optimal training
  • Emergent abilities appear at scale thresholds
  • Compute-optimal training balances parameters and data