Phase 3: Architecture · ~6 min · Advanced

📈 Scaling Laws


Chinchilla scaling laws, compute-optimal training, and understanding the power of scale.

Chinchilla Laws · Compute Optimal · Emergent Abilities · Model Sizing

The Science of Getting Bigger

One of the most surprising discoveries in AI is that language model performance follows predictable scaling laws: double the compute, and loss improves by a predictable amount. This lets researchers plan massive training runs with confidence, because results at small scale can be extrapolated to much larger models.
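
To make "predictable" concrete, here is a minimal sketch of the parametric loss fit used in the Chinchilla paper, L(N, D) = E + A/N^alpha + B/D^beta, where N is parameter count and D is training tokens. The constants are the approximate fitted values reported by Hoffmann et al. (2022); treat the exact numbers as illustrative, since what matters is the shape of the curve.

# Approximate Chinchilla parametric loss: L(N, D) = E + A / N**alpha + B / D**beta
# Constants are roughly the fitted values reported in Hoffmann et al. (2022); illustrative only.
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28

def chinchilla_loss(n_params: float, n_tokens: float) -> float:
    """Predicted training loss for a model with n_params parameters trained on n_tokens tokens."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA

# Doubling compute (roughly, scaling up both N and D) lowers the predicted loss by a
# predictable amount -- this is the sense in which scaling laws let you plan ahead.
print(chinchilla_loss(70e9, 1.4e12))   # ~70B params, ~1.4T tokens
print(chinchilla_loss(140e9, 2.8e12))  # double both: a smaller, predictable loss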

Like Moore's Law for AI

Just as Moore's Law predicted transistor density doubling every 2 years, scaling laws predict how model quality improves with more data, parameters, and compute. It's remarkably consistent across orders of magnitude.

The Chinchilla Insight

The 2022 Chinchilla paper (Hoffmann et al.) showed that many large models were undertrained: for a fixed compute budget, you should scale parameters and training data together. The rule of thumb is roughly 20 training tokens per parameter.

Optimal Tokens ≈ 20 × Parameters
  • 7B parameters → ~140B tokens
  • 70B parameters → ~1.4T tokens
  • 175B parameters → ~3.5T tokens
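
A minimal sketch of how this rule of thumb turns a compute budget into a model size and token count. It assumes the standard rough estimate that training compute C ≈ 6 × N × D FLOPs together with the ~20 tokens-per-parameter ratio; the function names and the example budget are illustrative.

# Compute-optimal sizing under two assumptions:
#   1. Training compute C ~= 6 * N * D FLOPs (standard rough estimate)
#   2. Chinchilla rule of thumb: D ~= 20 * N tokens
TOKENS_PER_PARAM = 20

def optimal_tokens(n_params: float) -> float:
    """Tokens needed to train a model of a given size compute-optimally."""
    return TOKENS_PER_PARAM * n_params

def compute_optimal(c_flops: float) -> tuple[float, float]:
    """Return (parameters, tokens) that spend c_flops at the ~20:1 token/parameter ratio."""
    n_params = (c_flops / (6 * TOKENS_PER_PARAM)) ** 0.5
    return n_params, TOKENS_PER_PARAM * n_params

# Reproduce the table above.
for n in (7e9, 70e9, 175e9):
    print(f"{n/1e9:.0f}B params -> {optimal_tokens(n)/1e9:.0f}B tokens")

# Or start from a compute budget (illustrative value) and back out both N and D.
n, d = compute_optimal(6e23)
print(f"6e23 FLOPs -> {n/1e9:.0f}B params, {d/1e12:.1f}T tokens")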

Emergent Abilities

At certain scale thresholds, models suddenly gain new capabilities that weren't present at smaller scales. These "emergent abilities" include multi-step reasoning, chain-of-thought, and in-context learning.

  • ~1B: Basic language understanding, simple Q&A
  • ~10B: Few-shot learning, basic reasoning
  • ~100B: Chain-of-thought, complex reasoning, code generation
  • ~1T: Multi-modal, agentic capabilities, expert-level tasks
✅ Key Takeaways
  • Scaling laws allow prediction of model performance
  • Chinchilla: Use ~20 tokens per parameter for optimal training
  • Emergent abilities appear at scale thresholds
  • Compute-optimal training balances parameters and data