Learning to Predict
Pre-training is where the magic happens. The model learns one simple task: predict the next token. By doing this trillions of times across diverse text, it learns language, facts, reasoning, and more.
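To make "predict the next token" concrete, here is a minimal sketch of the training objective, assuming a PyTorch-style model that maps a batch of token IDs to per-position vocabulary logits (the interface and shapes are illustrative, not a specific implementation):

```python
import torch.nn.functional as F

def next_token_loss(model, token_ids):
    """Average cross-entropy of predicting token t+1 from tokens up to t.

    token_ids: LongTensor of shape (batch, seq_len).
    model(token_ids) is assumed to return logits of shape
    (batch, seq_len, vocab_size).
    """
    logits = model(token_ids)                 # (batch, seq, vocab)
    pred = logits[:, :-1, :]                  # predictions for positions 0..seq-2
    target = token_ids[:, 1:]                 # the actual "next" tokens
    return F.cross_entropy(
        pred.reshape(-1, pred.size(-1)),      # (batch * (seq-1), vocab)
        target.reshape(-1),                   # (batch * (seq-1),)
    )
```

Every training example is just a stretch of ordinary text; the "labels" are the same text shifted by one position.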
The World's Longest Fill-in-the-Blank Test
Imagine taking a test with trillions of questions, each asking "What comes next?" At first, you guess randomly. But after billions of examples, patterns emerge. You learn that "The sky is..." is usually followed by "blue" not "banana."
A typical large-scale pre-training run, at a glance:

- Training tokens: ~2T
- Training time: ~3 months
- Compute cost: $100M+
- GPU utilization: 24/7
Play the Prediction Game!
[Interactive quiz: guess the next token for prompts like "The capital of France is ___"]
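If you want to play the same game against a real model, here is a hedged sketch using the Hugging Face transformers library and the public gpt2 checkpoint (any small causal language model would work the same way):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits      # (1, seq_len, vocab_size)

next_token_logits = logits[0, -1]        # scores for whatever comes next
top = torch.topk(next_token_logits, k=5)

for score, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode([int(token_id)])!r}  (logit {float(score):.2f})")
```

A well pre-trained model should rank " Paris" at or near the top of this list, which is exactly the pattern the game is built around.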
The Loss Curve
As training progresses, the loss drops steeply at first and then flattens: the model quickly picks up common words and grammar, while later gains come from rarer facts and harder patterns.

[Interactive chart: training loss decreasing over training steps]
Key Takeaways
- Pre-training is next-token prediction at massive scale
- Loss decreases as the model learns patterns
- AdamW with a cosine learning-rate schedule is standard (see the training-loop sketch below)
- Checkpointing every few hours protects against failures
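As a rough illustration of the last two takeaways, here is a hedged sketch of a training loop with AdamW, a cosine learning-rate schedule, and periodic checkpointing, written in PyTorch; the model, data loader, learning rate, and checkpoint interval are placeholders rather than the settings of any particular production run:

```python
import time

import torch
import torch.nn.functional as F
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

def train(model, data_loader, total_steps, checkpoint_every_secs=4 * 3600):
    """Sketch of a pre-training loop: AdamW + cosine LR decay + periodic checkpoints."""
    optimizer = AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
    scheduler = CosineAnnealingLR(optimizer, T_max=total_steps)
    last_checkpoint = time.time()

    for step, token_ids in enumerate(data_loader):
        # Same next-token objective as the sketch near the top of this section.
        logits = model(token_ids)                          # (batch, seq, vocab)
        loss = F.cross_entropy(
            logits[:, :-1, :].reshape(-1, logits.size(-1)),
            token_ids[:, 1:].reshape(-1),
        )

        loss.backward()
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()

        if step % 100 == 0:
            print(f"step {step:>8}  loss {loss.item():.3f}  "
                  f"lr {scheduler.get_last_lr()[0]:.2e}")

        # Save a checkpoint every few hours so a crash costs hours, not weeks.
        if time.time() - last_checkpoint > checkpoint_every_secs:
            torch.save(
                {
                    "step": step,
                    "model": model.state_dict(),
                    "optimizer": optimizer.state_dict(),
                    "scheduler": scheduler.state_dict(),
                },
                f"checkpoint_{step:08d}.pt",
            )
            last_checkpoint = time.time()

        if step + 1 >= total_steps:
            break
```

Real runs layer learning-rate warmup, gradient clipping, mixed precision, and data/model parallelism on top of this skeleton, but the core loop looks much like this.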