Phase 4: Training · ~9 min · Intermediate

📖 Pre-training Process

Learning to Predict

Next-token prediction, loss functions, optimizers, learning rate schedules, and checkpointing.

Causal LM · AdamW Optimizer · Cosine Schedule · Gradient Checkpointing

Pre-training is where the magic happens. The model learns one simple task: predict the next token. By doing this trillions of times across diverse text, it learns language, facts, reasoning, and more.
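
To make the objective concrete, here is a minimal sketch of next-token prediction with a cross-entropy loss. The toy embedding-plus-linear "model", the vocabulary size, and the sequence length are illustrative assumptions, not anyone's real training code:

```python
import torch
import torch.nn.functional as F

vocab_size, seq_len, d_model = 1000, 8, 32

# A hypothetical toy "model": embed tokens, project back to vocabulary logits.
embed = torch.nn.Embedding(vocab_size, d_model)
lm_head = torch.nn.Linear(d_model, vocab_size)

tokens = torch.randint(0, vocab_size, (1, seq_len))  # one training sequence
logits = lm_head(embed(tokens))                      # (1, seq_len, vocab_size)

# Next-token prediction: the logits at position t are scored against token t+1,
# so shift both by one position and compute cross-entropy.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),
    tokens[:, 1:].reshape(-1),
)
print(loss.item())  # an untrained model starts near ln(vocab_size) ≈ 6.9
```

Training a real model is nothing more than repeating this computation on trillions of tokens and nudging the parameters to push the loss down each time.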

The World's Longest Fill-in-the-Blank Test

Imagine taking a test with trillions of questions, each asking "What comes next?" At first, you guess randomly. But after billions of examples, patterns emerge. You learn that "The sky is..." is usually followed by "blue" not "banana."
  • Training tokens: ~2T
  • Training time: ~3 months
  • Compute cost: $100M+
  • GPU utilization: 24/7

Play the Prediction Game!

[Interactive quiz: guess the next token for prompts such as "The capital of France is ___".]
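
Under the hood, "what comes next" is a probability distribution over the whole vocabulary: the model emits a score (logit) for every token, and a softmax turns those scores into probabilities. The logits below are made-up numbers purely for illustration:

```python
import math

# Made-up logits for a few candidate next tokens after "The capital of France is".
logits = {"Paris": 9.1, "Lyon": 4.3, "blue": 0.2, "banana": -3.5}

# Softmax: exponentiate and normalize so the scores sum to 1.
total = sum(math.exp(v) for v in logits.values())
probs = {token: math.exp(v) / total for token, v in logits.items()}

for token, p in sorted(probs.items(), key=lambda kv: -kv[1]):
    print(f"{token:>8}: {p:.4f}")
# A well-trained model concentrates nearly all of the probability on "Paris".
```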

The Loss Curve

[Interactive chart: the training loss falling as the model sees more data.]
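
The interactive chart can't be reproduced in text, so here is a minimal training-loop sketch in the same spirit: a toy causal LM trained with AdamW and a cosine learning-rate schedule, printing a loss that falls as it fits its tiny fixed batch. The model size, learning rate, weight decay, and step count are arbitrary illustrative choices, not a real pre-training recipe:

```python
# Toy training loop: AdamW + cosine LR schedule on a tiny next-token task.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, d_model, seq_len = 100, 64, 16

model = torch.nn.Sequential(
    torch.nn.Embedding(vocab_size, d_model),
    torch.nn.Linear(d_model, vocab_size),
)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-3, weight_decay=0.1)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=200)

data = torch.randint(0, vocab_size, (32, seq_len))  # a tiny fixed "corpus"

for step in range(200):
    logits = model(data)                             # (batch, seq, vocab)
    loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, vocab_size),      # predict token t+1 from t
        data[:, 1:].reshape(-1),
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()                                 # cosine decay of the LR
    if step % 50 == 0:
        lr = scheduler.get_last_lr()[0]
        print(f"step {step:3d}  loss {loss.item():.3f}  lr {lr:.5f}")
```

A real run uses the same pieces at vastly larger scale, typically with a warmup phase before the cosine decay and data streamed continuously from the training corpus.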

✅ Key Takeaways
  • Pre-training is next-token prediction at massive scale
  • Loss decreases as the model learns patterns
  • AdamW with cosine LR schedule is standard
  • Checkpointing every few hours protects against failures (see the save/resume sketch below)
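
To ground that last takeaway, here is a minimal save-and-resume sketch using standard PyTorch state dicts; the file path, save interval, and the model/optimizer/scheduler objects are assumed to come from a training loop like the one sketched above:

```python
# Minimal checkpoint save/resume sketch; path and interval are illustrative.
import torch

CKPT_PATH = "checkpoint.pt"   # hypothetical path
SAVE_EVERY = 1000             # e.g. every N steps (or every few hours of wall clock)

def save_checkpoint(step, model, optimizer, scheduler):
    torch.save(
        {
            "step": step,
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),   # AdamW moments, needed to resume cleanly
            "scheduler": scheduler.state_dict(),   # so the cosine schedule picks up where it left off
        },
        CKPT_PATH,
    )

def load_checkpoint(model, optimizer, scheduler):
    state = torch.load(CKPT_PATH)
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    scheduler.load_state_dict(state["scheduler"])
    return state["step"] + 1   # resume from the next step

# Inside the training loop:
#     if step % SAVE_EVERY == 0:
#         save_checkpoint(step, model, optimizer, scheduler)
```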