Learning to Predict
Pre-training is where the magic happens. The model learns one simple task: predict the next token. By doing this trillions of times across diverse text, it learns language, facts, reasoning, and more.
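To make "predict the next token" concrete, here is a minimal sketch of the training objective, assuming a PyTorch-style model that maps a batch of token IDs to per-position vocabulary logits (the interface and shapes are illustrative, not a specific implementation):

```python
import torch.nn.functional as F

def next_token_loss(model, token_ids):
    """Average cross-entropy of predicting token t+1 from tokens up to t.

    token_ids: LongTensor of shape (batch, seq_len).
    model(token_ids) is assumed to return logits of shape
    (batch, seq_len, vocab_size).
    """
    logits = model(token_ids)                 # (batch, seq, vocab)
    pred = logits[:, :-1, :]                  # predictions for positions 0..seq-2
    target = token_ids[:, 1:]                 # the actual "next" tokens
    return F.cross_entropy(
        pred.reshape(-1, pred.size(-1)),      # (batch * (seq-1), vocab)
        target.reshape(-1),                   # (batch * (seq-1),)
    )
```

Every training example is just a stretch of ordinary text; the "labels" are the same text shifted by one position.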
The World's Longest Fill-in-the-Blank Test
Imagine taking a test with trillions of questions, each asking "What comes next?" At first, you guess randomly. But after billions of examples, patterns emerge. You learn that "The sky is..." is usually followed by "blue" not "banana."
A typical large-scale pre-training run, at a glance:

- Training tokens: ~2T
- Training time: ~3 months
- Compute cost: $100M+
- GPU utilization: 24/7
Play the Prediction Game!
[Interactive quiz: guess the next token for prompts like "The capital of France is ___"]
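If you want to play the same game against a real model, here is a hedged sketch using the Hugging Face transformers library and the public gpt2 checkpoint (any small causal language model would work the same way):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits      # (1, seq_len, vocab_size)

next_token_logits = logits[0, -1]        # scores for whatever comes next
top = torch.topk(next_token_logits, k=5)

for score, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode([int(token_id)])!r}  (logit {float(score):.2f})")
```

A well pre-trained model should rank " Paris" at or near the top of this list, which is exactly the pattern the game is built around.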
The Loss Curve
As training progresses, the loss drops steeply at first and then flattens: the model quickly picks up common words and grammar, while later gains come from rarer facts and harder patterns.

[Interactive chart: training loss decreasing over training steps]
Key Takeaways
- Pre-training is next-token prediction at massive scale
- Loss decreases as the model learns patterns
- AdamW with a cosine learning-rate schedule is standard (see the training-loop sketch below)
- Checkpointing every few hours protects against failures
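As a rough illustration of the last two takeaways, here is a hedged sketch of a training loop with AdamW, a cosine learning-rate schedule, and periodic checkpointing, written in PyTorch; the model, data loader, learning rate, and checkpoint interval are placeholders rather than the settings of any particular production run:

```python
import time

import torch
import torch.nn.functional as F
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

def train(model, data_loader, total_steps, checkpoint_every_secs=4 * 3600):
    """Sketch of a pre-training loop: AdamW + cosine LR decay + periodic checkpoints."""
    optimizer = AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
    scheduler = CosineAnnealingLR(optimizer, T_max=total_steps)
    last_checkpoint = time.time()

    for step, token_ids in enumerate(data_loader):
        # Same next-token objective as the sketch near the top of this section.
        logits = model(token_ids)                          # (batch, seq, vocab)
        loss = F.cross_entropy(
            logits[:, :-1, :].reshape(-1, logits.size(-1)),
            token_ids[:, 1:].reshape(-1),
        )

        loss.backward()
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()

        if step % 100 == 0:
            print(f"step {step:>8}  loss {loss.item():.3f}  "
                  f"lr {scheduler.get_last_lr()[0]:.2e}")

        # Save a checkpoint every few hours so a crash costs hours, not weeks.
        if time.time() - last_checkpoint > checkpoint_every_secs:
            torch.save(
                {
                    "step": step,
                    "model": model.state_dict(),
                    "optimizer": optimizer.state_dict(),
                    "scheduler": scheduler.state_dict(),
                },
                f"checkpoint_{step:08d}.pt",
            )
            last_checkpoint = time.time()

        if step + 1 >= total_steps:
            break
```

Real runs layer learning-rate warmup, gradient clipping, mixed precision, and data/model parallelism on top of this skeleton, but the core loop looks much like this.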