Model Architecture Decisions
Before writing a single line of code, you need to make fundamental decisions about your model's architecture. These choices will determine everything from training cost to final capabilities.
GPU Infrastructure
LLM training requires massive GPU clusters. The choice of GPU affects training speed, memory constraints, and cost efficiency.
| GPU | TFLOPs (FP16, dense) | Memory | Approx. Cloud Cost |
|---|---|---|---|
| NVIDIA A100 (40GB) | 312 | 40 GB | $2/hr |
| NVIDIA A100 (80GB) | 312 | 80 GB | $3/hr |
| NVIDIA H100 (80GB) | 989 | 80 GB | $4/hr |
| NVIDIA H200 | 989 | 141 GB | $5/hr |
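The table above can drive a back-of-the-envelope cost estimate using the common C ≈ 6ND FLOPs rule of thumb (N parameters, D training tokens). The utilization figure and token count below are illustrative assumptions, not quotes:

```python
# Rough training-cost estimate from the C ~= 6 * N * D FLOPs rule of thumb.
# mfu (model FLOPs utilization) of 40% is an assumed, typical-ish value.

def training_cost(n_params, n_tokens, gpu_tflops, price_per_hr, mfu=0.4):
    """Return (GPU-hours, dollar cost) for one training run."""
    total_flops = 6 * n_params * n_tokens        # forward + backward compute
    effective_flops = gpu_tflops * 1e12 * mfu    # sustained FLOPs per GPU
    gpu_hours = total_flops / effective_flops / 3600
    return gpu_hours, gpu_hours * price_per_hr

# Example: 7B-parameter model, 1.4T tokens (Chinchilla-style ~20 tokens/param),
# on H100s at 989 TFLOPs and $4/hr from the table above.
hours, cost = training_cost(7e9, 1.4e12, 989, 4.0)
print(f"{hours:,.0f} GPU-hours, ~${cost:,.0f}")
```

Note the estimate ignores checkpointing, restarts, and experimentation overhead, so real budgets run meaningfully higher.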
Team Structure
Building an LLM from scratch requires a multidisciplinary team. Here's a typical composition for a frontier model effort:
| Role | Headcount | Responsibilities |
|---|---|---|
| ML Research Scientists | 5-20 | Design architecture, run experiments |
| ML Engineers | 10-50 | Build training infrastructure, optimize |
| Data Engineers | 5-15 | Data pipelines, quality, preprocessing |
| Infrastructure/DevOps | 5-20 | GPU clusters, networking, storage |
| Safety & Alignment | 3-10 | Red teaming, RLHF, evaluations |
| Product/PM | 2-5 | Roadmap, requirements, stakeholders |
Key Architecture Choices
Decoder-only vs Encoder-Decoder
Most modern LLMs use a decoder-only architecture (GPT-style). It is simpler, scales better, and works well for generation tasks.
Recommendation: Decoder-only

Attention Mechanism
The main options are Multi-Head Attention (MHA), Grouped-Query Attention (GQA), and Multi-Query Attention (MQA). GQA offers a good balance of quality and inference speed, since fewer KV heads shrink the KV cache.
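A minimal sketch of the GQA idea, assuming a single batch and no masking: several query heads share each key/value head, which is materialized by repeating the KV heads before standard attention. Shapes and head counts here are illustrative:

```python
import numpy as np

def gqa(q, k, v):
    """Grouped-query attention sketch.
    q: (n_q_heads, seq, d); k, v: (n_kv_heads, seq, d)."""
    group = q.shape[0] // k.shape[0]
    # Repeat each KV head so every query head has a matching K/V.
    k = np.repeat(k, group, axis=0)
    v = np.repeat(v, group, axis=0)
    d = q.shape[-1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)        # (heads, seq, seq)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)             # softmax over keys
    return weights @ v

q = np.random.randn(8, 16, 64)   # 8 query heads
k = np.random.randn(2, 16, 64)   # 2 shared KV heads -> 4x smaller KV cache
v = np.random.randn(2, 16, 64)
out = gqa(q, k, v)
print(out.shape)  # (8, 16, 64)
```

With 8 query heads and 2 KV heads, the KV cache is 4x smaller than MHA at inference time, which is the main motivation for GQA.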
Recommendation: GQA for production

Positional Encoding
RoPE (Rotary Position Embedding) is the current standard. It enables context length extension after training and has good extrapolation.
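The core of RoPE can be sketched in a few lines: each pair of features is rotated by a position-dependent angle, so the dot product of rotated queries and keys depends only on their relative offset. The base of 10000 follows the original RoPE formulation; the pairing convention below is one common variant:

```python
import numpy as np

def rope(x, base=10000.0):
    """Apply rotary position embedding. x: (seq_len, dim), dim even."""
    seq_len, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)         # per-pair rotation frequency
    angles = np.outer(np.arange(seq_len), freqs)      # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # 2D rotation of each (x1[i], x2[i]) feature pair.
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

x = np.random.randn(32, 64)
print(rope(x).shape)  # (32, 64)
```

Because each pair undergoes a pure rotation, vector norms are preserved, and context extension tricks (e.g. scaling the base) amount to stretching the rotation frequencies.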
Recommendation: RoPE

Normalization
Pre-LayerNorm (Pre-LN) with RMSNorm is standard: it trains more stably than Post-LN, and RMSNorm is faster than full LayerNorm because it skips mean-centering and the bias term.
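RMSNorm is simple enough to state directly: divide by the root-mean-square of the features and multiply by a learned gain. A minimal sketch:

```python
import numpy as np

def rms_norm(x, gain, eps=1e-6):
    """RMSNorm: scale by 1/RMS(x), no mean subtraction, no bias."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * gain

x = np.random.randn(4, 512)
g = np.ones(512)          # learned per-feature gain, initialized to 1
y = rms_norm(x, g)
print(y.shape)  # (4, 512)
```

Dropping the mean subtraction saves one pass over the features and one learned parameter vector versus LayerNorm, with little to no quality loss in practice.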
Recommendation: Pre-LN with RMSNorm

- Compute cost grows super-linearly with model size: under Chinchilla-style scaling, where training tokens grow with parameter count, doubling parameters roughly quadruples training FLOPs
- H100 GPUs typically offer the best price/performance for large-scale training
- Plan for a 50-150 person team for frontier-model efforts
- Use decoder-only, GQA, RoPE, and RMSNorm as your baseline
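The baseline choices above can be collected into a single config object. This is a hypothetical sketch: the field names and default sizes are illustrative, not tied to any particular framework:

```python
from dataclasses import dataclass

@dataclass
class ModelConfig:
    """Illustrative baseline config reflecting the recommendations above."""
    arch: str = "decoder-only"
    n_layers: int = 32
    d_model: int = 4096
    n_heads: int = 32          # query heads
    n_kv_heads: int = 8        # GQA: 4 query heads share each KV head
    pos_encoding: str = "rope"
    norm: str = "rmsnorm"      # applied pre-attention / pre-MLP (Pre-LN)

cfg = ModelConfig()
print(cfg.n_heads // cfg.n_kv_heads)  # 4 query heads per KV group
```

Pinning these decisions in one place early makes it cheap to sweep alternatives (e.g. KV head counts) before committing the full training budget.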