Phase 1: Foundation (~6 min read, intermediate)

📐 Architecture Planning

Blueprint for Intelligence

Planning compute requirements, team structure, infrastructure, and making critical architectural decisions.

Topics covered: Compute Planning · Team Structure · Infrastructure · Cost Estimation

Model Architecture Decisions

Before writing a single line of code, you need to make fundamental decisions about your model's architecture. These choices will determine everything from training cost to final capabilities.

Building a Skyscraper

Think of model architecture like designing a building. You need to decide: how tall (parameters), how wide (hidden dimensions), how many floors (layers), and how the elevators work (attention mechanisms). Once construction starts, major changes become extremely expensive.

Interactive Cost Estimator

[Interactive widget: sliders for model size (1B-400B parameters) and context length (2K-128K tokens). Example output at 70B parameters and 8K context: ~140 H100 GPUs, ~35 days of training time, ~$353K estimated cost.]
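The estimator's numbers come from a rule of thumb you can reproduce yourself: training compute ≈ 6 × N × D FLOPs, where N is the parameter count and D is the number of training tokens. The sketch below is a back-of-envelope version; the token budget (400B), utilization (40% MFU), and $3/hr GPU rate are illustrative assumptions chosen to roughly reproduce the 70B example above, not measured values.

```python
# Back-of-envelope training estimate from the standard approximation
#   total FLOPs ~= 6 * N * D   (N = parameters, D = training tokens).
# All defaults below are assumptions for illustration only.

def estimate_training(params, tokens, num_gpus=140,
                      gpu_tflops=989,         # H100 peak BF16, dense
                      mfu=0.40,               # assumed model FLOPs utilization
                      usd_per_gpu_hour=3.0):  # assumed cloud rate
    total_flops = 6 * params * tokens
    effective_flops_per_s = num_gpus * gpu_tflops * 1e12 * mfu
    seconds = total_flops / effective_flops_per_s
    gpu_hours = num_gpus * seconds / 3600
    return seconds / 86400, gpu_hours * usd_per_gpu_hour  # (days, USD)

days, cost = estimate_training(params=70e9, tokens=400e9)
print(f"~{days:.0f} days, ~${cost / 1e3:.0f}K")  # ~35 days, ~$354K
```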

GPU Infrastructure

LLM training requires massive GPU clusters. The choice of GPU affects training speed, memory constraints, and cost efficiency.

| GPU | TFLOPs (FP16) | Memory | Cloud Cost |
|---|---|---|---|
| NVIDIA A100 (40GB) | 312 | 40 GB | $2/hr |
| NVIDIA A100 (80GB) | 312 | 80 GB | $3/hr |
| NVIDIA H100 (80GB) | 989 | 80 GB | $4/hr |
| NVIDIA H200 | 989 | 141 GB | $5/hr |
H100 vs A100
The H100 delivers roughly 3x the LLM training throughput of the A100, thanks to its Transformer Engine (FP8 support) and much higher memory bandwidth. For new projects, H100s are typically more cost-effective despite the higher hourly rate.
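A quick way to see why is to compare dollars per unit of compute rather than dollars per hour. The sketch below uses the peak-TFLOPs and example rates from the table above; it ignores real-world utilization differences, which in practice favor the H100 further.

```python
# Price/performance from the table above: dollars per peak PFLOP-hour.
# Hourly rates vary by provider; these are the table's example figures.
gpus = {
    "A100-80GB": {"tflops": 312, "usd_per_hr": 3.0},
    "H100-80GB": {"tflops": 989, "usd_per_hr": 4.0},
}
for name, g in gpus.items():
    print(f"{name}: ${g['usd_per_hr'] / (g['tflops'] / 1000):.2f} per PFLOP-hour")
# A100-80GB: $9.62 per PFLOP-hour
# H100-80GB: $4.04 per PFLOP-hour  -> ~2.4x cheaper per unit of compute
```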

Team Structure

Building an LLM from scratch requires a multidisciplinary team. Here's a typical composition for a frontier model effort:

| Role | Headcount | Responsibilities |
|---|---|---|
| ML Research Scientists | 5-20 | Design architecture, run experiments |
| ML Engineers | 10-50 | Build training infrastructure, optimize performance |
| Data Engineers | 5-15 | Data pipelines, quality, preprocessing |
| Infrastructure/DevOps | 5-20 | GPU clusters, networking, storage |
| Safety & Alignment | 3-10 | Red teaming, RLHF, evaluations |
| Product/PM | 2-5 | Roadmap, requirements, stakeholders |

In total: a team of 50-150 people, a 6-18 month timeline, a $10-100M annual budget, and 24/7 operations.

Key Architecture Choices

Decoder-only vs Encoder-Decoder

Most modern LLMs use a decoder-only architecture (GPT-style). It's simpler than encoder-decoder, scales better, and works well for generation tasks.

Recommendation: Decoder-only
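In practice, "decoder-only" means a single Transformer stack with causal self-attention: each token attends only to itself and earlier positions. A minimal PyTorch sketch of the mask:

```python
import torch

seq_len = 5
# True above the diagonal marks future positions a query must not see.
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
scores = torch.randn(seq_len, seq_len)              # raw attention logits
scores = scores.masked_fill(causal_mask, float("-inf"))
attn = torch.softmax(scores, dim=-1)                # each row sums to 1 over past tokens
print(attn)                                         # upper triangle is exactly zero
```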

Attention Mechanism

The main options are Multi-Head Attention (MHA), Grouped-Query Attention (GQA), and Multi-Query Attention (MQA). GQA offers a good balance of quality and inference speed: groups of query heads share key/value heads, which shrinks the KV cache without the quality loss MQA can incur.

Recommendation: GQA for production
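The three variants differ only in how many key/value heads they keep: MHA has one KV head per query head, MQA a single shared KV head, and GQA sits in between. A shape-level sketch with illustrative head counts:

```python
import torch

n_heads, n_kv_heads, head_dim, seq = 32, 8, 128, 16   # GQA: 32 Q heads share 8 KV heads
q = torch.randn(seq, n_heads, head_dim)
k = torch.randn(seq, n_kv_heads, head_dim)            # MHA would use 32, MQA just 1
v = torch.randn(seq, n_kv_heads, head_dim)

# Each group of n_heads // n_kv_heads query heads reuses one KV head,
# shrinking the KV cache by the same factor (4x here) at inference time.
k = k.repeat_interleave(n_heads // n_kv_heads, dim=1)  # -> (seq, 32, 128)
v = v.repeat_interleave(n_heads // n_kv_heads, dim=1)
scores = torch.einsum("qhd,khd->hqk", q, k) / head_dim ** 0.5
out = torch.einsum("hqk,khd->qhd", scores.softmax(dim=-1), v)
print(out.shape)  # torch.Size([16, 32, 128])
```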

Positional Encoding

RoPE (Rotary Position Embedding) is the current standard. It enables context-length extension after training and extrapolates reasonably well to unseen sequence lengths.

Recommendation: RoPE
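A minimal sketch of one common RoPE formulation (the "rotate-half" variant with the usual 10000^(-2i/d) frequency schedule). Because positions are encoded as rotations of query/key vectors, relative offsets fall out of their dot products:

```python
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary position embedding to x of shape (seq, head_dim)."""
    seq, dim = x.shape
    half = dim // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = torch.arange(seq, dtype=torch.float32)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, :half], x[:, half:]
    # Rotate each (x1, x2) pair by a position-dependent angle.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q = torch.randn(16, 128)  # (seq, head_dim) for a single head
print(rope(q).shape)      # torch.Size([16, 128])
```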

Normalization

Pre-LayerNorm (Pre-LN) placement with RMSNorm is the standard: Pre-LN trains more stably than Post-LN, and RMSNorm is cheaper than full LayerNorm because it skips mean centering and the bias term.

Recommendation: Pre-LN with RMSNorm
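RMSNorm keeps only root-mean-square scaling with a learned gain; in a Pre-LN block it is applied to the input of each sublayer rather than after the residual add. A minimal sketch:

```python
import torch

class RMSNorm(torch.nn.Module):
    """LayerNorm without mean centering or bias: x * rsqrt(mean(x^2)) * g."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = torch.nn.Parameter(torch.ones(dim))  # learned gain g

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

x = torch.randn(2, 16, 512)   # (batch, seq, hidden)
print(RMSNorm(512)(x).shape)  # torch.Size([2, 16, 512])
```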

Key Takeaways
  • Compute requirements scale with both model size and training tokens (FLOPs ≈ 6 × N × D)
  • H100 GPUs offer the best price/performance for training today
  • Plan for a 50-150 person team for frontier models
  • Use decoder-only, GQA, RoPE, and RMSNorm as your baseline