Phase 1: Foundation (~6 min read, intermediate)

📐 Architecture Planning

Blueprint for Intelligence

Planning compute requirements, team structure, infrastructure, and making critical architectural decisions.

Topics covered: Compute Planning · Team Structure · Infrastructure · Cost Estimation

Model Architecture Decisions

Before writing a single line of code, you need to make fundamental decisions about your model's architecture. These choices will determine everything from training cost to final capabilities.

Building a Skyscraper

Think of model architecture like designing a building. You need to decide: how tall (parameters), how wide (hidden dimensions), how many floors (layers), and how the elevators work (attention mechanisms). Once construction starts, major changes become extremely expensive.

Interactive Cost Estimator

[Interactive widget: sliders for model size (1B-400B parameters) and context length (2K-128K tokens). Example output at 70B parameters and 8K context: ~140 H100 GPUs, ~35 days of training time, ~$353K estimated cost.]
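The estimator's numbers come from a rule of thumb you can reproduce yourself: training compute ≈ 6 × N × D FLOPs, where N is the parameter count and D is the number of training tokens. The sketch below is a back-of-envelope version; the token budget (400B), utilization (40% MFU), and $3/hr GPU rate are illustrative assumptions chosen to roughly reproduce the 70B example above, not measured values.

```python
# Back-of-envelope training estimate from the standard approximation
#   total FLOPs ~= 6 * N * D   (N = parameters, D = training tokens).
# All defaults below are assumptions for illustration only.

def estimate_training(params, tokens, num_gpus=140,
                      gpu_tflops=989,         # H100 peak BF16, dense
                      mfu=0.40,               # assumed model FLOPs utilization
                      usd_per_gpu_hour=3.0):  # assumed cloud rate
    total_flops = 6 * params * tokens
    effective_flops_per_s = num_gpus * gpu_tflops * 1e12 * mfu
    seconds = total_flops / effective_flops_per_s
    gpu_hours = num_gpus * seconds / 3600
    return seconds / 86400, gpu_hours * usd_per_gpu_hour  # (days, USD)

days, cost = estimate_training(params=70e9, tokens=400e9)
print(f"~{days:.0f} days, ~${cost / 1e3:.0f}K")  # ~35 days, ~$354K
```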

GPU Infrastructure

LLM training requires massive GPU clusters. The choice of GPU affects training speed, memory constraints, and cost efficiency.

| GPU | TFLOPs (FP16) | Memory | Cloud Cost |
|---|---|---|---|
| NVIDIA A100 (40GB) | 312 | 40 GB | $2/hr |
| NVIDIA A100 (80GB) | 312 | 80 GB | $3/hr |
| NVIDIA H100 (80GB) | 989 | 80 GB | $4/hr |
| NVIDIA H200 | 989 | 141 GB | $5/hr |
H100 vs A100
The H100 delivers roughly 3x the LLM training throughput of the A100, thanks to its Transformer Engine (FP8 support) and much higher memory bandwidth. For new projects, H100s are typically more cost-effective despite the higher hourly rate.
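A quick way to see why is to compare dollars per unit of compute rather than dollars per hour. The sketch below uses the peak-TFLOPs and example rates from the table above; it ignores real-world utilization differences, which in practice favor the H100 further.

```python
# Price/performance from the table above: dollars per peak PFLOP-hour.
# Hourly rates vary by provider; these are the table's example figures.
gpus = {
    "A100-80GB": {"tflops": 312, "usd_per_hr": 3.0},
    "H100-80GB": {"tflops": 989, "usd_per_hr": 4.0},
}
for name, g in gpus.items():
    print(f"{name}: ${g['usd_per_hr'] / (g['tflops'] / 1000):.2f} per PFLOP-hour")
# A100-80GB: $9.62 per PFLOP-hour
# H100-80GB: $4.04 per PFLOP-hour  -> ~2.4x cheaper per unit of compute
```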

Team Structure

Building an LLM from scratch requires a multidisciplinary team. Here's a typical composition for a frontier model effort:

| Role | Headcount | Responsibilities |
|---|---|---|
| ML Research Scientists | 5-20 | Design architecture, run experiments |
| ML Engineers | 10-50 | Build training infrastructure, optimize performance |
| Data Engineers | 5-15 | Data pipelines, quality, preprocessing |
| Infrastructure/DevOps | 5-20 | GPU clusters, networking, storage |
| Safety & Alignment | 3-10 | Red teaming, RLHF, evaluations |
| Product/PM | 2-5 | Roadmap, requirements, stakeholders |

In total: a team of 50-150 people, a 6-18 month timeline, a $10-100M annual budget, and 24/7 operations.

Key Architecture Choices

Decoder-only vs Encoder-Decoder

Most modern LLMs use a decoder-only architecture (GPT-style). It's simpler than encoder-decoder, scales better, and works well for generation tasks.

Recommendation: Decoder-only
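In practice, "decoder-only" means a single Transformer stack with causal self-attention: each token attends only to itself and earlier positions. A minimal PyTorch sketch of the mask:

```python
import torch

seq_len = 5
# True above the diagonal marks future positions a query must not see.
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
scores = torch.randn(seq_len, seq_len)              # raw attention logits
scores = scores.masked_fill(causal_mask, float("-inf"))
attn = torch.softmax(scores, dim=-1)                # each row sums to 1 over past tokens
print(attn)                                         # upper triangle is exactly zero
```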

Attention Mechanism

The main options are Multi-Head Attention (MHA), Grouped-Query Attention (GQA), and Multi-Query Attention (MQA). GQA offers a good balance of quality and inference speed: groups of query heads share key/value heads, which shrinks the KV cache without the quality loss MQA can incur.

Recommendation: GQA for production
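The three variants differ only in how many key/value heads they keep: MHA has one KV head per query head, MQA a single shared KV head, and GQA sits in between. A shape-level sketch with illustrative head counts:

```python
import torch

n_heads, n_kv_heads, head_dim, seq = 32, 8, 128, 16   # GQA: 32 Q heads share 8 KV heads
q = torch.randn(seq, n_heads, head_dim)
k = torch.randn(seq, n_kv_heads, head_dim)            # MHA would use 32, MQA just 1
v = torch.randn(seq, n_kv_heads, head_dim)

# Each group of n_heads // n_kv_heads query heads reuses one KV head,
# shrinking the KV cache by the same factor (4x here) at inference time.
k = k.repeat_interleave(n_heads // n_kv_heads, dim=1)  # -> (seq, 32, 128)
v = v.repeat_interleave(n_heads // n_kv_heads, dim=1)
scores = torch.einsum("qhd,khd->hqk", q, k) / head_dim ** 0.5
out = torch.einsum("hqk,khd->qhd", scores.softmax(dim=-1), v)
print(out.shape)  # torch.Size([16, 32, 128])
```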

Positional Encoding

RoPE (Rotary Position Embedding) is the current standard. It enables context-length extension after training and extrapolates reasonably well to unseen sequence lengths.

Recommendation: RoPE
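A minimal sketch of one common RoPE formulation (the "rotate-half" variant with the usual 10000^(-2i/d) frequency schedule). Because positions are encoded as rotations of query/key vectors, relative offsets fall out of their dot products:

```python
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary position embedding to x of shape (seq, head_dim)."""
    seq, dim = x.shape
    half = dim // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = torch.arange(seq, dtype=torch.float32)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, :half], x[:, half:]
    # Rotate each (x1, x2) pair by a position-dependent angle.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q = torch.randn(16, 128)  # (seq, head_dim) for a single head
print(rope(q).shape)      # torch.Size([16, 128])
```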

Normalization

Pre-LayerNorm (Pre-LN) placement with RMSNorm is the standard: Pre-LN trains more stably than Post-LN, and RMSNorm is cheaper than full LayerNorm because it skips mean centering and the bias term.

Recommendation: Pre-LN with RMSNorm
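RMSNorm keeps only root-mean-square scaling with a learned gain; in a Pre-LN block it is applied to the input of each sublayer rather than after the residual add. A minimal sketch:

```python
import torch

class RMSNorm(torch.nn.Module):
    """LayerNorm without mean centering or bias: x * rsqrt(mean(x^2)) * g."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = torch.nn.Parameter(torch.ones(dim))  # learned gain g

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

x = torch.randn(2, 16, 512)   # (batch, seq, hidden)
print(RMSNorm(512)(x).shape)  # torch.Size([2, 16, 512])
```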

Key Takeaways
  • Compute requirements scale with both model size and training tokens (FLOPs ≈ 6 × N × D)
  • H100 GPUs offer the best price/performance for training today
  • Plan for a 50-150 person team for frontier models
  • Use decoder-only, GQA, RoPE, and RMSNorm as your baseline