State-of-the-Art Improvements
The original 2017 Transformer has been significantly improved. Modern LLMs incorporate numerous optimizations that enable longer context, faster training, and more efficient inference.
Tuning a Race Car
Think of these changes as tuning a race car: the underlying design is still the 2017 original, but nearly every component has been refined so the model runs faster, uses less memory, and handles longer distances (context).
Key Innovations
- Flash Attention: IO-aware attention that reduces memory reads/writes
- Rotary Position Embedding (RoPE): encodes relative positions by rotating query/key vectors (sketched below)
- Grouped-Query Attention (GQA): multiple query heads share each key/value head
- Mixture of Experts (MoE): activates only a subset of weights per token
- SwiGLU: a gated activation function combining Swish and GLU (sketched below)
- RMSNorm: simplified normalization without mean centering (sketched below)
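To make a few of these concrete, here are minimal NumPy sketches of RoPE, SwiGLU, and RMSNorm. The function names, weight shapes, and the interleaved dimension pairing in `rope` are illustrative assumptions, not the exact layout used by any particular model.

```python
import numpy as np

def rope(x, base=10000.0):
    # Rotary Position Embedding: rotate each (even, odd) pair of dimensions
    # by an angle that grows with token position, so query/key dot products
    # depend only on the relative offset between positions.
    seq_len, d = x.shape
    pos = np.arange(seq_len)[:, None]            # (seq_len, 1)
    freqs = base ** (-np.arange(0, d, 2) / d)    # (d/2,) per-pair frequencies
    angles = pos * freqs                         # (seq_len, d/2)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * np.cos(angles) - x2 * np.sin(angles)
    out[:, 1::2] = x1 * np.sin(angles) + x2 * np.cos(angles)
    return out

def swiglu(x, w_gate, w_up, w_down):
    # Gated feed-forward block: Swish(x @ w_gate) elementwise-times (x @ w_up),
    # then projected back down. Replaces the original ReLU MLP.
    gate = x @ w_gate
    swish = gate / (1.0 + np.exp(-gate))         # Swish / SiLU activation
    return (swish * (x @ w_up)) @ w_down

def rmsnorm(x, weight, eps=1e-6):
    # Normalize by the root-mean-square of each vector; unlike LayerNorm,
    # there is no mean subtraction and no bias term.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * weight
```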
Flash Attention Deep Dive
Flash Attention is arguably the most impactful recent optimization. Standard attention is memory-bound: it repeatedly reads and writes the full attention matrix to slow GPU high-bandwidth memory (HBM). Flash Attention uses tiling and recomputation to keep the working blocks in fast on-chip SRAM, computing attention block by block without ever materializing the full attention matrix.
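The core trick is an online (streaming) softmax: process keys and values in blocks, and rescale the running output whenever a new block raises a row's maximum score. The NumPy sketch below shows only that algorithmic structure; the real Flash Attention is a fused GPU kernel, and the block size and function names here are illustrative assumptions.

```python
import numpy as np

def tiled_attention(Q, K, V, block_size=64):
    # Attention computed block by block with an online softmax, mirroring the
    # tiling strategy behind Flash Attention (pure-NumPy sketch, not the kernel).
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros((n, V.shape[1]))      # unnormalized running output
    row_max = np.full(n, -np.inf)        # running max score per query row
    row_sum = np.zeros(n)                # running softmax denominator per row

    for start in range(0, K.shape[0], block_size):
        Kb = K[start:start + block_size]             # one block of keys
        Vb = V[start:start + block_size]             # one block of values
        scores = (Q @ Kb.T) * scale                  # (n, block) partial scores

        new_max = np.maximum(row_max, scores.max(axis=1))
        correction = np.exp(row_max - new_max)       # rescale old accumulators
        p = np.exp(scores - new_max[:, None])        # stabilized exponentials

        out = out * correction[:, None] + p @ Vb
        row_sum = row_sum * correction + p.sum(axis=1)
        row_max = new_max

    return out / row_sum[:, None]
```

Running this against a naive softmax(QKᵀ/√d)·V reference gives the same result up to floating-point error; the speedup on real hardware comes from the fused kernel avoiding HBM traffic, not from this Python loop.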
Mixture of Experts (MoE)
MoE models such as Mixtral replace each feed-forward block with several parallel "expert" networks (eight in Mixtral 8x7B). A small "router" network scores the experts for each token, and only the top-scoring ones (two per token in Mixtral) are actually run, so the parameter count grows much faster than the per-token compute.
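Here is a minimal sketch of that routing step, continuing in NumPy. The top-k selection, ReLU experts, and gating by raw softmax probability are simplifying assumptions; Mixtral, for instance, renormalizes the gate weights over only the selected experts and uses SwiGLU experts.

```python
import numpy as np

def moe_layer(x, router_w, experts, top_k=2):
    # x: (n_tokens, d); router_w: (d, n_experts);
    # experts: list of (w_in, w_out) weight pairs, one per expert.
    logits = x @ router_w
    logits -= logits.max(axis=-1, keepdims=True)       # numerically stable softmax
    probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
    chosen = np.argsort(-probs, axis=-1)[:, :top_k]    # top-k experts per token

    out = np.zeros_like(x)
    for e_idx, (w_in, w_out) in enumerate(experts):
        mask = (chosen == e_idx).any(axis=-1)          # tokens routed to this expert
        if not mask.any():
            continue                                   # idle expert: no compute spent
        h = np.maximum(x[mask] @ w_in, 0.0)            # toy ReLU expert FFN
        gate = probs[mask, e_idx][:, None]             # router weight for this expert
        out[mask] += gate * (h @ w_out)
    return out
```

Only the experts a token is routed to contribute FLOPs for that token, which is why total parameters can scale roughly with the number of experts while per-token compute stays close to that of a single dense feed-forward block.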
- Flash Attention enables 2-4x speedup and longer context
- RoPE allows context extension beyond training length
- GQA reduces inference costs with minimal quality impact
- MoE scales parameters without proportional compute increase