Making It Fast and Small
A 70B-parameter model in FP16 needs about 140 GB of memory for its weights alone, far more than most deployment targets can provide. Optimization techniques like quantization and pruning make models smaller and faster while giving up only a small amount of quality.
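To see where the 140 GB comes from: the footprint is just parameter count times bits per weight. A quick back-of-the-envelope sketch (the `weight_memory_gb` helper is hypothetical, and the estimate covers weights only, not activations or the KV cache):

```python
def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Rough weight-only footprint: parameters x bits per weight, in gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9

for fmt, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4), ("INT2", 2)]:
    print(f"{fmt}: {weight_memory_gb(70e9, bits):.1f} GB")
# FP16: 140.0 GB, INT8: 70.0 GB, INT4: 35.0 GB, INT2: 17.5 GB
# (weights only -- activations and the KV cache need additional memory)
```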
Packing for a Trip
Optimization is like packing a suitcase. The full wardrobe (FP16) won't fit, so you pick versatile items (quantization), leave behind rarely-worn clothes (pruning), and roll everything tight (efficient formats). You still look great with 1/4 the luggage.
- 4x size reduction (INT4)
- 2-3x speed increase
- ~2% quality loss
- Consumer-GPU deployment now possible
Quantization Levels
- FP16: 140 GB, 1x speed, 100% quality
- INT8: 70 GB, 1.5x speed, 99.5% quality
- INT4: 35 GB, 2x speed, 98% quality
- INT2: 17 GB, 3x speed, 90% quality
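A minimal sketch of what these levels mean in code: round-to-nearest symmetric quantization with one scale per output row. Production methods such as GPTQ and AWQ use calibration data and cleverer rounding, but the storage idea is the same; the function names here are illustrative, not from any particular library.

```python
import torch

def quantize_symmetric(w: torch.Tensor, bits: int):
    """Round-to-nearest symmetric quantization with one scale per output row."""
    qmax = 2 ** (bits - 1) - 1                        # 127 for INT8, 7 for INT4
    scale = w.abs().amax(dim=1, keepdim=True) / qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

w = torch.randn(4, 256)                               # stand-in for one weight matrix
q, scale = quantize_symmetric(w, bits=4)
print("mean abs error:", (w - dequantize(q, scale)).abs().mean().item())
# Note: INT4 values still sit in an int8 container here; real kernels pack
# two 4-bit values per byte to realize the full 4x saving.
```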
Other Optimization Techniques
✂️ Pruning
Remove unimportant weights (those near zero). Unstructured pruning leaves sparse matrices most hardware can't exploit; structured pruning removes entire neurons or attention heads for actual speedup.
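A sketch of the unstructured (magnitude) variant, with hypothetical helper names; structured pruning would instead drop whole rows, columns, or heads:

```python
import torch

def magnitude_prune(w: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Zero out the smallest-magnitude fraction of weights (unstructured pruning)."""
    k = int(w.numel() * sparsity)
    threshold = w.abs().flatten().kthvalue(k).values   # k-th smallest magnitude
    return torch.where(w.abs() > threshold, w, torch.zeros_like(w))

w = torch.randn(512, 512)
w_pruned = magnitude_prune(w, sparsity=0.5)
print("fraction zeroed:", (w_pruned == 0).float().mean().item())   # ~0.5
```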
🎓 Distillation
Train a smaller "student" model to mimic a larger "teacher." Can achieve 90% quality at 10% size.
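The usual training objective blends hard-label cross-entropy with a softened KL term against the teacher's logits. A minimal PyTorch sketch (the temperature `T` and mixing weight `alpha` are typical but arbitrary choices):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend of hard-label cross-entropy and soft-label KL against the teacher."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                       # rescale so gradients stay comparable
    return alpha * hard + (1 - alpha) * soft

# toy shapes: batch of 8, vocabulary of 100
student_logits = torch.randn(8, 100, requires_grad=True)
teacher_logits = torch.randn(8, 100)
labels = torch.randint(0, 100, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```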
🔗 Speculative Decoding
Draft with small model, verify with large model. 2-3x speedup with identical outputs.
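A greedy-decoding sketch of one draft-and-verify round, using toy stand-in models; real implementations accept sampled tokens with a probability-ratio test, which is what keeps the output distribution identical to the large model's.

```python
import torch

def make_toy_lm(vocab: int, seed: int):
    """Stand-in 'language model': maps each previous token to fixed next-token logits."""
    table = torch.randn(vocab, vocab, generator=torch.Generator().manual_seed(seed))
    return lambda tokens: table[tokens]               # (len(tokens), vocab) logits

def speculative_step(target_model, draft_model, prefix, k=4):
    """One draft-and-verify round (greedy variant for clarity)."""
    # 1. Draft k tokens cheaply with the small model.
    draft = prefix.clone()
    for _ in range(k):
        draft = torch.cat([draft, draft_model(draft)[-1].argmax().view(1)])

    # 2. Score the whole draft with the large model in a single forward pass.
    target_logits = target_model(draft)

    # 3. Accept draft tokens while they match the large model's greedy choice;
    #    on the first mismatch, keep the large model's token and stop.
    out = prefix.clone()
    for i in range(len(prefix), len(draft)):
        choice = target_logits[i - 1].argmax()
        out = torch.cat([out, choice.view(1)])
        if choice != draft[i]:
            break
    return out

target_model = make_toy_lm(vocab=32, seed=0)          # pretend this is the big model
draft_model = make_toy_lm(vocab=32, seed=1)           # pretend this is the small one
print(speculative_step(target_model, draft_model, prefix=torch.tensor([3, 7])))
```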
📦 KV Cache Compression
Quantize or evict old KV cache entries. Critical for long-context inference.
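A minimal sketch combining the two ideas just mentioned, recency-based eviction plus INT8 storage; the function and window size are illustrative, and real systems use smarter eviction policies and fused kernels.

```python
import torch

def compress_kv(keys: torch.Tensor, values: torch.Tensor, window: int = 1024, bits: int = 8):
    """Evict all but the most recent `window` positions and store the rest as INT8."""
    keys, values = keys[-window:], values[-window:]
    qmax = 2 ** (bits - 1) - 1

    def quantize(x):
        scale = x.abs().amax(dim=-1, keepdim=True) / qmax + 1e-8
        return torch.round(x / scale).to(torch.int8), scale

    (qk, k_scale), (qv, v_scale) = quantize(keys), quantize(values)
    return qk, k_scale, qv, v_scale

# toy cache for one layer: 4096 past positions, 8 heads, head_dim 64
keys = torch.randn(4096, 8, 64)
values = torch.randn(4096, 8, 64)
qk, k_scale, qv, v_scale = compress_kv(keys, values)
print(qk.shape, qk.dtype)   # torch.Size([1024, 8, 64]) torch.int8
```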
✅ Key Takeaways
- INT4 quantization cuts size by 4x with ~2% quality loss
- GPTQ and AWQ are among the most popular quantization methods
- Distillation trains smaller models that come close to larger ones in quality
- Speculative decoding speeds up generation without changing outputs