Gathering the World's Knowledge
Every LLM starts with data — vast amounts of it. The quality and diversity of your training data directly determine what your model can learn. Modern LLMs are trained on trillions of tokens drawn from across the internet and beyond.
Filling the World's Biggest Library
Imagine trying to build a brain that knows everything. First, you'd need to read every book ever written, every website, every conversation. That's what data collection is — assembling humanity's collective knowledge into a format a computer can learn from.
Typical scale for a modern pretraining run:
- ~15T tokens
- ~45TB of raw text
- 300+ languages
- ~8B web pages
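To connect these figures, a rough rule of thumb is that English web text yields one token per 3-4 bytes with common BPE tokenizers. The sketch below is a back-of-the-envelope estimate under that assumption; the ratio varies by language, tokenizer, and content.

```python
# Back-of-the-envelope: how many tokens does a given volume of raw text yield?
# The bytes-per-token ratio is an assumption (roughly 3-4 for English web text
# with common BPE tokenizers); it varies by language, tokenizer, and content.

def approx_tokens(raw_bytes: float, bytes_per_token: float = 4.0) -> float:
    """Estimate token count from raw text size in bytes."""
    return raw_bytes / bytes_per_token

TB = 1e12  # decimal terabytes, good enough for a rough estimate

raw_text_bytes = 45 * TB  # the ~45TB figure above
print(f"~{approx_tokens(raw_text_bytes, 4.0) / 1e12:.1f}T tokens at 4 bytes/token")
print(f"~{approx_tokens(raw_text_bytes, 3.0) / 1e12:.1f}T tokens at 3 bytes/token")
```

At 3-4 bytes per token, ~45TB of text lands in roughly the 11-15T token range, which is in line with the scale quoted above.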
Major Data Sources
- 🌐 Common Crawl: ~400TB, open web crawl archive (~3B pages/month)
- 📚 Books: ~50GB, public domain + licensed (~200K titles)
- 📰 News & Articles: ~100GB, news archives and magazines (~500M articles)
- 💬 Forums & Q&A: ~200GB, Reddit, Stack Overflow (~2B posts)
- 🔬 Scientific Papers: ~80GB, arXiv, PubMed, Semantic Scholar (~90M papers)
- 💻 Code: ~500GB, GitHub, GitLab (~50M repos)
- 📖 Wikipedia: ~20GB, all languages (~60M articles)
- 🎭 Creative Writing: ~10GB, poetry, lyrics, stories (~10M works)
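These sources are not fed to the model in proportion to their raw size; training pipelines typically up-weight smaller, higher-quality sources. The sketch below shows what such a sampling mixture might look like. The source names follow the list above, but the weights are purely hypothetical, not any lab's actual recipe.

```python
import random

# Illustrative pretraining data mixture. Source names mirror the list above,
# but the sampling weights are hypothetical, not any production recipe:
# real mixtures are tuned empirically and usually up-weight high-quality
# sources (books, Wikipedia, papers) well beyond their share of raw bytes.
DATA_MIXTURE = {
    "common_crawl":      0.55,  # filtered web text
    "code":              0.15,  # GitHub, GitLab
    "books":             0.08,
    "scientific_papers": 0.07,  # arXiv, PubMed, Semantic Scholar
    "forums_qa":         0.06,  # Reddit, Stack Overflow
    "news_articles":     0.05,
    "wikipedia":         0.03,
    "creative_writing":  0.01,
}
assert abs(sum(DATA_MIXTURE.values()) - 1.0) < 1e-9

def sample_source(rng: random.Random) -> str:
    """Pick the source of the next training document according to the mixture."""
    sources, weights = zip(*DATA_MIXTURE.items())
    return rng.choices(sources, weights=weights, k=1)[0]

rng = random.Random(0)
print([sample_source(rng) for _ in range(5)])
```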
Data Licensing Considerations
⚠️ Legal Complexity
Data licensing for LLMs is legally complex and rapidly evolving. Major considerations include copyright, fair use, robots.txt compliance, and regional regulations (GDPR, CCPA). Many companies face ongoing lawsuits over training data usage.
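One concrete, widely practiced step is honoring a site's robots.txt before fetching a page. The sketch below uses Python's standard urllib.robotparser; the user-agent string is a placeholder, and this check is only one small piece of responsible data collection (it says nothing about copyright or licensing).

```python
# Minimal robots.txt compliance check using only the standard library.
# The user-agent string is a placeholder for illustration.
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def may_fetch(url: str, user_agent: str = "example-research-crawler") -> bool:
    """Return True if the site's robots.txt allows this user agent to fetch url."""
    parts = urlparse(url)
    robots = RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
    robots.read()  # downloads and parses the site's robots.txt
    return robots.can_fetch(user_agent, url)

if __name__ == "__main__":
    print(may_fetch("https://en.wikipedia.org/wiki/Language_model"))
```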
✅ Safer Sources
- Public domain books
- Open-access papers (arXiv)
- Permissively licensed code
- Wikipedia (CC BY-SA)
- Government documents
⚠️ Riskier Sources
- Copyrighted books
- Paywalled articles
- Social media (ToS issues)
- Proprietary code
- Personal data/PII
✅ Key Takeaways
- LLMs require trillions of tokens from diverse sources
- Common Crawl is the foundation of most web-text corpora but needs heavy filtering (a minimal filter is sketched below)
- Data quality matters more than raw quantity
- Licensing is a major legal consideration
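As a taste of what "heavy filtering" means in practice, here is a toy document filter. The thresholds are illustrative, and real pipelines go much further: deduplication, language identification, perplexity- or classifier-based quality scoring, and PII/toxicity filtering.

```python
# A toy quality filter for web-crawled documents. Thresholds are illustrative;
# production pipelines add deduplication, language identification,
# perplexity- or classifier-based scoring, and PII/toxicity filtering.

def keep_document(text: str) -> bool:
    """Keep a document only if it passes a few cheap heuristic checks."""
    words = text.split()
    if len(words) < 30:  # too short to be informative (threshold is arbitrary)
        return False
    alpha_ratio = sum(c.isalpha() for c in text) / max(len(text), 1)
    if alpha_ratio < 0.6:  # mostly markup, numbers, or encoding noise
        return False
    unique_ratio = len(set(words)) / len(words)
    if unique_ratio < 0.3:  # highly repetitive boilerplate or spam
        return False
    return True

spam = "click here to win big " * 40
prose = ("Large language models learn from diverse, carefully filtered text: "
         "encyclopedias, books, scientific papers, news coverage, forum "
         "discussions, and source code, each contributing different styles, "
         "vocabulary, and reasoning patterns that the final system absorbs "
         "during pretraining on trillions of tokens gathered worldwide.")
print(keep_document(spam), keep_document(prose))  # expect: False True
```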