Phase 2: Data · ~7 min · Beginner

🌍 Data Collection

Gathering the World's Knowledge

Crawling the web, licensing books, collecting code repositories, and assembling the training corpus.

Web Crawling · Common Crawl · Data Sources · Licensing


Every LLM starts with data — vast amounts of it. The quality and diversity of your training data directly determine what your model can learn. Modern LLMs are trained on trillions of tokens drawn from across the internet and beyond.

Filling the World's Biggest Library

Imagine trying to build a brain that knows everything. First, you'd need to read every book ever written, every website, every conversation. That's what data collection is — assembling humanity's collective knowledge into a format a computer can learn from.
  • ~15T tokens (typical)
  • ~45TB raw text
  • 300+ languages
  • ~8B web pages
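These figures hang together via a common rule of thumb of roughly 3 bytes of UTF-8 text per token for English-heavy web corpora. A back-of-envelope sketch in Python (the bytes-per-token ratio is an assumption and varies by tokenizer and language mix):

```python
# Back-of-envelope corpus sizing.
RAW_TEXT_BYTES = 45e12    # ~45 TB of raw text
BYTES_PER_TOKEN = 3.0     # assumed average; depends on tokenizer/language

tokens = RAW_TEXT_BYTES / BYTES_PER_TOKEN
print(f"~{tokens / 1e12:.0f}T tokens")  # -> ~15T tokens
```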


Major Data Sources

  • 🌐 Common Crawl: ~400TB, open web crawl archive (~3B pages/month; see the reading sketch after this list)
  • 📚 Books: ~50GB, public domain + licensed (~200K titles)
  • 📰 News & Articles: ~100GB, news archives and magazines (~500M articles)
  • 💬 Forums & Q&A: ~200GB, Reddit, StackOverflow (~2B posts)
  • 🔬 Scientific Papers: ~80GB, arXiv, PubMed, Semantic Scholar (~90M papers)
  • 💻 Code: ~500GB, GitHub, GitLab (~50M repos)
  • 📖 Wikipedia: ~20GB, all languages (~60M articles)
  • 🎭 Creative Writing: ~10GB, poetry, lyrics, stories (~10M works)
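Common Crawl itself ships as WARC archives (raw HTTP responses) alongside WET files (extracted plain text). Here is a minimal sketch of iterating one WARC segment with the open-source warcio library; the local filename is a placeholder for a real segment path from commoncrawl.org:

```python
from warcio.archiveiterator import ArchiveIterator

# Iterate a (gzipped) Common Crawl WARC file; warcio detects the
# compression automatically. The filename is a placeholder.
with open("CC-MAIN-example.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == "response":  # fetched HTTP payloads
            url = record.rec_headers.get_header("WARC-Target-URI")
            html = record.content_stream().read()  # raw bytes, usually HTML
            print(url, len(html))
```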

Data Licensing Considerations

⚠️ Legal Complexity
Data licensing for LLMs is legally complex and rapidly evolving. Major considerations include copyright, fair use, robots.txt compliance, and regional regulations (GDPR, CCPA). Many companies face ongoing lawsuits over training data usage.
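Compliance begins at crawl time. Below is a minimal sketch of honoring robots.txt with Python's standard library (the domain and user-agent string are placeholders); note that robots.txt is a crawling convention, not a copyright license:

```python
from urllib.robotparser import RobotFileParser

# Check whether a page may be crawled under the site's robots.txt.
rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # fetch and parse

if rp.can_fetch("MyCrawler/1.0", "https://example.com/some/page.html"):
    print("allowed to crawl")
else:
    print("disallowed; skip this page")
```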

✅ Safer Sources

  • Public domain books
  • Open-access papers (arXiv)
  • Permissively licensed code
  • Wikipedia (CC BY-SA)
  • Government documents

⚠️ Riskier Sources

  • Copyrighted books
  • Paywalled articles
  • Social media (ToS issues)
  • Proprietary code
  • Personal data/PII (see the redaction sketch after this list)
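PII in the riskier column is typically scrubbed during preprocessing rather than avoided wholesale. A deliberately simplistic sketch of regex-based redaction (production pipelines use dedicated PII detectors; these two patterns are illustrative and will miss many formats):

```python
import re

# Toy PII scrubber: replaces emails and phone-like numbers.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def scrub(text: str) -> str:
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)

print(scrub("Contact jane.doe@example.com or +1 (555) 123-4567."))
# -> Contact [EMAIL] or [PHONE].
```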
Key Takeaways
  • LLMs require trillions of tokens from diverse sources
  • Common Crawl is the foundation but needs heavy filtering (see the sketch below)
  • Data quality matters more than raw quantity
  • Licensing is a major legal consideration
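To make the "heavy filtering" takeaway concrete, here is an illustrative document-level quality filter. The heuristics loosely echo published rules (e.g., the Gopher paper's quality filters), but every threshold below is an assumption chosen for demonstration:

```python
# Illustrative web-text quality filter; all thresholds are assumptions.
def keep_document(text: str) -> bool:
    words = text.split()
    if not 50 <= len(words) <= 100_000:   # drop very short/long docs
        return False
    mean_len = sum(len(w) for w in words) / len(words)
    if not 3 <= mean_len <= 10:           # gibberish or token soup
        return False
    alpha_frac = sum(w.isalpha() for w in words) / len(words)
    return alpha_frac >= 0.7              # mostly real words, not symbols

print(keep_document("The quick brown fox jumps over the lazy dog. " * 10))  # True
```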