Cleaning the Chaos
Raw internet data is incredibly messy. Before training, we must clean, filter, and deduplicate the data, often removing 70-90% of what we collected. Quality matters far more than quantity.
Panning for Gold
Imagine you've collected tons of river sand hoping to find gold. Most of it is worthless dirt, rocks, and debris. Data cleaning is like carefully sifting through everything to extract only the valuable nuggets: the high-quality text that will actually teach your model something useful.
- Data removed: 70-90%
- Duplicates: ~30%
- PII recall target: 99.9%
- Processing time: weeks
Cleaning Pipeline Simulator
Before Cleaning
<html><head><title>Buy Cheap Pills Online!!!</title></head><body> <nav>Home | About | Contact</nav> <div class="ad">CLICK HERE FOR FREE IPHONE</div> <p>The quick brown fox jumps over the lazy dog.</p> <p>The quick brown fox jumps over the lazy dog.</p> <p>Call John at 555-123-4567 or email john@email.com</p> <footer>© 2024 | Privacy Policy</footer> </body></html>
After Cleaning
The quick brown fox jumps over the lazy dog.
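The before/after pair above can be reproduced with a small sketch, assuming a three-step pipeline: keep only paragraph text, drop paragraphs that contain PII, and remove exact duplicates. The names `ParagraphExtractor`, `clean_html`, and `SAMPLE_PAGE` are hypothetical, the two regular expressions are illustrative only, and dropping a whole paragraph when PII is found is an assumption inferred from the cleaned output (a production pipeline might redact instead).

```python
import re
from html.parser import HTMLParser

# Illustrative patterns only; real PII detection needs far broader coverage.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")

SAMPLE_PAGE = (
    "<html><head><title>Buy Cheap Pills Online!!!</title></head><body> "
    "<nav>Home | About | Contact</nav> "
    '<div class="ad">CLICK HERE FOR FREE IPHONE</div> '
    "<p>The quick brown fox jumps over the lazy dog.</p> "
    "<p>The quick brown fox jumps over the lazy dog.</p> "
    "<p>Call John at 555-123-4567 or email john@email.com</p> "
    "<footer>\u00a9 2024 | Privacy Policy</footer> </body></html>"
)


class ParagraphExtractor(HTMLParser):
    """Collects the text of <p> elements and ignores everything else
    (title, navigation, ads, footer)."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_p = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            self._in_p = False
            # Collapse runs of whitespace inside the paragraph.
            text = " ".join("".join(self._buf).split())
            if text:
                self.paragraphs.append(text)

    def handle_data(self, data):
        if self._in_p:
            self._buf.append(data)


def clean_html(raw_html):
    """Strip markup, drop paragraphs with PII, and remove exact duplicates."""
    parser = ParagraphExtractor()
    parser.feed(raw_html)
    parser.close()

    seen = set()
    kept = []
    for para in parser.paragraphs:
        if EMAIL_RE.search(para) or PHONE_RE.search(para):
            continue  # assumption: drop the whole paragraph rather than redact
        if para in seen:
            continue  # exact duplicate of an earlier paragraph
        seen.add(para)
        kept.append(para)
    return "\n".join(kept)


print(clean_html(SAMPLE_PAGE))
```

Keeping only `<p>` content is a deliberately crude stand-in for boilerplate removal; it works on this sample because the navigation, ad, and footer all sit outside paragraph tags.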
Quality Filtering Techniques
Perplexity Filtering
Use a small language model to score each document. High perplexity means the model finds the text surprising, which often signals unusual or low-quality content. Remove documents whose perplexity exceeds a chosen threshold.
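As a rough illustration of perplexity scoring, the sketch below swaps the small language model for a character bigram model with add-one smoothing, trained on a tiny reference string. This is a stand-in chosen so the example runs without external dependencies, not the model a real pipeline would use; `CharBigramModel`, `filter_by_perplexity`, and the reference text are all hypothetical.

```python
import math
from collections import Counter


class CharBigramModel:
    """Toy character bigram model (add-one smoothing).
    Assumption: a stand-in for the small language model used in practice."""

    def __init__(self, reference_text):
        text = reference_text.lower()
        self.prev_counts = Counter(text[:-1])          # how often each char precedes another
        self.bigram_counts = Counter(zip(text, text[1:]))
        self.vocab_size = len(set(text)) + 1           # +1 bucket for unseen characters

    def perplexity(self, doc):
        text = doc.lower()
        if len(text) < 2:
            return float("inf")
        log_prob = 0.0
        for prev, cur in zip(text, text[1:]):
            num = self.bigram_counts[(prev, cur)] + 1
            den = self.prev_counts[prev] + self.vocab_size
            log_prob += math.log(num / den)
        # Perplexity is exp of the average negative log-probability per step.
        return math.exp(-log_prob / (len(text) - 1))


def filter_by_perplexity(docs, model, threshold):
    """Keep documents whose perplexity does not exceed the threshold."""
    return [d for d in docs if model.perplexity(d) <= threshold]


REFERENCE = (
    "the quick brown fox jumps over the lazy dog. "
    "cleaning text helps a language model learn."
)
model = CharBigramModel(REFERENCE)

clean_doc = "the lazy dog jumps over the quick brown fox"
noisy_doc = "zxqj##@@!!kkvvq%%"
print(model.perplexity(clean_doc), model.perplexity(noisy_doc))
```

The threshold is a tuning decision: set it too low and unusual but valuable text (poetry, code, dialects) is discarded along with the noise.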
MinHash Deduplication
Convert each document to a compact hash signature, then find near-duplicates efficiently by comparing signatures instead of full texts. Removes ~30% of Common Crawl data.
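A minimal MinHash sketch, assuming word-level shingles of length 3 and 128 hash functions simulated by prefixing a seed to each shingle before hashing with SHA-1. The function names and parameter values are illustrative assumptions; large-scale pipelines typically add locality-sensitive hashing on top so that not every pair of signatures has to be compared.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of k-word shingles of a document."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(seed, shingle):
    """64-bit hash of a shingle under a given seed."""
    digest = hashlib.sha1(f"{seed}:{shingle}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(text, num_hashes=128, k=3):
    """For each seeded hash function, keep the minimum hash over all shingles."""
    doc_shingles = shingles(text, k)
    return [min(_hash(seed, s) for s in doc_shingles) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots estimates Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


doc_a = "the quick brown fox jumps over the lazy dog near the quiet river bank today"
doc_b = "the quick brown fox jumps over the lazy dog near the quiet river bank yesterday"
doc_c = "language models need large amounts of carefully filtered training text to work"

sig_a, sig_b, sig_c = (minhash_signature(d) for d in (doc_a, doc_b, doc_c))
print(estimated_jaccard(sig_a, sig_b), estimated_jaccard(sig_a, sig_c))
```

Documents whose estimated similarity exceeds a chosen cutoff are treated as near-duplicates, and only one copy is kept.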
Classifier-based Filtering
Train classifiers to identify spam, adult content, and hate speech. These classifiers are often combined with blocklists.
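One way to picture the classifier-plus-blocklist combination is the toy sketch below: a hand-written multinomial Naive Bayes spam classifier trained on six made-up examples, with a blocklist check applied first. The blocklist terms, training sentences, and all names are hypothetical; a real filter would use a far larger labeled set and a stronger model.

```python
import math
import re
from collections import Counter

BLOCKLIST = {"pills", "casino"}  # hypothetical terms, for illustration only


def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayesFilter:
    """Tiny multinomial Naive Bayes with add-one smoothing over two labels."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ok": Counter()}
        self.doc_counts = Counter()
        self.vocab = set()

    def train(self, examples):
        for text, label in examples:
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.doc_counts[label] += 1
            self.vocab.update(tokens)

    def _log_score(self, tokens, label):
        score = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        total_words = sum(self.word_counts[label].values())
        for tok in tokens:
            if tok not in self.vocab:
                continue  # ignore words never seen in training
            count = self.word_counts[label][tok] + 1
            score += math.log(count / (total_words + len(self.vocab)))
        return score

    def is_spam(self, text):
        tokens = tokenize(text)
        return self._log_score(tokens, "spam") > self._log_score(tokens, "ok")


def keep_document(text, classifier):
    """Reject on any blocklist hit first, then defer to the classifier."""
    if set(tokenize(text)) & BLOCKLIST:
        return False
    return not classifier.is_spam(text)


# Made-up training data (assumption), far too small for real use.
TRAINING = [
    ("click here for a free iphone", "spam"),
    ("win free money now click here", "spam"),
    ("free prize click now", "spam"),
    ("the quick brown fox jumps over the lazy dog", "ok"),
    ("language models learn from clean text", "ok"),
    ("deduplication removes repeated documents", "ok"),
]
classifier = NaiveBayesFilter()
classifier.train(TRAINING)
print(keep_document("the model learns from clean documents", classifier))
```

Running the blocklist first is cheap and catches obvious cases, while the classifier handles text that avoids the listed terms.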
Key Takeaways
- Expect to remove 70-90% of raw collected data
- Deduplication alone removes ~30% of web data
- Quality filtering using perplexity scores is highly effective
- PII removal is critical for legal and ethical compliance