Phase 2: Data · ~8 min · Intermediate

✂️ Tokenization


Converting text into tokens using BPE, WordPiece, or SentencePiece algorithms.

BPE Algorithm · Vocabulary Size · Subword Tokenization · Special Tokens

Breaking Language into Pieces

Computers can't read text like humans — they only understand numbers. Tokenization is the process of converting text into a sequence of integers that the model can process. Each unique piece of text (a "token") gets assigned a number from the vocabulary.

Like Building Blocks

Think of tokens like LEGO bricks. You can build almost anything by combining a fixed set of standardized pieces. Similarly, a tokenizer breaks any text into a sequence of ~50,000 standard pieces, which can then be combined to represent any possible text.
Vocabulary Size: ~50K
Words per Token: ~0.75
Most Common Algorithm: BPE
Average Token Length: ~4 characters

Interactive Tokenizer

Try tokenizing a short string such as "Hello, world!", then your own text, and look at the sequence of integer IDs that comes back.
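A minimal sketch of doing this in Python with OpenAI's tiktoken library (covered in the comparison below); this assumes tiktoken is installed, and the exact token IDs depend on which encoding you load.

```python
import tiktoken

# Load the byte-level BPE encoding used by GPT-3.5/4-era models.
enc = tiktoken.get_encoding("cl100k_base")

text = "Hello, world!"
token_ids = enc.encode(text)   # text -> list of integers
print(token_ids)               # e.g. [9906, 11, 1917, 0]
print(enc.decode(token_ids))   # integers -> original text: "Hello, world!"
```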

How BPE Works

Byte-Pair Encoding (BPE) is the most common tokenization algorithm. It starts from individual characters (or bytes) and repeatedly merges the most frequent adjacent pair into a new token, as the steps and the short code sketch below illustrate.

Step 1: t h e → start with individual characters
Step 2: th e → merge "t" + "h" (the most frequent pair)
Step 3: the → merge "th" + "e"
Result: "the" becomes a single token instead of 3 characters

Tokenizer Comparison

  • BPE (Byte-Pair Encoding), used by GPT and Llama: merges frequent byte pairs iteratively
  • WordPiece, used by BERT and DistilBERT: likelihood-based subword merging
  • SentencePiece, used by T5 and ALBERT: language-agnostic; includes BPE and Unigram models
  • Tiktoken, used by GPT-3.5/4: OpenAI's fast BPE implementation
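
As a rough illustration of how two of these schemes differ in practice, here is a sketch using the Hugging Face transformers library, assuming the pretrained gpt2 (byte-level BPE) and bert-base-uncased (WordPiece) tokenizers can be downloaded; the exact subword pieces depend on the checkpoint.

```python
from transformers import AutoTokenizer

bpe = AutoTokenizer.from_pretrained("gpt2")                     # byte-level BPE
wordpiece = AutoTokenizer.from_pretrained("bert-base-uncased")  # WordPiece

word = "tokenization"
print(bpe.tokenize(word))        # e.g. ['token', 'ization']
print(wordpiece.tokenize(word))  # e.g. ['token', '##ization']; "##" marks a continuation piece
```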

Key Takeaways
  • Tokenization converts text to integers the model understands
  • BPE is the dominant algorithm, balancing efficiency and coverage
  • Vocabulary size is typically ~32K-100K tokens
  • Common words are single tokens; rare words are split
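
To see the last point concretely, here is a small sketch (again assuming tiktoken is installed) comparing a common word with a rare one; the exact splits depend on the encoding.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for word in ["the", "antidisestablishmentarianism"]:
    ids = enc.encode(word)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{word!r}: {len(ids)} token(s) -> {pieces}")

# Expected pattern: the common word is a single token,
# while the rare word is split into several subword pieces.
```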