Breaking Language into Pieces
Computers can't read text the way humans do; they only work with numbers. Tokenization is the process of converting text into a sequence of integers that the model can process. Each unique piece of text (a "token") is assigned a number from a fixed vocabulary.
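For a concrete picture, here is a minimal sketch of that text-to-integers mapping, assuming the tiktoken package is installed (any tokenizer library would show the same idea):

```python
# Minimal sketch: text -> integer token IDs -> text.
# Assumes `pip install tiktoken`; cl100k_base is the BPE vocabulary used by GPT-3.5/4.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

ids = enc.encode("Hello, world!")
print(ids)                             # a short list of integers (exact values depend on the vocabulary)
print([enc.decode([i]) for i in ids])  # the text piece behind each ID
print(enc.decode(ids))                 # round-trips back to "Hello, world!"
```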
Like Building Blocks
Think of tokens like LEGO bricks. You can build almost anything by combining a fixed set of standardized pieces. Similarly, a tokenizer breaks any text into a sequence of ~50,000 standard pieces, which can then be combined to represent any possible text.
At a glance:
- Typical vocabulary size: ~50K tokens
- Text-to-token ratio: ~0.75 words per token (about 1.3 tokens per word)
- Most common algorithm: BPE (Byte-Pair Encoding)
- Average token length: ~4 characters
How BPE Works
Byte-Pair Encoding (BPE) is the most common tokenization algorithm. It starts with individual characters (or raw bytes), then iteratively merges the most frequent adjacent pair into a new token until the vocabulary reaches its target size.
- Step 1: t h e → start with individual characters
- Step 2: th e → merge "t" + "h" (the most frequent pair)
- Step 3: the → merge "th" + "e"
Result: "the" becomes a single token instead of 3 characters
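The same merge loop fits in a few lines of Python. This is a toy training sketch under made-up word counts, not a production tokenizer:

```python
# Toy BPE training sketch: repeatedly merge the most frequent adjacent pair.
# The corpus, counts, and number of merge steps are illustrative only.
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(words, pair):
    """Rewrite every word, replacing each occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Each word starts as a tuple of characters with a frequency count.
words = {("t", "h", "e"): 5, ("t", "h", "a", "t"): 2}
for step in range(2):
    pair = most_frequent_pair(words)
    words = merge_pair(words, pair)
    print(f"Step {step + 1}: merged {pair} -> {list(words)}")
# Step 1 merges ("t", "h"); Step 2 merges ("th", "e"), so "the" ends up as one token.
```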
Tokenizer Comparison
- BPE (Byte-Pair Encoding): used by GPT and Llama; merges frequent byte pairs iteratively
- WordPiece: used by BERT and DistilBERT; likelihood-based subword merging
- SentencePiece: used by T5 and ALBERT; language-agnostic, includes BPE and Unigram modes
- Tiktoken: used by GPT-3.5/4; OpenAI's fast BPE implementation
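To see two of these families side by side, here is a hedged sketch that tokenizes the same sentence with a BPE vocabulary (via tiktoken) and a WordPiece vocabulary (via Hugging Face's AutoTokenizer). It assumes both packages are installed and that the BERT vocabulary can be downloaded:

```python
# Compare how BPE and WordPiece split the same sentence.
# Assumes `pip install tiktoken transformers` and a one-time download of bert-base-uncased.
import tiktoken
from transformers import AutoTokenizer

text = "Tokenization splits rare words into subwords."

bpe = tiktoken.get_encoding("cl100k_base")                      # BPE (GPT-3.5/4 style)
wordpiece = AutoTokenizer.from_pretrained("bert-base-uncased")  # WordPiece (BERT)

print("BPE pieces:      ", [bpe.decode([t]) for t in bpe.encode(text)])
print("WordPiece pieces:", wordpiece.tokenize(text))
```

In the WordPiece output, a "##" prefix marks a piece that continues the previous word; in the BPE output, a leading space inside a token marks the start of a new word.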
Key Takeaways
- Tokenization converts text to integers the model understands
- BPE is the dominant algorithm, balancing efficiency and coverage
- Vocabulary size is typically ~32K-100K tokens
- Common words are single tokens; rare words are split