Phase 2: Data · ~8 min · Intermediate

✂️ Tokenization


Converting text into tokens using BPE, WordPiece, or SentencePiece algorithms.

BPE Algorithm · Vocabulary Size · Subword Tokenization · Special Tokens

Breaking Language into Pieces

Computers can't read text like humans — they only understand numbers. Tokenization is the process of converting text into a sequence of integers that the model can process. Each unique piece of text (a "token") gets assigned a number from the vocabulary.

Like Building Blocks

Think of tokens like LEGO bricks. You can build almost anything by combining a fixed set of standardized pieces. Similarly, a tokenizer breaks any text into a sequence of ~50,000 standard pieces, which can then be combined to represent any possible text.
Vocabulary Size: ~50K
Words per Token: ~0.75
Most Common Algorithm: BPE
Average Token Length: ~4 characters

Interactive Tokenizer

Try tokenizing a short string such as "Hello, world!", then your own text, and look at the sequence of integer IDs that comes back.
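A minimal sketch of doing this in Python with OpenAI's tiktoken library (covered in the comparison below); this assumes tiktoken is installed, and the exact token IDs depend on which encoding you load.

```python
import tiktoken

# Load the byte-level BPE encoding used by GPT-3.5/4-era models.
enc = tiktoken.get_encoding("cl100k_base")

text = "Hello, world!"
token_ids = enc.encode(text)   # text -> list of integers
print(token_ids)               # e.g. [9906, 11, 1917, 0]
print(enc.decode(token_ids))   # integers -> original text: "Hello, world!"
```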

How BPE Works

Byte-Pair Encoding (BPE) is the most common tokenization algorithm. It starts from individual characters (or bytes) and repeatedly merges the most frequent adjacent pair into a new token, as the steps and the short code sketch below illustrate.

Step 1: t h e → start with individual characters
Step 2: th e → merge "t" + "h" (the most frequent pair)
Step 3: the → merge "th" + "e"
Result: "the" becomes a single token instead of 3 characters

Tokenizer Comparison

  • BPE (Byte-Pair Encoding), used by GPT and Llama: merges frequent byte pairs iteratively
  • WordPiece, used by BERT and DistilBERT: likelihood-based subword merging
  • SentencePiece, used by T5 and ALBERT: language-agnostic; includes BPE and Unigram models
  • Tiktoken, used by GPT-3.5/4: OpenAI's fast BPE implementation
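
As a rough illustration of how two of these schemes differ in practice, here is a sketch using the Hugging Face transformers library, assuming the pretrained gpt2 (byte-level BPE) and bert-base-uncased (WordPiece) tokenizers can be downloaded; the exact subword pieces depend on the checkpoint.

```python
from transformers import AutoTokenizer

bpe = AutoTokenizer.from_pretrained("gpt2")                     # byte-level BPE
wordpiece = AutoTokenizer.from_pretrained("bert-base-uncased")  # WordPiece

word = "tokenization"
print(bpe.tokenize(word))        # e.g. ['token', 'ization']
print(wordpiece.tokenize(word))  # e.g. ['token', '##ization']; "##" marks a continuation piece
```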

Key Takeaways
  • Tokenization converts text to integers the model understands
  • BPE is the dominant algorithm, balancing efficiency and coverage
  • Vocabulary size is typically ~32K-100K tokens
  • Common words are single tokens; rare words are split
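
To see the last point concretely, here is a small sketch (again assuming tiktoken is installed) comparing a common word with a rare one; the exact splits depend on the encoding.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for word in ["the", "antidisestablishmentarianism"]:
    ids = enc.encode(word)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{word!r}: {len(ids)} token(s) -> {pieces}")

# Expected pattern: the common word is a single token,
# while the rare word is split into several subword pieces.
```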