Tokenizer Explorer
See how different models break text into tokens
TYPE: BPE
VOCAB SIZE: 50,257
HOW IT WORKS: Byte-Pair Encoding - learns common character sequences
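The panel above describes GPT-2's BPE tokenizer. If you want to reproduce those numbers outside the app, the open-source tiktoken library ships the same GPT-2 vocabulary; a minimal sketch (the library choice is an assumption, not necessarily what this page uses under the hood):

import tiktoken

# Load the GPT-2 byte-level BPE tokenizer; its vocabulary holds 50,257 entries.
enc = tiktoken.get_encoding("gpt2")
print(enc.n_vocab)  # 50257

# Encode a sentence into token IDs, then map each ID back to its text piece.
ids = enc.encode("Tokenizers break text into pieces.")
print(ids)
print([enc.decode([i]) for i in ids])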
// why_tokenization_matters
LLMs don't see text like we do. They process sequences of tokens - pieces of text that might be words, parts of words, or even single characters.
WordPiece (BERT)
Splits rare words into known subwords; non-initial pieces get a ## prefix: "playing" → "play" + "##ing"
BPE (GPT-2/3/4)
Learns frequent byte pairs. Spaces often become Ġ at word starts.
SentencePiece (LLaMA)
Uses ▁ for spaces. Works directly on raw text without pre-tokenization.
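To see all three behaviours side by side, here is a rough sketch using Hugging Face transformers tokenizers. The model names are my picks: official LLaMA checkpoints are gated, so the publicly available xlm-roberta-base stands in to show SentencePiece's ▁ behaviour.

from transformers import AutoTokenizer

text = "unbelievably playing"

# WordPiece, byte-level BPE, and SentencePiece models, in that order.
for name in ["bert-base-uncased", "gpt2", "xlm-roberta-base"]:
    tok = AutoTokenizer.from_pretrained(name)
    print(f"{name:18} -> {tok.tokenize(text)}")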
Pro tip: Fewer tokens = faster & cheaper inference. That's why "hello" (1 token) costs less than "supercalifragilistic" (multiple tokens).
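A quick way to check that claim yourself, again with tiktoken's GPT-2 encoding (same assumption as above):

import tiktoken

enc = tiktoken.get_encoding("gpt2")
for word in ["hello", "supercalifragilistic"]:
    ids = enc.encode(word)
    print(f"{word!r}: {len(ids)} token(s) -> {ids}")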