
Tokenizer Explorer

See how different models break text into tokens

TYPE: BPE
VOCAB SIZE: 50,257
HOW IT WORKS: Byte-Pair Encoding - learns common character sequences by repeatedly merging the most frequent pair of adjacent symbols into a new vocabulary entry.
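The numbers above are easy to check directly. A minimal sketch, assuming the tiktoken package is installed:

import tiktoken

enc = tiktoken.get_encoding("gpt2")   # GPT-2's byte-level BPE tokenizer
print(enc.n_vocab)                    # 50257 entries in the vocabulary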

// why_tokenization_matters

LLMs don't see text like we do. They process sequences of tokens - pieces of text that might be words, parts of words, or even single characters.
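Here is roughly what a model receives instead of raw text (a sketch using tiktoken's GPT-2 encoding; the exact IDs depend on which tokenizer you pick):

import tiktoken

enc = tiktoken.get_encoding("gpt2")
ids = enc.encode("Tokenizers split text into pieces.")
print(ids)               # a list of integer token IDs, not characters
print(len(ids))          # how many tokens the model actually sees
print(enc.decode(ids))   # decoding the IDs recovers the original text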

WordPiece (BERT)

Words not in the vocabulary are split into subword pieces, and continuation pieces carry a ## prefix: "playing" → "play" + "##ing"
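A quick way to see the ## pieces (a sketch, assuming the transformers package is installed; the exact splits depend on the model's vocabulary):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tok.tokenize("tokenization"))   # continuation pieces start with ##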

BPE (GPT-2/3/4)

Learns frequent byte pairs. Spaces often become Ġ at word starts.
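The Ġ marker shows up if you look at the string form of GPT-2's tokens rather than the IDs (a sketch using transformers):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
print(tok.tokenize("hello world"))   # e.g. ['hello', 'Ġworld'] - Ġ marks the leading space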

SentencePiece (LLaMA)

Uses ▁ for spaces. Works directly on raw text without pre-tokenization.
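The same check works for SentencePiece models. LLaMA's own tokenizer download is gated behind a license, so this sketch uses T5's openly available SentencePiece tokenizer, which follows the same ▁ convention (assumes transformers and sentencepiece are installed):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("t5-small")
print(tok.tokenize("Hello world"))   # e.g. ['▁Hello', '▁world'] - ▁ marks word boundaries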

Pro tip: Fewer tokens = faster & cheaper inference. That's why "hello" (1 token) costs less than "supercalifragilistic" (multiple tokens).
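You can verify the cost claim by counting tokens yourself (a sketch using tiktoken's GPT-2 encoding; counts differ slightly across tokenizers):

import tiktoken

enc = tiktoken.get_encoding("gpt2")
for word in ["hello", "supercalifragilistic"]:
    print(word, len(enc.encode(word)))   # the short word needs far fewer tokens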