
Tokenizer Explorer

See how different models break text into tokens

TYPE: BPE
VOCAB SIZE: 50,257
HOW IT WORKS: Byte-Pair Encoding - learns common character sequences by repeatedly merging the most frequent pair of adjacent symbols into a new vocabulary entry.
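The numbers above are easy to check directly. A minimal sketch, assuming the tiktoken package is installed:

import tiktoken

enc = tiktoken.get_encoding("gpt2")   # GPT-2's byte-level BPE tokenizer
print(enc.n_vocab)                    # 50257 entries in the vocabulary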

// why_tokenization_matters

LLMs don't see text like we do. They process sequences of tokens - pieces of text that might be words, parts of words, or even single characters.
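Here is roughly what a model receives instead of raw text (a sketch using tiktoken's GPT-2 encoding; the exact IDs depend on which tokenizer you pick):

import tiktoken

enc = tiktoken.get_encoding("gpt2")
ids = enc.encode("Tokenizers split text into pieces.")
print(ids)               # a list of integer token IDs, not characters
print(len(ids))          # how many tokens the model actually sees
print(enc.decode(ids))   # decoding the IDs recovers the original text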

WordPiece (BERT)

Words not in the vocabulary are split into subword pieces, and continuation pieces carry a ## prefix: "playing" → "play" + "##ing"
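A quick way to see the ## pieces (a sketch, assuming the transformers package is installed; the exact splits depend on the model's vocabulary):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tok.tokenize("tokenization"))   # continuation pieces start with ##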

BPE (GPT-2/3/4)

Learns frequent byte pairs. Spaces often become Ġ at word starts.
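The Ġ marker shows up if you look at the string form of GPT-2's tokens rather than the IDs (a sketch using transformers):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
print(tok.tokenize("hello world"))   # e.g. ['hello', 'Ġworld'] - Ġ marks the leading space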

SentencePiece (LLaMA)

Uses ▁ for spaces. Works directly on raw text without pre-tokenization.
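The same check works for SentencePiece models. LLaMA's own tokenizer download is gated behind a license, so this sketch uses T5's openly available SentencePiece tokenizer, which follows the same ▁ convention (assumes transformers and sentencepiece are installed):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("t5-small")
print(tok.tokenize("Hello world"))   # e.g. ['▁Hello', '▁world'] - ▁ marks word boundaries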

Pro tip: Fewer tokens = faster & cheaper inference. That's why "hello" (1 token) costs less than "supercalifragilistic" (multiple tokens).
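You can verify the cost claim by counting tokens yourself (a sketch using tiktoken's GPT-2 encoding; counts differ slightly across tokenizers):

import tiktoken

enc = tiktoken.get_encoding("gpt2")
for word in ["hello", "supercalifragilistic"]:
    print(word, len(enc.encode(word)))   # the short word needs far fewer tokens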