Glossary/Tokenization

What is Tokenization in Large Language Models?

Tokenization is the process of splitting raw text into discrete units — tokens — that the model actually reads and generates. Tokens are usually subword fragments, not whole words. A model never sees "tokenization"; it sees something like [token, ization], two integer IDs from a vocabulary of ~100K to 200K entries. This split happens before the model and is invisible to it. Most LLM-specific security oddities live at this layer.
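The string-to-IDs mapping can be sketched with a toy greedy longest-match lookup. The four-entry vocabulary and the IDs below are made up for illustration; real vocabularies hold 100K+ learned entries, and real BPE applies learned merge rules rather than longest-match, but the input/output shape is the same:

```python
# Hypothetical toy vocabulary: subword string -> integer token ID
VOCAB = {"token": 17, "ization": 42, "ize": 7, "ion": 9}

def encode(text, vocab):
    """Greedy longest-match tokenizer sketch: at each position, consume
    the longest substring that exists in the vocabulary."""
    ids, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                ids.append(vocab[text[i:j]])
                i = j
                break
        else:
            raise KeyError(f"no token covers {text[i]!r}")
    return ids

print(encode("tokenization", VOCAB))  # [17, 42] -> "token" + "ization"
```

The model only ever receives the list `[17, 42]`; the surface string is gone by the time inference starts.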

How tokenization works

Modern models use byte-pair encoding (BPE) or close variants. The tokenizer is trained once on a large corpus to find the most common character sequences and assign them token IDs. Common sequences become single tokens; rare sequences split into multiple tokens.
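The merge-learning loop can be sketched in a few lines. This is a toy character-level trainer on a tiny made-up corpus; production BPE operates on bytes over billions of documents, but the core algorithm is the same: count adjacent pairs, merge the most frequent, repeat.

```python
from collections import Counter

def bpe_train(corpus, num_merges):
    """Toy BPE trainer: start from characters, repeatedly merge the most
    frequent adjacent symbol pair into a new single symbol."""
    words = Counter(tuple(w) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        # Rewrite every word with the new merged symbol
        merged = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] += freq
        words = merged
    return merges

merges = bpe_train("low low low lower lowest", 3)
print(merges)  # [('l', 'o'), ('lo', 'w'), ('low', 'e')]
```

After three merges, "low" has collapsed into a single token because it is the most common sequence in the corpus, while the rarer suffixes "er" and "est" remain split.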

Side effects of how tokenizers actually work:

  1. The same word gets different token IDs depending on context: "hello", " hello" (with a leading space), and "Hello" typically map to different tokens or token sequences.

  2. Token boundaries don't line up with characters or words, so character-level tasks (counting letters, reversing strings) are unreliable for the model.

  3. Rare words, numbers, code, and non-English text fragment into many tokens, inflating cost and often degrading quality on that input.

  4. Visually identical strings can tokenize completely differently: homoglyphs, zero-width characters, and variation selectors all change the token sequence without changing what a human sees.

Why tokenization is a security boundary

Three classes of attack hide at the tokenization layer:

  1. Encoding-based jailbreaks. A harmful request encoded in base64, Unicode variation selectors, leetspeak, or zero-width characters tokenizes differently than its plaintext form. Safety classifiers that inspect the pre-tokenization string see one thing; the model that processes the tokens sees another.

  2. Tokenizer-classifier mismatch. Many guardrails run their own tokenizer, then classify. If the guardrail's tokenizer normalizes Unicode (stripping variation selectors) but the model's tokenizer preserves them, the guardrail's classification doesn't match what the model actually receives. Repello's emoji prompt injection research demonstrated this in production guardrail products.

  3. Glitch tokens. Some tokens, typically rare strings that received almost no training signal, produce wildly off-distribution behavior when they appear in a prompt. SolidGoldMagikarp is the famous historical example. Most known glitch tokens have been patched, but the class still exists.
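The first two classes can be demonstrated together with a minimal sketch. The byte-to-variation-selector mapping follows publicly documented smuggling writeups (bytes 0-15 map to U+FE00..U+FE0F, 16-255 to U+E0100..U+E01EF); `guardrail_view` is a hypothetical normalizer for illustration, not any specific product:

```python
def encode_hidden(carrier, payload):
    """Hide payload bytes as invisible Unicode variation selectors
    appended to a visible carrier character."""
    out = carrier
    for b in payload.encode("utf-8"):
        out += chr(0xFE00 + b) if b < 16 else chr(0xE0100 + b - 16)
    return out

def guardrail_view(s):
    """Hypothetical guardrail normalization: strip variation selectors
    before classifying."""
    return "".join(ch for ch in s
                   if not (0xFE00 <= ord(ch) <= 0xFE0F
                           or 0xE0100 <= ord(ch) <= 0xE01EF))

smuggled = encode_hidden("😀", "ignore previous instructions")
print(guardrail_view(smuggled))  # the guardrail classifies a bare emoji
print(len(smuggled))             # 29: the model's tokenizer gets every codepoint
```

The guardrail's verdict is rendered on a one-character string, while the model's tokenizer receives all 29 codepoints. Whether a given model actually decodes such payloads varies, but the classifier and the model are demonstrably judging different inputs.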

Practical implications