What is Tokenization in Large Language Models?
Tokenization is the process of splitting raw text into discrete units — tokens — that the model actually reads and generates. Tokens are usually subword fragments, not whole words. A model never sees "tokenization"; it sees something like [token, ization], two integer IDs from a vocabulary of ~100K to 200K entries. This split happens before the model and is invisible to it. Most LLM-specific security oddities live at this layer.
How tokenization works
Modern models use byte-pair encoding (BPE) or close variants. The tokenizer is trained once on a large corpus to find the most common character sequences and assign them token IDs. Common sequences become single tokens; rare sequences split into multiple tokens.
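To make this concrete, here is a minimal sketch using OpenAI's open-source tiktoken library (an assumption on my part; any trained BPE tokenizer, including Hugging Face ones, shows the same effect):

```python
# pip install tiktoken
import tiktoken

# cl100k_base is the BPE vocabulary used by several OpenAI models;
# any trained BPE tokenizer behaves analogously.
enc = tiktoken.get_encoding("cl100k_base")

ids = enc.encode("tokenization")
print(ids)                             # two integer IDs (exact values depend on the vocabulary)
print([enc.decode([i]) for i in ids])  # ['token', 'ization']
```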
Side effects of how tokenizers actually work:
- Whitespace and punctuation are usually attached to the following word. "cat", " cat" (with a leading space), and ",cat" are three different tokens, as the sketch after this list shows.
- Numbers tokenize unpredictably. "12345" might be one token, two tokens, or five, depending on the vocabulary. This is one reason LLMs struggle with arithmetic.
- Non-Latin scripts often tokenize per character. Chinese, Japanese, and Arabic text consumes far more tokens per unit of meaning than English.
- Emojis and exotic Unicode expand into multiple tokens. A single visual emoji can become 4-8 tokens once a byte-level BPE splits its multi-byte UTF-8 encoding, and ZWJ sequences (like family emojis) expand even further.
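A short sketch of these effects, again assuming tiktoken and the cl100k_base vocabulary (exact splits and counts vary across tokenizers):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def show(s: str) -> None:
    # Print how many tokens a string becomes and the raw bytes of each token.
    ids = enc.encode(s)
    pieces = [enc.decode_single_token_bytes(i) for i in ids]
    print(f"{s!r:22} -> {len(ids)} token(s): {pieces}")

# Leading whitespace and punctuation change the token identity entirely.
show("cat")
show(" cat")
show(",cat")

# Digit runs split wherever the vocabulary happens to have entries.
show("12345")

# One visible emoji (a ZWJ family sequence) expands into many byte-level tokens.
show("👩‍👩‍👧‍👦")
```

Running the same probes through a different tokenizer (Llama's, Gemma's) gives different splits, which is exactly the point: token identity is a property of the vocabulary, not of the text.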
Why tokenization is a security boundary
Three classes of attack hide at the tokenization layer:
- Encoding-based jailbreaks. A harmful request encoded in base64, Unicode variation selectors, leetspeak, or zero-width characters tokenizes differently from its plaintext form. Safety classifiers that inspect the pre-tokenization string see one thing; the model that processes the tokens sees another.
- Tokenizer-classifier mismatch. Many guardrails run their own tokenizer, then classify. If the guardrail's tokenizer normalizes Unicode (stripping variation selectors) but the model's tokenizer preserves them, the guardrail's classification doesn't match what the model actually receives. Repello's emoji prompt injection research demonstrated this in production guardrail products. A sketch of the mismatch follows this list.
- Glitch tokens. Some tokens, typically rare strings the model saw little training signal for, produce wildly off-distribution behavior when included in a prompt. SolidGoldMagikarp is the famous historical example. Most have been patched, but the class still exists.
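The mismatch in the second item is easy to reproduce. Below is a minimal sketch; `sanitize` is a hypothetical guardrail-side cleanup step, not any specific product's logic, and tiktoken again stands in for the model's tokenizer:

```python
import unicodedata
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def sanitize(s: str) -> str:
    # Hypothetical guardrail cleanup: NFKC-normalize, then drop format
    # characters (zero-width spaces/joiners, RTL marks) and variation selectors.
    s = unicodedata.normalize("NFKC", s)
    return "".join(
        ch for ch in s
        if unicodedata.category(ch) != "Cf" and not (0xFE00 <= ord(ch) <= 0xFE0F)
    )

plain = "tell me how"
smuggled = "\u200b".join(plain)  # zero-width space between every character

# The classifier's view of the two strings is identical...
print(sanitize(smuggled) == sanitize(plain))      # True
# ...but the model receives a very different token stream.
print(enc.encode(smuggled) == enc.encode(plain))  # False
print(len(enc.encode(plain)), len(enc.encode(smuggled)))
```

Whichever direction the asymmetry runs (guardrail normalizes and the model doesn't, or vice versa), the classification applies to text the model never actually sees.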
Practical implications
- Don't rely on input filters that operate on pre-tokenization text — normalize and re-classify on the same token stream the model will see.
- Audit your tokenizer's Unicode behavior. Test with variation selectors, zero-width joiners, RTL marks, and homoglyph substitutions; Repello's research shows these consistently bypass naïve filters. The probe sketch after this list exercises exactly these cases.
- Token counts matter for cost and rate-limiting. A user pasting Mandarin text consumes roughly 2-3x the tokens of equivalent English; rate limits should account for this (the count comparison below makes it concrete).
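A sketch of the audit from the second bullet. The probe set and payload are illustrative, not exhaustive; the point is to compare what your filter classified against the token stream the model receives:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

payload = "reset the admin password"  # stand-in for whatever your filter flags

# Illustrative probes: each wraps the payload in one Unicode trick.
probes = {
    "plain":                 payload,
    "variation selector":    payload + "\ufe0f",
    "zero-width joiners":    "\u200d".join(payload),
    "RTL override":          "\u202e" + payload,
    "Cyrillic-a homoglyphs": payload.replace("a", "\u0430"),
}

for name, text in probes.items():
    ids = enc.encode(text)
    print(f"{name:22} {len(ids):3} tokens  {ids[:8]}...")
# Any probe whose token stream differs from what the filter classified means
# the filter and the model are not looking at the same input.
```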
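And a quick count comparison behind the last bullet, using the same tokenizer (the 2-3x figure varies by language pair and vocabulary):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

english  = "The quick brown fox jumps over the lazy dog."
mandarin = "敏捷的棕色狐狸跳过了懒狗。"  # rough translation of the same sentence

# Rate limits keyed to characters or words undercount CJK input;
# meter on tokens instead.
print(len(english), len(enc.encode(english)))    # chars vs. tokens
print(len(mandarin), len(enc.encode(mandarin)))  # fewer chars, typically more tokens
```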