What is an LLM Context Window?
The context window is the maximum number of tokens a language model can attend to in a single forward pass. It bounds everything the model can "see" at once: the system prompt, the conversation history, retrieved documents, tool definitions, tool responses, and the response the model is generating. Bigger context window = the model can reason over more information at once. Smaller context window = sharper attention, less compute per token.
Token counts vs. characters or words
Tokens are the units the model actually processes. They roughly correspond to:
- English text — 1 token ≈ 4 characters ≈ 0.75 words. A 1,000-token chunk is ~750 English words.
- Code — slightly more tokens per character (variable names, indentation eat tokens)
- Non-Latin scripts — Chinese, Japanese, Arabic typically use more tokens per character with most tokenizers, fewer with multilingual ones
A modern model with a 200K-token context window can hold roughly a 500-page book at once.
Current model context windows (2026)
| Model family | Context window |
|---|---|
| Claude Sonnet 4.6, Opus 4.6 | 200K (1M for 4.5 in special modes) |
| GPT-5.2 | 256K |
| Gemini 2.5 | 1M (2M experimental) |
| Open-source (Llama 3.1, Mistral) | 128K typical |
Numbers update fast. The trend is up.
Why context windows matter for security
Three security implications:
-
Larger windows = larger attack surface. Indirect prompt injection scales with how much retrieved content the model reads. A 1M-token window means a 1M-token attack surface per turn.
-
Context-stuffing attacks. Pad the context with hundreds of fake assistant turns showing harmful answers ("many-shot jailbreaking"), then ask the real question. Larger windows make this attack more practical.
-
Lost-in-the-middle and edge attacks. Models attend non-uniformly to context — typically strongest at the very start and very end, weakest in the middle. Attackers can place injection payloads in the high-attention zones (top of system prompt, end of latest tool response) for maximum effect.
Practical limits
The advertised window is the maximum. In practice:
- Latency scales with context — large contexts are slow, especially at first-token latency
- Cost scales with context — both input and output tokens are billed; a 1M-token request costs accordingly
- Quality degrades at the edges — most models perform best on the first ~50K tokens and last ~10K tokens of their nominal window
- The output budget shares the window — if your input is 195K and the window is 200K, you have 5K tokens for the response
For RAG pipelines, the right number of retrieved chunks is rarely "as many as fit" — it's "the smallest number that contains the answer," because relevance density beats raw token count.