What is Model Hijacking?
Model hijacking is an attack where an adversary repurposes a deployed AI model to perform tasks the model owner did not authorize, without retraining or modifying the model. Where prompt injection focuses on the technique of smuggling instructions into the model, hijacking focuses on the outcome: the deployed model becomes unauthorized compute for the attacker, often at the owner's expense.
How model hijacking works
A deployed model has an intended task — answer customer questions about a SaaS product, summarize support tickets, translate text, etc. The model owner pays for inference, defines a system prompt, and exposes the model through a public or semi-public interface (chatbot widget, API, embedded assistant).
A model hijack happens when an attacker uses prompt injection, jailbreak techniques, or context-window exploits to make that model perform a different task entirely. The model is unchanged; the application's framing of it is bypassed; the model now serves the attacker's goal.
The attacker's gain is twofold:
- Free capability access — they get GPT-4-class or Claude-class capability on the victim's bill
- Privacy laundering — their queries flow through the victim's infrastructure, evading the model provider's per-account abuse detection
Documented hijack patterns
- The free GPT-via-customer-support chatbot. Public-facing chatbots fronting a foundation model get repurposed for general queries. r/ChatGPTJailbreak and similar communities have catalogued dozens of production chatbots reduced to free-tier alternatives.
- The code-writing translation API. Any text-to-text API can be coerced into general-purpose generation if its system prompt is bypassable.
- The RAG bot turned data exfil channel. Support assistants with retrieval access become tools for the attacker to extract internal documents the bot has access to but its public users shouldn't see.
- The jailbreak-as-a-service marketplace. Some attackers chain hijacked models into proxy services and resell access — the original owner sees a sustained spike in inference costs from a single IP range.
Why it's a costly attack
Unlike traditional API abuse, hijacked-model traffic looks legitimate at the network and request layer. Each request is a well-formed prompt to a real endpoint. Detection requires looking at the content of the conversation, not the request envelope.
Inference costs for foundation models are non-trivial — a sustained hijack against a high-traffic chatbot can run thousands of dollars per day in unauthorized inference fees, plus reputation damage when the misuse becomes public.
Defending against model hijacking
- Narrow system prompts. A chatbot defined as "customer support agent for X SaaS product, refuse anything off-topic" hijacks much harder than one defined as "helpful assistant."
- Output-domain validation. Runtime guardrails should detect when responses drift outside the deployment's intended subject matter and either block or escalate.
- Per-user abuse detection. High token consumption, repetitive jailbreak patterns, or sudden topic drift in a single session are reliable signals.
- Continuous adversarial testing. The system prompt that resists today's jailbreaks won't resist next month's; treat hijack-resistance as a regression-tested property.