Back to all blogs

|
|
12 min read


TL;DR
Every capability jump in AI models is also a capability jump for attackers using those models as infrastructure or targeting applications built on them.
GPT-5.4's improved long-context handling transforms prompt injections buried deep in large documents from probabilistic bets into consistent, dependable attacks.
At 75% task completion on OSWorld, computer use agents are now reliable enough that goal hijacking via adversarial UI elements actually finishes what it starts.
The attack surface has not expanded. The attack success rate has.
Security capability announcements tend to focus on what defenders gain. Better reasoning, faster inference, longer context, more reliable task execution. These are real improvements, and they matter for the teams building AI-powered products.
They also matter for everyone trying to break them.
The arrival of GPT-5.4 comes with a set of benchmarks that security teams should read carefully: not as a product announcement, but as an updated threat model. When a frontier lab publishes that a model achieves 75% on OSWorld and handles context windows approaching 1M tokens with significantly reduced degradation, two things happen simultaneously. Enterprise AI applications get more capable. And the attacks that have been semi-reliable become highly reliable.
This post covers two specific ways GPT-5.4's reliability improvements shift the attacker's calculus, and what security teams operating agentic AI in production need to account for today.
Why attacker success rates were capped by model unreliability
A useful but underappreciated fact about prompt injection attacks against earlier-generation LLMs: low model reliability was an accidental safety net.
Consider indirect prompt injection, where adversarial instructions are embedded in content the model processes rather than submitted directly by a user. This attack class depends on the model reliably reading and acting on injected instructions. If the model is inconsistent about what it attends to, which instructions it follows, and whether it completes multi-step tasks without drift, the attacker's success rate fluctuates. A malicious instruction injected into a retrieved document might succeed on a small fraction of attempts against one model and on a significantly larger fraction against another: the variance across model generations is wide, and the attacker cannot count on consistent execution.
That inconsistency was not a designed defense. It was a side effect of model limitations. Security teams did not intentionally rely on it, but it was there, absorbing some fraction of real-world attack attempts and preventing them from landing consistently.
Model capability improvements do not come with a carve-out for attacker use cases. When a new model generation improves at following complex multi-step instructions and maintains coherent task execution across longer contexts, it improves for both legitimate workflows and adversarial ones. GPT-5.4 represents a generation where several of those accidental safety nets disappear at once.
1M token context and the sleeper injection problem
Research published by Liu et al. at Stanford documented what became known as the "lost in the middle" problem: language models show degraded performance on information located in the middle of long contexts, with strong primacy and recency effects. Information in the first and last portions of a context window gets attended to reliably; information buried in the middle does not.
For attackers attempting to embed adversarial instructions in large documents, this degradation was a practical constraint. If you inject a malicious instruction into chunk 400 of a 600-chunk document, the model might not act on it. The attack becomes probabilistic in a way that limits its operational value.
RAG pipelines, codebase-wide agents, and long meeting transcript processors now routinely handle contexts at the 500k-1M token range. At these scales, an attacker embedding instructions in a document does not necessarily control where in the retrieved context their payload will land. The "lost in the middle" degradation meant that only injections fortunate enough to land near context boundaries would execute reliably.
GPT-5.4's improved long-context handling removes that constraint. An instruction injected in chunk 847 of a 1000-chunk document gets acted on with the same reliability as one placed at the start. What was a probabilistic attack with inconsistent execution becomes a dependable one.
The implication for RAG-based deployments is direct. Any pipeline that ingests external documents, retrieves from a broad knowledge base, or processes long user-uploaded files now has a larger consistent injection surface than it did with prior model generations. As documented in Repello's research on security threats in agentic AI browsers, the attack surface in these systems extends across every source of content the agent reads, not just what users directly submit.
Computer use agents, adversarial UI, and the 75% completion threshold
The second improvement in GPT-5.4 that security teams should flag is its benchmark performance on computer use tasks. The OSWorld benchmark, developed by researchers at Carnegie Mellon and the University of Hong Kong, evaluates multimodal agents on open-ended computer tasks across real desktop environments. Prior top-performing models scored in the 40-50% range. GPT-5.4's reported performance approaches 75%.
That 25-percentage-point improvement matters for a specific attack class: goal hijacking via adversarial UI elements.
The technique is documented in MITRE ATLAS: an adversary crafts UI elements with misleading accessibility labels, invisible overlays positioned over legitimate controls, or tooltips carrying secondary instructions. When a vision-capable agent processes the UI as part of completing a task, it interprets these adversarial elements as legitimate application instructions and executes them alongside the original task. The attack has been demonstrated as a viable injection vector against computer use agents in controlled research environments.
The constraint, until now, was execution reliability. If a computer use agent only completes 40-50% of assigned tasks successfully anyway, the goal hijacking attack has a ceiling. The agent is as likely to fail on the legitimate task as on the hijacked one. The attack lands, but the outcome is unpredictable. Defenders get noise rather than a clean adversarial action.
At 75% task completion, that ceiling rises substantially. An adversarially crafted UI element that redirects the agent's goal now has a materially higher probability of producing a completed outcome. The attack vector itself is unchanged. What changed is that the agent is now reliable enough to finish what the malicious UI told it to do.
This is particularly relevant for organizations deploying agents with access to internal systems, file storage, or communication tools. The OWASP Agentic AI Top 10 identifies excessive agency and goal hijacking as top-tier risks for agentic deployments, and the OWASP Top 10 for Large Language Model Applications classifies prompt injection as the primary LLM risk; GPT-5.4's reliability improvements elevate the practical exploitability of both.
The stealth dimension: why reliability changes detection
A consistent attack is harder to iterate against than an inconsistent one.
When an injection attempt fires erratically, the anomalous behavior it produces is also erratic. A security team monitoring agent behavior for outliers might catch an unusual action on one run, fail to observe it on the next, and struggle to establish a reliable detection signature. The noise floor is high because the attacker's output is noisy.
When the same attack fires consistently, the attacker can iterate on their payload in a controlled way, observe reliable results, and optimize toward stealth before deploying at scale. A prompt injection that fires 80% of the time can be refined toward harder-to-detect variants. One that fires 20% of the time mostly generates inconclusive data for the attacker.
"Higher model reliability is a double-edged benchmark," notes the Repello AI Research Team. "Security teams should not read capability improvements as purely defensive progress. Every reliability gain that helps an agent complete legitimate tasks more consistently also helps an attacker predict and refine what a hijacked agent will do."
The practical consequence is that detection-based defenses calibrated to prior-generation attack inconsistency will underperform against GPT-5.4-class payloads. Behavioral baselines need to be re-established, and anomaly thresholds need to account for more consistent adversarial action patterns, not just the noisy, erratic signals earlier models produced.
What security teams running agentic AI need to account for now
The practical takeaway is not to avoid deploying capable models. It is to calibrate threat models against the actual capability of the models being deployed, rather than against prior-generation performance assumptions.
For long-context deployments, this means treating every source of retrieved or ingested content as a potential injection vector, not just user-submitted inputs. Document stores, email ingestion pipelines, and meeting transcript processors all feed content into the model's context. The reliable long-context handling that makes the agent useful also makes injections in that content more consistent.
For computer use agents, this means accepting that adversarial UI elements are now an operationally viable attack surface in a way they were not 18 months ago. Any agent with vision access to a desktop or web environment needs explicit controls on what actions it can take unilaterally versus what requires human verification. The NIST AI Risk Management Framework (AI 600-1) provides a governance baseline for categorizing and mitigating these agentic risks at the organizational level.
Repello's ARGUS runtime security layer addresses both vectors at the inference layer: monitoring agent inputs and outputs for behavioral anomalies, enforcing action constraints before execution, and flagging deviations from expected task patterns. For teams that want to assess their current exposure, ARTEMIS runs continuous adversarial testing across long-context injection and agentic attack chains, using the same techniques attackers are now able to execute more reliably.
Conclusion
GPT-5.4's reliability improvements are genuinely valuable for the enterprise use cases they enable. They are also, without modification, valuable for attackers targeting applications built on those models. Long-context injection and computer use goal hijacking are not new attack classes. What is new is that both now execute reliably rather than probabilistically, and that reliability enables the attacker iteration cycle that produces stealthier, harder-to-detect payloads.
Security teams should update their threat models before attacker tooling catches up to the capability curve. The window between a model capability announcement and that capability being integrated into offensive infrastructure has historically been short.
Ready to assess your agentic AI deployment against these threat vectors? Visit repello.ai/get-a-demo to talk to the team.
Frequently asked questions
What is the "lost in the middle" problem and why does it matter for AI security?
The "lost in the middle" problem refers to documented degradation in LLM performance on information located in the middle of long context windows, first described by Stanford researchers. For security teams, this degradation acted as an accidental constraint on injection attacks: instructions buried deep in large documents were less likely to be acted on reliably. Improved long-context handling in GPT-5.4-class models removes that constraint, making injections placed anywhere in a large context more consistent.
What is goal hijacking in computer use agents?
Goal hijacking is an attack where adversarially crafted UI elements redirect a computer use agent's task execution toward attacker-defined outcomes. The technique uses misleading accessibility labels, invisible overlays, or tooltips carrying secondary instructions. As computer use agents improve in task completion reliability, goal hijacking becomes more operationally viable because the hijacked goal now has a higher probability of completing rather than dying mid-execution.
Does deploying a more capable model automatically increase my security risk?
Not automatically, but it requires updating your threat model. The core issue is that security controls calibrated to prior-generation model behavior do not account for the higher execution consistency of attacks that depend on model reliability. The right response is to re-baseline your behavioral monitoring and red teaming regime against what the current model can actually do.
How does long-context injection differ from standard prompt injection?
Standard prompt injection involves adversarial instructions submitted directly through the user interface or embedded in immediately retrieved content. Long-context injection is a variant where instructions are placed deep within a large document or retrieval chunk, relying on the model's ability to attend to content across the full context window. Prior models had documented degradation in the middle of long contexts; GPT-5.4-class reliability removes that constraint.
What is OSWorld and why is the 75% benchmark relevant to security?
OSWorld is an open benchmark for evaluating multimodal agents on real computer tasks in desktop environments. Prior leading models scored in the 40-50% range on the benchmark. Scores approaching 75% represent a material improvement in task completion reliability, which directly raises the operational viability of attack techniques that depend on the agent reliably completing redirected or hijacked tasks.
How should security teams adjust runtime controls for GPT-5.4-class agents?
Two areas need immediate review. First, all pipelines that ingest external documents or retrieve from broad knowledge bases should be treated as injection surfaces, with content inspection at the retrieval layer rather than only at the user input layer. Second, computer use agents should have explicit action constraints requiring human verification for high-impact or irreversible operations, rather than allowing fully autonomous execution of all inferred tasks.
Share this blog
Subscribe to our newsletter











