Claude Code Red Teaming: What a Coding Agent Can't Do

TL;DR: Pointing Claude Code, Cursor, or Codex CLI at your AI application is not AI red teaming. It is an orchestrator running without a harness. Four things are missing: the refusal-free authorized context a working payload needs (the model is RLHF-trained to sandbag dual-use requests), a persistent library of what worked against similar targets, a separate verifier that confirms an attack actually succeeded, and a coverage engine that chains across turns, surfaces, and modalities. Frontier models are excellent orchestrators, and a purpose-built harness may even orchestrate them under the hood. The harness is what turns the intern you supervise into the employee who does the job.

The setup that feels like red teaming

A security engineer opens Claude Code, running on the latest Opus 4.8, points it at the staging URL of the company's new support agent, and types: "Red team this. Find prompt injections, jailbreaks, data leaks, anything." The agent spins up, reads the page, fires off some probes, writes a tidy markdown report with a few findings, and declares the surface "mostly hardened." It feels like a red team. It produced an artifact that looks like a red-team report.

It is not a red team. It is a single capable model running one pass with no scaffolding around it, and the gap between that and an autonomous red teamer is the difference between an intern's afternoon and a function that delivers assurance. The model is doing the thing red teamers do at the keyboard (read, hypothesize, probe) while missing everything that makes the activity rigorous: the freedom to produce working attacks, a memory of what has worked before, a way to confirm that an attack landed, and the coverage to test more than the one surface you pointed it at.

The taxonomy matters here, because the words get used loosely. We have written separately on the difference between adversarial testing, red teaming, and penetration testing: bounded probing of one model, goal-directed attack against a deployed system, and the contractual wrapper around both. A coding agent pointed at an app does a thin slice of the first. Calling it the second is a category error.

The orchestrator-without-a-harness problem

A frontier model is the orchestrator. It plans, it reasons across steps, it writes code, it calls tools, it decides what to try next. That capability is real and it is the engine of every modern autonomous system. But an engine is not a vehicle. A red-team harness is the chassis, the transmission, and the instrumentation that turns the engine into something that gets you to a destination repeatably.

The harness is four things the bare model does not have: an authorized context where it can generate working exploits without refusing, a persistent attack memory that accumulates across targets, a verifier loop that confirms success, and a coverage engine that chains attacks across surfaces and modalities. The rest of this post takes them one at a time. Each one, on its own, is the difference between "looks like a finding" and "is a proven exploit."

Figure: Capability matrix comparing a frontier model pointed at your app against a purpose-built harness (ARTEMIS) across the five things red teaming requires: refusal-free working payloads in an authorized context, persistent cross-target attack memory, a separate verifier loop, multi-stage chaining plus multimodal and agentic coverage, and autonomy. The frontier-model column is empty on all five; the harness column has all five. Caption: Zero of five versus five of five. The frontier model is the orchestrator, and the harness is everything the bare model lacks.

The safety-classifier ceiling

The first and least obvious limit is that a frontier model is trained, correctly, not to be a good attacker. Models like Claude are aligned through RLHF and Constitutional AI, and wrapped in layered safety classifiers that inspect the prompt, the completion, and the full conversation. When you ask a general assistant to produce a working jailbreak chain, a live prompt-injection exploit, or a data-exfiltration sequence, those systems do their job: the model sandbags, softens, or refuses.

This is the correct behavior for a general assistant. The same capability that red teams an authorized target also harms an unauthorized one, and the model has no reliable way to tell your staging environment apart from someone else's production. Anthropic's own reflections on its Responsible Scaling Policy describe a defense-in-depth posture built precisely to refuse dual-use requests at multiple stages. From the model's seat, "generate a working exfiltration payload for this agent" is indistinguishable from misuse.

For a red teamer, that instinct is exactly backwards. A red teamer needs the working payload, the one that actually fires, not the sanitized illustration of one. A purpose-built harness operates inside a controlled, authorized, scoped red-team context where producing the real attack is the point, governed by rules of engagement rather than by a general-purpose refusal policy. The bare coding agent gives you the version of the attack that is safe to hand a stranger. That is the wrong product.

You can feel this ceiling the moment you push past reconnaissance. Ask a coding agent to describe prompt injection and it is fluent. Ask it to hand you the live multi-step payload that defeats the specific guardrail in front of it, instrumented to confirm the leak, and it gets cautious, partial, or apologetic. The fluency was never the bottleneck. The willingness to produce the working thing was.

No persistent attack memory

A coding agent starts every session from zero. Whatever it learned breaking your agent on Tuesday is gone on Wednesday, beyond whatever you manually paste back into context. There is no accumulating asset.

A real red-team harness keeps an evolving payload library. When an attack works against one customer-support agent, the harness records the chain, the conditions under which it fired, and the guardrail it defeated, then reuses and mutates it against the next target with a similar shape. Attacks that worked against one RAG pipeline become the starting hypotheses against the next one. The library compounds. This is the same property that makes human red teams valuable over time: institutional memory of what breaks.
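As a rough illustration of what such a library might store, here is a minimal sketch. The class names, fields, and the overlap-based ranking are all assumptions made for this example, not a description of how ARTEMIS or any other harness is built. Steps are stored as placeholder labels rather than live payloads, and the mutation step is left out.

```python
from dataclasses import dataclass, field


@dataclass
class AttackRecord:
    """One confirmed attack chain plus the context it worked in (hypothetical schema)."""
    chain: list[str]            # ordered step labels, not live payloads
    target_shape: set[str]      # traits of the target, e.g. {"rag", "support-agent"}
    guardrail_defeated: str     # which defense the chain got past
    confirmed_hits: int = 1     # how many times a verifier confirmed it


@dataclass
class PayloadLibrary:
    records: list[AttackRecord] = field(default_factory=list)

    def record(self, rec: AttackRecord) -> None:
        """Persist a confirmed attack so later engagements can start from it."""
        self.records.append(rec)

    def candidates_for(self, target_shape: set[str]) -> list[AttackRecord]:
        """Return prior attacks that share traits with the new target, most similar first."""
        scored = [
            (len(rec.target_shape & target_shape), rec.confirmed_hits, i, rec)
            for i, rec in enumerate(self.records)
        ]
        # Drop records with no trait overlap, then rank by overlap and confirmed hits.
        ranked = sorted(
            (s for s in scored if s[0] > 0),
            key=lambda s: (-s[0], -s[1], s[2]),
        )
        return [s[3] for s in ranked]


# Usage: an attack that worked on one RAG-backed agent seeds the next RAG target.
library = PayloadLibrary()
library.record(AttackRecord(["step-a", "step-b"], {"rag", "support-agent"}, "input-filter"))
library.record(AttackRecord(["step-c"], {"voice"}, "output-filter"))
starting_points = library.candidates_for({"rag", "code-assistant"})
```

The point of the sketch is the persistence, not the ranking: a coding-agent session has no equivalent of `library` that survives to the next run.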

The compounding matters because the attack surface is not static. The systematic analysis in arXiv:2601.17548, which synthesized 78 studies and cataloged 42 distinct attack techniques across skills, tools, and protocol ecosystems, found that attack success rates against state-of-the-art defenses exceed 85 percent once adaptive strategies are used. Adaptive is the operative word. The attacks that win are the ones that evolve against the defense, and evolution needs memory of the prior generation. A stateless agent cannot run an adaptive campaign because it has nothing to adapt from.

No verifier loop

This is the quiet failure that produces false confidence. A coding agent pointed at your app generates an attack, observes a response, makes a judgment call about whether it worked, and moves on. That judgment comes from the same model that generated the attack, grading its own homework in the same context, with no independent confirmation.

A harness separates the attacker from the verifier. The verifier is a distinct component whose only job is to answer a binary question: did this attack achieve its objective? Did the payload actually exfiltrate the data, or did the model merely produce text that looks like a leak? Did the jailbreak hold across the turn, or did a downstream filter catch it? Did the tool call execute, or was it described and then blocked? That signal feeds back into the campaign: confirmed successes get promoted into the payload library, near-misses get mutated, and false positives get discarded instead of shipped in a report.
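One way to picture that separation is a verifier that never reads the attacker's own assessment and judges only from observable evidence. The sketch below assumes a planted canary string and a monitored external sink. Both are hypothetical conventions chosen for this example, as are all the names in it.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    CONFIRMED = "confirmed"   # promote into the payload library
    NEAR_MISS = "near_miss"   # worth mutating and retrying
    DISCARD = "discard"       # includes false positives: claimed wins with no evidence


@dataclass
class Attempt:
    attack_id: str
    attacker_claims_success: bool  # the attacker's opinion; deliberately ignored below
    response_text: str             # what the target replied
    sink_contents: str             # what actually reached the monitored external sink


def verify(attempt: Attempt, canary: str) -> Verdict:
    """Judge an exfiltration attempt from evidence only (hypothetical rule set)."""
    if canary in attempt.sink_contents:
        return Verdict.CONFIRMED
    if canary in attempt.response_text:
        # The secret surfaced in the reply but never left the system.
        return Verdict.NEAR_MISS
    return Verdict.DISCARD


def triage(attempts: list[Attempt], canary: str) -> dict[Verdict, list[str]]:
    """Sort attempts into the three buckets that feed back into the campaign."""
    buckets: dict[Verdict, list[str]] = {v: [] for v in Verdict}
    for attempt in attempts:
        buckets[verify(attempt, canary)].append(attempt.attack_id)
    return buckets


# Usage: all three attempts claim success, but only one is confirmed.
CANARY = "CANARY-7f3a"  # hypothetical marker planted in test data
attempts = [
    Attempt("a1", True, "Here is the record: CANARY-7f3a", "CANARY-7f3a"),
    Attempt("a2", True, "Here is the record: CANARY-7f3a", ""),
    Attempt("a3", True, "Sure, I leaked everything!", ""),
]
results = triage(attempts, CANARY)
```

Attempt `a3` is the case a self-grading agent tends to report: text that looks like a leak with nothing behind it.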

Without the verifier, you get two failure modes, and both are expensive. The agent reports an exploit that did not actually land, and an engineer burns a day chasing a phantom. Or the agent declares a surface clean because it could not confirm a success it never properly tested, and a real exploit ships to production. The same analysis that found 85-percent adaptive success rates also found that, of 18 defense mechanisms reviewed, most achieved under 50 percent mitigation against adaptive attacks. A red-team tool that cannot verify its own results cannot tell you which side of that line your application is on.

No multi-stage chaining or coverage

Real attacks are not single prompts. They chain. A document with hidden instructions gets retrieved into context, the model summarizes it into a draft action, a tool call fires on the poisoned summary, and the result lands in an external system. Each step looks benign in isolation. The exploit lives in the sequence. A coding agent firing the prompts you handed it, against the one surface you pointed it at, never traces that path because the path is goal-directed and multi-stage by nature.
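To make "the exploit lives in the sequence" concrete, the toy sketch below follows a harmless tracer string through a simulated four-stage pipeline and records whether it survives each hop. The stages are stand-in lambdas, not real retrieval, summarization, or tool calls, and every name here is an assumption for illustration.

```python
from typing import Callable

Stage = tuple[str, Callable[[str], str]]


def trace_marker(stages: list[Stage], initial: str, marker: str) -> list[tuple[str, bool]]:
    """Run the stages in order and note, after each one, whether the marker is still present."""
    data = initial
    trail = []
    for name, fn in stages:
        data = fn(data)
        trail.append((name, marker in data))
    return trail


MARKER = "TRACE-0001"  # benign tracer standing in for injected content

# Toy pipeline mirroring the path described above (all stages are simulated).
toy_pipeline: list[Stage] = [
    ("retrieve", lambda q: f"{q}\n[doc] quarterly notes {MARKER}"),
    ("summarize", lambda ctx: "summary: " + ctx.splitlines()[-1]),
    ("tool_call", lambda s: f"send_email(body={s!r})"),
    ("external_system", lambda call: f"delivered: {call}"),
]

trail = trace_marker(toy_pipeline, "what changed last quarter?", MARKER)
reached_sink = trail[-1][1]  # True only if the marker survived every hop
```

A single-prompt probe only ever observes the first hop. Whether content planted in a document ends up in an outbound action is a property of the whole trail.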

Coverage is the other half. The OWASP LLM Top 10 and MITRE ATLAS exist precisely because the attack surface of a modern AI application is wide and structured: direct and indirect prompt injection, sensitive-information disclosure, tool-call and MCP boundaries, RAG-pipeline poisoning, agentic action surfaces, and multimodal vectors across text, image, document, and voice. The same coding-agent runtime is itself one of those surfaces. We have documented how the permission and tool-execution layer of agents like Claude Code becomes an attack surface in its own right, and how indirect prompt injection turns retrieved content into instructions. A harness enumerates these surfaces and runs the full battery; a one-shot agent tests the slice in front of it.

This is where the chaining and the memory and the verifier compound into something a single pass cannot reach. A coverage engine knows the surface map, the payload library supplies candidate attacks per surface, the orchestrator chains them across turns, and the verifier confirms which chains held. Remove any one of those and you are back to an intern with a good prompt.
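The way those parts fit together can be sketched as a single loop. This is a simplified illustration under stated assumptions: the surface names, the dictionary-based library, and the `execute` and `verify` callables are hypothetical stand-ins, and cross-turn chaining is omitted for brevity.

```python
from typing import Callable

# Hypothetical surface map; a real one would follow OWASP LLM Top 10 / MITRE ATLAS categories.
SURFACES = ["direct_injection", "indirect_injection", "tool_boundary", "rag_pipeline"]


def run_campaign(
    surfaces: list[str],
    library: dict[str, list[str]],
    execute: Callable[[str, str], str],
    verify: Callable[[str], bool],
) -> tuple[dict[str, list[str]], list[str]]:
    """Try every library candidate on every surface and keep only verifier-confirmed ones.

    Returns (confirmed attacks per tested surface, surfaces that were never tested).
    A surface with no candidates is reported as untested, never as clean.
    """
    confirmed: dict[str, list[str]] = {}
    untested: list[str] = []
    for surface in surfaces:
        candidates = library.get(surface, [])
        if not candidates:
            untested.append(surface)
            continue
        confirmed[surface] = [c for c in candidates if verify(execute(surface, c))]
    return confirmed, untested


# Usage with stub components (placeholder probe labels, no real payloads).
library = {"direct_injection": ["probe-1", "probe-2"], "rag_pipeline": ["probe-3"]}


def execute(surface: str, probe: str) -> str:
    outcome = "hit" if probe == "probe-2" else "blocked"
    return f"{surface}:{probe}:{outcome}"


def verify(evidence: str) -> bool:
    return evidence.endswith(":hit")


confirmed, untested = run_campaign(SURFACES, library, execute, verify)
```

Keeping "untested" separate from "tested, nothing confirmed" is the design choice that matters here: it stops a report from implying a surface is clean when it was never exercised.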

The honest concession: models are excellent orchestrators

None of this means frontier models are bad at the job. They are the best orchestrators we have ever had, and that is exactly why this argument is not "models are weak." A purpose-built red-team harness like ARTEMIS may orchestrate frontier models under the hood, using their planning and reasoning as the engine while supplying everything the bare model lacks.

The distinction is component versus system. The model is one component: the reasoning core that decides what to try next. The harness is the system around it: the authorized refusal-free context that lets it generate working payloads, the persistent memory that accumulates an evolving attack library across targets, the independent verifier that confirms which attacks actually succeeded, and the coverage engine that chains across every surface and modality the application exposes. Put those together and the component becomes an autonomous red teamer. Leave them out and the component is a very capable intern you have to supervise on every move.

That is the whole argument in one line: you can hire the intern, or you can hire the employee. A coding agent pointed at your app is the intern, and a good one. It will surface ideas, reproduce a known issue, and sketch an attack. But it will refuse to weaponize, it will forget what it learned, it will grade its own work, and it will test only what you remembered to point it at. For reconnaissance and triage, that is genuinely useful. For assurance, you need the harness.

The buyer's question, then, is not "is this model good." It obviously is. The question is whether the thing you are evaluating is a model or a red teamer. We lay out how to tell them apart, and what a real autonomous red teamer has to prove, in our companion piece on how to evaluate a real autonomous red teamer.

Where Repello fits

Repello builds ARTEMIS, the purpose-built autonomous red teamer that supplies the harness around the orchestrator. It operates in an authorized, scoped red-team context so it generates working attacks rather than sanitized illustrations of them. It keeps an evolving payload library that compounds across engagements. It runs a separate verifier that confirms which attacks actually succeeded before anything reaches a report. And it covers the full surface map (prompt injection, tool and MCP boundaries, RAG poisoning, agentic actions, and multimodal vectors) the way OWASP and ATLAS lay it out, then re-runs the library every time the application changes so coverage is continuous rather than a one-time snapshot.

If you have been pointing a coding agent at your AI application and calling it red teaming, the fastest way to see the gap is to run both against the same target and compare what each one can actually prove. Book a demo and we will show you the difference between an orchestrator and a harness on your own surface.

FAQ

Can I use Claude Code to red team my own AI application?

You can use it to explore, and it is genuinely good at that. What you cannot do is treat it as an autonomous red teamer. A frontier coding agent is an orchestrator with no harness around it: it has no refusal-free authorized context for generating working exploits, no persistent library of attacks that worked against similar targets, no separate verifier that confirms whether an attack actually succeeded, and no coverage engine that chains across turns, surfaces, and modalities. It tests what you prompt it to test, once, and then forgets. That is a useful intern, not a red team.

Why does a frontier model refuse to generate a working jailbreak or exploit?

Because the same capability is dual-use. Models like Claude are trained with RLHF, Constitutional AI, and layered safety classifiers that inspect the prompt, the completion, and the whole conversation. When you ask a general assistant to produce a working prompt-injection chain or a data-exfiltration sequence, it sandbags, softens, or refuses, because it cannot tell your authorized red-team context apart from an attacker's. That is correct behavior for a general assistant and exactly the wrong instinct for a red teamer, which needs to operate inside a controlled, authorized context without those refusals.

Isn't a frontier model good enough if I just give it a strong red-team prompt?

A strong prompt improves a single run. It does not add the four things a harness provides: an authorized refusal-free context for working payloads, persistent cross-target attack memory, a verifier that closes the loop on success, and a coverage engine across surfaces and modalities. Prompt engineering operates inside one session against one surface. A harness operates across sessions, targets, and surfaces, and gets better as it accumulates evidence.

Are frontier models bad at red teaming, then?

No. Frontier models are excellent orchestrators, which is why a purpose-built harness like ARTEMIS may orchestrate them under the hood. The model is one component. The harness is the persistent memory with its evolving payload library, the verifier, the coverage engine, and the freedom from refusal that an authorized red-team context provides; together they turn that component into an autonomous red teamer. The point is not that models are weak. The point is that a model alone is the intern you supervise, not the employee who does the job.

What attack surfaces does a coding agent pointed at your app miss?

The ones that only appear when an attack chains. Multi-turn jailbreaks that build state across a conversation, indirect prompt injection through documents and retrieved content, tool-call and MCP boundaries where the agent acts on untrusted data, RAG-pipeline poisoning, and multimodal vectors in images, documents, and voice. A coding assistant fires the prompts you hand it against the surface you point it at. It does not enumerate the agentic and retrieval surfaces the way MITRE ATLAS and the OWASP LLM Top 10 lay them out.

When is it fine to point Claude Code at an app, and when do I need a real harness?

Pointing a coding agent at an app is fine for reconnaissance, for reproducing a known issue, for sketching an attack idea, and for triaging a single finding. You need a purpose-built harness the moment the question is assurance: full coverage across surfaces and modalities, working payloads under an authorized context, a verifier that proves which attacks held, and a regression layer that re-runs the evolving payload library every time the application changes.