Adversarial Testing vs Red Teaming vs Pentesting (AI)

Q: What is the difference between adversarial testing and red teaming for AI?

Adversarial testing is bounded: you feed a fixed battery of malicious or unexpected inputs to one model and measure where it misbehaves. Red teaming is goal-directed: a tester (or a system simulating one) chases a real attacker objective across the full deployment, which usually means the model plus its prompt orchestration, retrieval layer, tools, and downstream actions. The first answers can this model be tricked. The second answers can this deployment be compromised.

TL;DR: If you have heard your CISO use "adversarial testing", "red teaming", and "penetration testing" as if they were the same thing, you have heard them ask for three different activities. Adversarial testing is bounded probing of one model with a fixed input battery. Red teaming is goal-directed attack against a deployed system with tools, data, and consequences. Penetration testing is the contractual wrapper that usually contains both, plus traditional application security. Pick the term that matches what is actually bounded, goal-directed, or contractual, and stop trying to standardize a single canonical definition.

The one-sentence definitions

Three terms, three activities. One sentence each for the definition. The rest of the post defends them.

AI adversarial testing is the practice of systematically feeding malicious, unexpected, or out-of-distribution inputs to a single model to find where its behavior breaks down. Scope is bounded: one model, one input distribution, one set of failure modes. Output is narrow and repeatable: for a given prompt, you record whether the model misclassified, leaked context, or accepted an instruction it should not have. This is the closest descendant of the original ML meaning of adversarial, where adversarial examples were gradient-based perturbations that flipped a classifier. The vocabulary expanded with LLMs to cover jailbreaks, prompt injection payloads, and disclosure probes, but the shape is the same: bounded inputs, bounded outputs.

AI red teaming is the simulation of a goal-directed adversary against the full deployment of an AI system. Scope is the model plus its prompt orchestration, its retrieval layer, its tools and connectors, its identity surface, and the downstream systems it touches. The output is not a list of prompts that broke the model. The output is a finding-narrative: how an attacker chained a malicious upload through retrieval, into the model, into a tool call, into an external email. The adversarial probe is one move in a longer game.

AI penetration testing is a structured engagement against a defined target with explicit rules of engagement, a scope boundary, a timeline, and a contractual deliverable, usually a report mapped to a framework. It is the most contractual of the three, the most compliance-mappable, and often the wrapper procurement actually pays for. Adversarial testing and red teaming are activities inside it.

The three relate like nested boxes. Adversarial testing lives inside red teaming when the red team uses probe batteries as one of its tactics. Both live inside penetration testing when the pentest scope covers the deployed AI system rather than only the surrounding infrastructure. The relationship is real, but each box does something the other two cannot.

A visual taxonomy

Bounded inside goal-directed inside contractual. Read the diagram from the middle out: the inner box is the activity, and each ring beyond it is the wrapper that contains it.

When your CISO is asking for adversarial testing

The CISO asks for adversarial testing when the question is about a model, not a system. Four scenarios show up reliably.

Pre-deployment evaluation of a new LLM-backed feature. A product team wants to ship a customer-facing chatbot, an internal assistant, or a new RAG flow. Before launch, someone needs to answer whether the model will say racist things, leak the system prompt, or accept obvious jailbreak payloads. The answer comes from a probe battery against the candidate configuration with per-category attack success rates.

Model selection. The team is choosing between two foundation models for the same task. The right question is not which one feels smarter in a demo; it is which one succumbs to fewer attacks under the same battery. Adversarial testing produces a comparable number per model per attack category. Comparable is the point.

Regression after a model upgrade. GPT-5 ships, Claude updates, the in-house fine-tune retrains. The same battery runs against the new configuration. The diff between yesterday's pass rate and today's pass rate is the regression.

Continuous evaluation as a release gate. The probe battery becomes part of the deployment pipeline. A regression fails the build the same way a unit test failure fails the build. This is where automated adversarial testing produces the most value, because no human team scales to thousands of probes against every model update.
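A minimal sketch of such a gate, assuming the battery already emits a per-category attack success rate (the fraction of probes that broke the model) for both the last release and the candidate. The function names, the category labels, and the `tolerance` parameter are illustrative assumptions, not the interface of any particular tool.

```python
from typing import Dict, List


def gate(baseline: Dict[str, float],
         candidate: Dict[str, float],
         tolerance: float = 0.01) -> List[str]:
    """Return the categories where the candidate regressed.

    Rates are attack success rates in [0, 1]; lower is better.
    A category regresses when its rate rises by more than `tolerance`,
    or when it is missing from the candidate run (coverage was lost).
    """
    regressions = []
    for category, old_rate in sorted(baseline.items()):
        new_rate = candidate.get(category)
        if new_rate is None or new_rate - old_rate > tolerance:
            regressions.append(category)
    return regressions


def exit_code(regressions: List[str]) -> int:
    """Map the gate result to a process exit code; non-zero fails the build."""
    return 1 if regressions else 0


# Hypothetical numbers for yesterday's release and today's candidate.
baseline = {"jailbreak": 0.04, "prompt_injection": 0.10, "disclosure": 0.02}
candidate = {"jailbreak": 0.03, "prompt_injection": 0.18, "disclosure": 0.02}

failed = gate(baseline, candidate)
print("regressed:", failed, "exit code:", exit_code(failed))
```

In a pipeline, the value from `exit_code` would be passed to `sys.exit` so that a regression stops the release the same way a failing unit test does. The tolerance exists because sampled model outputs vary between runs; how wide to set it is a judgment call for each team.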

Adversarial test categories worth running on every deployment include:

Jailbreaks target the model's safety post-training. Personas like DAN, role-play frames, hypothetical scaffolds, and policy puppetry are attempts to push the model past the constraints its post-training tried to impose. OWASP catalogues these under LLM01 alongside prompt injection. Coverage means running known families, not one prompt per family.

Direct and indirect prompt injection target the boundary between operator instructions and untrusted input. Direct injection comes through the user channel. Indirect injection comes through a document, a web page, an email, a calendar invite, or any other content the model retrieves. Indirect is the one production teams underestimate, because the attacker never speaks to the user.

Sensitive information disclosure probes whether the model leaks its system prompt, its tool descriptions, its retrieved context, or memorized training data. Techniques include direct extraction prompts, completion attacks, and adversarial role-play. OWASP covers this under LLM02 and LLM07.

Bias and harmful content triggers are content-policy tests, not security tests in the traditional sense, but they share infrastructure with security adversarial testing and most regulators expect them in the same report.

Multilingual evasion probes whether attacks that fail in English succeed in lower-resource languages where safety post-training is thinner. It is one of the highest-yield categories against models that are otherwise well-tuned.

Tool-use abuse probes whether the model can be tricked into invoking tools it should not, with arguments it should not. Full tool-use abuse testing belongs to red teaming because the attack surface is the deployed system, but probing the model's tool-binding behavior in isolation is bounded enough to count as adversarial testing.

A good adversarial battery targets all of these categories, runs at the scale of thousands of probes per category, and produces a per-category attack success rate: the fraction of probes in that category that got the model to misbehave. That number is what you compare across models, releases, and configurations.
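As a rough sketch of how individual probe outcomes roll up into that per-category number, assume each probe run is recorded with its category and a flag for whether the attack worked. The `ProbeResult` record and its field names are hypothetical; how `attack_succeeded` gets decided (string match, classifier, human review) is outside this sketch.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class ProbeResult:
    category: str           # e.g. "jailbreak", "indirect_injection"
    probe_id: str
    attack_succeeded: bool  # True when the model did what the attacker wanted


def attack_success_rates(results: Iterable[ProbeResult]) -> Dict[str, float]:
    """Fraction of probes per category where the attack succeeded."""
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for r in results:
        totals[r.category] += 1
        if r.attack_succeeded:
            hits[r.category] += 1
    return {category: hits[category] / totals[category] for category in totals}


# Tiny illustrative run; a real battery has thousands of probes per category.
results = [
    ProbeResult("jailbreak", "jb-001", True),
    ProbeResult("jailbreak", "jb-002", False),
    ProbeResult("indirect_injection", "ii-001", False),
    ProbeResult("indirect_injection", "ii-002", False),
]
rates = attack_success_rates(results)
print(rates)
```

Running the same probes against two candidate models and scoring each run with this function gives the side-by-side numbers that model selection needs.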

When your CISO is asking for red teaming

The CISO asks for red teaming when the AI does something. Not when the AI says something, but when it acts.

The trigger condition is real consequences. A model that recommends a movie has a small blast radius. An agent that books travel, writes code into a repository, reads incoming emails, or calls APIs into production has a blast radius that includes everything those tools touch. The question shifts from whether the model can be tricked to whether the deployment can be compromised through the model.

Three scenarios push a CISO from adversarial testing into red teaming.

The first is path-based vulnerabilities. The exploit lives in a sequence, not a single prompt. A user uploads a PDF. Retrieval pulls a chunk into context. The model summarizes the chunk into a draft email. The agent sends the email to an external recipient. Each component looks safe in isolation. The chain leaks data. Adversarial testing of any one component misses this. Red teaming traces the path because the test is goal-directed.

The second is agent attack surfaces. Agentic AI multiplies the attack surface by the number of tools, connectors, memory layers, and trust boundaries. Workstation agents like Claude Code, Cursor, and Codex CLI are the canonical example: a single agent sits between the user, the file system, the shell, the editor, the browser, multiple MCP servers, and any skill bundle the user has loaded. Repello's research on the malicious skill supply chain and on trust-dialog bypass attacks both live in that surface. None of those vulnerabilities show up if you only probe the underlying model with prompts. They show up when a red team simulates the attacker targeting the deployed agent.

The third is stakeholder communication. An adversarial report says "the model accepted this jailbreak 27 percent of the time." A red team report says "a customer with no special access exfiltrated the support team's mailbox by sending one crafted email." The second story is what gets a non-technical executive to fund the fix.
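The path-based scenario above (upload, retrieval, summary, send) can be made concrete by treating the deployment as a data-flow graph and asking whether untrusted input can reach an external sink. The sketch below is only an illustration: the node names and the `FLOWS` map are hypothetical, and enumerating candidate chains is a planning aid, not a substitute for exercising the path with live payloads.

```python
from typing import Dict, Iterator, List, Tuple

# Hypothetical data-flow edges for the upload-to-email chain described above.
# A real map would come from the deployment's architecture, not from this list.
FLOWS: Dict[str, List[str]] = {
    "pdf_upload": ["retrieval_index"],
    "retrieval_index": ["model_context"],
    "model_context": ["draft_email"],
    "draft_email": ["send_email_tool"],
    "send_email_tool": ["external_recipient"],
}


def attack_paths(flows: Dict[str, List[str]], source: str, sink: str,
                 visited: Tuple[str, ...] = ()) -> Iterator[List[str]]:
    """Yield every cycle-free path from an untrusted source to a sink."""
    if source == sink:
        yield [sink]
        return
    visited = visited + (source,)
    for nxt in flows.get(source, []):
        if nxt in visited:
            continue  # do not walk around a cycle
        for rest in attack_paths(flows, nxt, sink, visited):
            yield [source] + rest


chains = list(attack_paths(FLOWS, "pdf_upload", "external_recipient"))
for chain in chains:
    print(" -> ".join(chain))
```

Every edge in `FLOWS` looks benign when reviewed alone. The finding is the whole chain, which is why the red-team deliverable is a narrative rather than a list of prompts.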

Repello's ARTEMIS runs goal-directed red-team campaigns against deployed AI systems, chaining probes across the model, the connectors, and the tools the agent actually uses. Treat the audit-time engagement as the deep dive and ARTEMIS as the always-on regression layer between engagements.

Red teaming maps cleanly onto MITRE ATLAS, the adversarial knowledge base that catalogues techniques and tactics specific to ML systems. Where adversarial testing lives mostly in the ML Attack Staging tactic (crafting the probe), red teaming sequences ATLAS tactics from Reconnaissance through Impact, the same way ATT&CK sequences against traditional infrastructure. The ATLAS framework is the right reference for translating red-team findings into a vocabulary security leadership and SOC teams already understand.

When your CISO is asking for penetration testing

Penetration testing is the contractual frame. Four scenarios drive a CISO to ask for one specifically.

Compliance audit. SOC 2 Type II, ISO 27001, ISO 42001, EU AI Act, sector regimes like HIPAA and FedRAMP, and customer-imposed SIG questionnaires either reference "penetration testing" by name or are routinely evidenced with a pentest report. The check-box requires a deliverable that says penetration test at the top. The activity inside might be 70 percent adversarial probe runs and 30 percent path-tracing red team work, but the framework wants a pentest report.

Procurement. A customer's vendor security review asks for a recent AI pentest report. Internal red team exercises and CI-gated probe batteries do not satisfy that ask, because procurement wants third-party attestation with rules of engagement and scope.

Pre-launch sign-off. Before a high-risk feature ships (an agent with production write access, a customer-facing assistant with PII in context, a clinical tool with patient data) leadership wants a structured engagement with a clear scope, a clear deliverable, and a clear risk-acceptance moment. Pentest fits that ritual better than the more open-ended red team exercise.

Post-incident validation. Something broke. Leadership now wants an external, structured assessment that the fix holds. The deliverable matters as much as the testing.

The honest position on AI penetration testing: it is an umbrella term, not a distinct testing technique. A modern AI pentest contains an adversarial probe battery against the deployed model, a goal-directed red team exercise against the broader system, and traditional application security testing against the surrounding API, authentication, storage, and infrastructure layers. The thing that makes it a pentest is the rules of engagement, the framework mapping, and the report, not a unique methodology. Vendors selling AI pentesting that only deliver probe batteries are mislabeling adversarial testing. Vendors selling AI pentesting that skip the application security layer are mislabeling red teaming. Repello's deep guide to LLM pentesting walks the actual workflow end to end.

The collision

Why did three terms collapse into each other? Four causes, in order of culpability.

Vendor marketing. Each term sells to a different buyer. Adversarial testing sells to ML engineering. Red teaming sells to security. Penetration testing sells to compliance and procurement. A vendor chasing all three buyers has every incentive to use the three words interchangeably, even when the underlying product only does one of them.

Framework inconsistency. OWASP LLM Top 10, NIST AI 600-1, MITRE ATLAS, and the EU AI Act all reference the activities but do not standardize the vocabulary. OWASP uses "adversarial testing" as the broad umbrella in some sections and "red teaming" in others. NIST AI 600-1 uses "red teaming" as a catch-all activity inside the Measure function. ATLAS uses "ML attack" without committing to either term. Industry chatter mirrors the inconsistency back, amplified.

Academic origin drift. Adversarial entered ML through adversarial examples, the line of work popularized by the Goodfellow et al. 2015 result on gradient-based perturbations that flipped image classifiers. That is the narrow meaning. The security industry adopted adversarial as a synonym for any attacker-driven probe, especially after the LLM transition where gradient-based attacks gave way to prompt-shaped attacks. Both meanings still circulate.

Buyer language. Customers ask for what their compliance team or their board told them to ask for. If the board said "we want AI red teaming", the buyer says red teaming. If the SOC 2 auditor said "we want a pentest", the buyer says pentest. The buyer is rarely the person who can disentangle the terms; that work falls on the vendor.

The fix is not to declare one canonical definition. Three definitions exist for a reason. The fix is to match the term to the activity by three properties: what is bounded (adversarial testing), what is goal-directed (red teaming), and what is contractual (pentesting). When the three properties align with the term, the vocabulary works. When they do not, ask what is being bought.

A practical decoder for buyers

The matrix below is the artefact to bookmark. Scenario in the left column, term to use in the middle, what to actually ask for in the right column.

Scenario	Term to use	What to ask for
Pre-deployment evaluation of a new LLM feature	Adversarial testing	Probe battery across OWASP LLM Top 10 categories with per-category success rates
Comparing two candidate models	Adversarial testing	Same battery against both, side-by-side score sheet
CI / CD release gate	Adversarial testing	Automated probes, threshold rules, fail-build behavior on regression
Post-model-upgrade regression check	Adversarial testing	Same battery as last release, diff report
Agent has tools that touch production	Red teaming	Goal-directed scenarios, path-based findings, ATLAS-mapped narrative
Agentic workflow with external connectors (MCP, browser, mail)	Red teaming	Connector-aware scenarios, trust-boundary mapping, chained exploits
Demonstrating impact to executive leadership	Red teaming	Storyline-shaped report, blast-radius analysis, business-impact framing
SOC 2, ISO 42001, or EU AI Act audit	Pentesting	Framework-mapped engagement with formal scope, rules of engagement, deliverable
Customer procurement requesting attestation	Pentesting	Third-party report, executive summary, remediation tracking
Pre-launch sign-off for a high-risk feature	Pentesting + red teaming	Structured engagement scoping both probe coverage and goal-directed scenarios
Post-incident validation	Pentesting + targeted red teaming	Re-test of the exploited path plus broader coverage check
New feature in a low-risk tier (read-only chatbot, no tools)	Adversarial testing	Probe battery only; defer full red team until risk tier escalates

In the accompanying diagram, the primary dots stairstep down the diagonal: bounded questions in the adversarial column, goal-directed in the red-team column, contractual in the pentest column. The hollow dots are what each engagement contains.

The pattern is the same as the definitions. Bounded questions go to adversarial testing. Goal-directed questions go to red teaming. Contractual questions go to pentesting. When a scenario is bounded and contractual, the engagement is a pentest with an adversarial battery inside it. When a scenario is goal-directed and contractual, the engagement is a pentest with a red team exercise inside it. Most real scenarios are some mix of the three.
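The decoder reduces to a small decision rule over the three properties. The sketch below encodes it as a function; the function name and the exact wording of the returned labels are illustrative, and real scoping conversations carry more nuance than three booleans.

```python
def engagement(bounded: bool, goal_directed: bool, contractual: bool) -> str:
    """Map the three properties of a request to the engagement to ask for."""
    inner = []
    if bounded:
        inner.append("adversarial battery")
    if goal_directed:
        inner.append("red team exercise")

    if contractual:
        # The contractual frame makes it a pentest; the rest is what it contains.
        if inner:
            return "pentest containing " + " and ".join(inner)
        return "pentest"
    if bounded and goal_directed:
        # Probe batteries used as one tactic inside a red team campaign.
        return "red teaming with an adversarial battery inside"
    if goal_directed:
        return "red teaming"
    if bounded:
        return "adversarial testing"
    return "unclear: ask what is being bought"


# A CI release gate: bounded, not goal-directed, not contractual.
print(engagement(bounded=True, goal_directed=False, contractual=False))
# Pre-launch sign-off for a high-risk agent: all three properties apply.
print(engagement(bounded=True, goal_directed=True, contractual=True))
```

The fall-through case is deliberate: when a request is none of bounded, goal-directed, or contractual, the useful next step is to clarify the request rather than to pick a label.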

Where Repello sits

Repello does all three, and the right way to read the product surface is to map each capability back to the matrix above.

Adversarial testing is ARTEMIS. It is an automated probe battery across OWASP LLM Top 10 categories that runs continuously, integrates into release pipelines, and produces per-category attack success rates that compare cleanly across models and releases. The top four rows of the matrix, along with the final low-risk row, expect this shape: bounded, repeatable, machine-scale.

Red teaming is ARTEMIS chained into deployment-level scenarios, plus human engagements. ARTEMIS handles the automated side, taking the same probe library used for adversarial testing and chaining it across the deployment surface (the model plus its connectors, tools, retrieval layer, and downstream actions) rather than treating the model in isolation. Scenario-driven human engagements handle the audit-time side, where a team chases specific goals across the deployment (an agent with mailbox access, an MCP-connected workstation agent, a RAG pipeline with a high-trust source). The middle three rows of the matrix expect this shape.

Penetration testing is human-led engagements with framework mapping. Engagements are scoped against the customer's deployment and framed against OWASP LLM Top 10, MITRE ATLAS, NIST AI 600-1, and any sector regulation the customer needs to satisfy. They include adversarial probe coverage and goal-directed red team work, plus traditional application security against the surrounding stack, and they produce the report the auditor wants. The four pentesting rows of the matrix (audit, procurement, pre-launch sign-off, and post-incident validation) expect this shape.

The buying decision is not which vendor to choose for which term. The terms are activities, not products. The buying decision is which capability to start with, based on the scenario the CISO is currently facing. Pre-launch sign-off for a customer-facing assistant starts at ARTEMIS for the probe coverage and escalates into a scoped pentest engagement before the launch date. A workstation agent rollout starts at scenario-driven red teaming because the path-based exploits live in the connector graph. A SOC 2 audit starts with the pentest engagement because the deliverable is the contractual artefact.

If your CISO has asked for "AI adversarial testing", you can answer. If they asked for "AI red teaming", you can answer that too. If they asked for "AI penetration testing", same answer, different package. Tell us which of the three they asked for, and we will show how the same underlying capabilities answer all three.

Frequently asked questions

What is the difference between adversarial testing and red teaming for AI?

Adversarial testing is bounded: a fixed battery of malicious or unexpected inputs against one model, measuring where it misbehaves. Red teaming is goal-directed: a tester or system simulating one chases a real attacker objective across the full deployment, which means the model plus its prompt orchestration, retrieval layer, tools, and downstream actions. Adversarial testing answers can this model be tricked. Red teaming answers can this deployment be compromised.

Is AI penetration testing the same as AI red teaming?

No. AI penetration testing is a structured assessment against a defined target with explicit rules of engagement, scope boundaries, and a contractual deliverable, usually a report mapped to a framework like OWASP LLM Top 10 or NIST AI 600-1. AI red teaming is the activity inside that assessment that simulates a goal-directed adversary. In practice an AI pentest contains adversarial testing plus red teaming plus traditional application security work, which is the reason the three terms keep colliding.

When should a CISO ask for adversarial testing instead of red teaming?

Ask for adversarial testing when the question is about a model in isolation: pre-deployment evaluation of a new LLM feature, comparing two candidate models on a fixed attack battery, regression testing after a model upgrade, or running a CI gate that fails the build if attack success rates regress. Ask for red teaming when the question is about a deployed system with real consequences, especially anything agentic with tool access, RAG, or external connectors.

Does AI penetration testing replace adversarial testing and red teaming?

It wraps them. A serious AI pentest will include an adversarial probe battery, a goal-directed red team exercise, and traditional application security testing against the surrounding API, auth, and storage layers. The contractual frame is what makes it a pentest, not a different testing technique. Most vendors marketing AI pentests are selling some combination of the other two with a report attached.

Which framework should I map AI adversarial testing to?

Use OWASP LLM Top 10 as the coverage taxonomy for what to test, MITRE ATLAS as the technique knowledge base for how the attacks work, and NIST AI 600-1 as the governance layer for what to document. Adversarial testing aligns most cleanly with OWASP categories like LLM01 Prompt Injection, LLM02 Sensitive Information Disclosure, and LLM07 System Prompt Leakage. Red teaming maps better to ATLAS tactic sequences like ML Attack Staging plus Exfiltration.

Why did adversarial testing and red teaming get used interchangeably?

The word adversarial originally meant adversarial examples in machine learning research: small gradient-based perturbations that flipped a classifier. As LLMs took over, security teams started calling any goal-directed attack adversarial, and red teaming vendors marketed every probe battery as red teaming. The result is three vocabularies stacked on top of each other. The honest fix is not to standardize one definition, it is to match the term to the activity by what is bounded, what is goal-directed, and what is contractual.