Back to all blogs

|
|
10 min read


TL;DR: AI red teaming is systematic adversarial testing of LLMs and AI agents to find exploitable vulnerabilities before attackers do. It differs fundamentally from traditional penetration testing: the attack surface is probabilistic, vulnerabilities are model-behavior-based, and patches are not discrete code fixes. The OWASP LLM Top 10 provides the coverage framework. The five-step methodology in this guide gives you the operational workflow. For continuous coverage at scale, automated red teaming platforms like ARTEMIS are increasingly a security team requirement, not a nice-to-have.
What is AI red teaming?
AI red teaming is the practice of probing AI systems, including large language models, multimodal models, and AI agents, using adversarial techniques to surface security vulnerabilities before attackers can exploit them. The objective is identical to any red team exercise: produce empirical findings about what an attacker can do, based on actual attack attempts against the live system.
The term comes from military and corporate security tradition, but AI red teaming targets a fundamentally different attack surface. You are not exploiting misconfigured network services or unpatched CVEs. You are testing how a model responds to adversarial inputs, whether it can be induced to violate its own operating constraints, what information it leaks from its context window or training data, and how it behaves when embedded in agentic workflows with access to real tools and systems.
Microsoft's AI Red Team distinguishes two overlapping objectives: safety red teaming (testing for harmful content generation and policy violations) and security red teaming (testing for data exfiltration, system compromise, and unauthorized tool use). Enterprise security programs need both, but the security dimension is consistently underinvested because it does not map cleanly to existing vulnerability management workflows.
The NIST AI Risk Management Framework explicitly includes adversarial testing in its Manage function guidance on AI system monitoring and response. Building AI red teaming into your NIST AI RMF implementation satisfies both the security objective and the governance documentation requirement.
How AI red teaming differs from traditional penetration testing
Traditional penetration testing targets a deterministic attack surface: network services, application logic, authentication systems, access controls. Vulnerabilities either exist or they do not. Findings map to CVEs. Patches are discrete changes to code or configuration.
AI systems work differently. A large language model is a probabilistic function. The same input can produce different outputs across runs. Vulnerabilities are often emergent: they arise not from a single misconfigured component but from how the model interprets context, instruction, and intent under adversarial conditions.
Four specific challenges separate AI red teaming from traditional practice:
Coverage is statistical, not binary. You cannot enumerate all possible inputs to an LLM. Effective testing requires sampling across attack categories and measuring success rates. A prompt injection technique that succeeds 10% of the time is a real, exploitable vulnerability that a determined attacker will exercise at scale.
Vulnerabilities do not patch cleanly. Remediating a prompt injection bypass may require modifying the system prompt, adding output filtering, deploying a guard model, or retraining. Each intervention can introduce new failure modes. Testing the fix is as important as finding the original vulnerability.
The threat model includes model behavior, not just external adversaries. In traditional security, the application is trusted and attackers are external. In AI security, the model itself can become a vector: an attacker who manipulates the model into acting against its operator's interests has achieved a meaningful compromise without touching any underlying infrastructure.
Deployment context defines the attack surface. A base model that behaves safely as a standalone chat interface can be directly exploitable when connected to a file system, email client, code interpreter, or customer database. Red teaming must target the deployed configuration as a whole, not the model in isolation. Repello's analysis of security threats in agentic AI browsers documents how this gap plays out in real deployments.
Dimension | Traditional penetration testing | AI red teaming |
|---|---|---|
Attack surface | Network services, application code, authentication, access controls | Model behavior, context window, training data, agentic tool layer, RAG pipeline |
Failure mode | Deterministic: vulnerability either exists or is patched | Probabilistic: attack may succeed 8% or 80% of attempts; same input can produce different outputs |
Coverage method | CVE-mapped exploit library, scanner output, manual exploitation | OWASP LLM Top 10 coverage, adversarial input sampling across attack categories, behavioral measurement |
Remediation approach | Discrete code or configuration fix; patch and retest | System prompt changes, output filtering, guardrail deployment, fine-tuning; each fix can introduce new failure modes |
Frequency | Point-in-time engagement, typically quarterly or annually | Continuous; every model update, prompt change, or new data source connection resets the attack surface |
Tooling | Nmap, Burp Suite, Metasploit, static analysers | Garak, PyRIT, MITRE ATLAS, automated red teaming platforms, manual adversarial testing |
The OWASP LLM Top 10 as a red team framework
The OWASP LLM Top 10 (2025 edition) is the standard classification of security risks in deployed LLM applications. For red teams, it functions as a structured test plan: ten attack categories, each mapping to specific adversarial techniques, that any production LLM deployment should be assessed against. Coverage across all ten categories provides a defensible baseline that holds up to auditor and leadership scrutiny.
LLM01: Prompt Injection. Test whether adversarial inputs can override system prompt instructions. Includes direct injection (the user manipulates model behavior through their input) and indirect injection (malicious instructions embedded in documents, web pages, or API responses the model retrieves). This is the highest-risk and most actively exploited category in deployed systems.
LLM02: Sensitive Information Disclosure. Test whether the model leaks training data, system prompt contents, or data from other users' sessions. Techniques include completion attacks, role-play prompts, and context manipulation designed to push the model past its disclosure boundaries.
LLM03: Supply Chain Vulnerabilities. Assess the security of fine-tuning datasets, third-party base models, and plugins connected to the LLM. This maps traditional supply chain security analysis onto model artifacts and training pipelines.
LLM04: Data and Model Poisoning. If your deployment includes fine-tuning or retrieval-augmented generation, test whether an attacker could introduce malicious content into the training data or knowledge base to manipulate downstream model behavior. Repello's research on RAG poisoning attacks against production LLM systems demonstrates how this executes against live RAG pipelines with measurable behavioral impact.
LLM05: Improper Output Handling. Test what happens when model output is passed directly to downstream systems: SQL queries, shell commands, HTML renderers, API calls. If output is not sanitized before execution, the model becomes an injection vector into those systems regardless of how well the model itself is protected.
LLM06: Excessive Agency. For agentic systems, test whether the model can be induced to take actions beyond its intended scope: sending emails, modifying files, escalating API permissions, executing code. A 2024 study by UIUC researchers found that LLM-based agents successfully exploited real one-day vulnerabilities in 87% of tested cases when given tool access, demonstrating that agency without strict scope control is a critical security failure.
LLM07: System Prompt Leakage. Attempt to extract the system prompt through direct requests, inference, or jailbreaking. System prompts frequently contain sensitive business logic, internal tool descriptions, and occasionally API credentials that operators did not intend to expose.
LLM08: Vector and Embedding Weaknesses. If your system uses semantic search over a vector database, test whether adversarial queries can manipulate retrieval results or whether sensitive documents can be extracted through embedding-based attacks.
LLM09: Misinformation. Test whether the model can be induced to generate confidently-stated false information in contexts where accuracy is critical: legal documents, security advisories, financial analysis, or medical information.
LLM10: Unbounded Consumption. Test for denial-of-service conditions: inputs that cause excessive token generation, runaway agentic loops, or compute-intensive processing that degrades service availability or creates unexpected infrastructure cost.
Depth of testing should reflect your deployment's risk profile. A customer-facing chatbot with read-only RAG access has a very different risk distribution than an agentic workflow with write access to production systems. Prioritize accordingly.
A 5-step AI red team methodology
The following methodology is designed for enterprise teams and built to be repeatable rather than a one-time assessment.
Step 1: Threat modeling and scope definition
Before writing a single test case, define what you are testing, who the adversaries are, and what a successful attack looks like. Identify the system's key assets: customer data in context, system credentials in the prompt, connected APIs with privileged access. Map potential attacker profiles: external users, internal users with elevated permissions, third-party integrations with access to the model's context.
Document the full deployment architecture: base model version, system prompt, fine-tuning configuration, RAG data sources, connected tools and their permission levels, output handling pipeline, and user access tiers. Without this map, testing will systematically miss attack vectors. A red team engagement is only as complete as its scope definition.
Step 2: Attack surface enumeration
Map every input channel and every downstream system the model can affect. For a typical enterprise LLM deployment this includes: the user input channel, documents and data the model retrieves from, external APIs it can call, dynamic context injected into the prompt, the output rendering layer (is output executed anywhere downstream?), and logging systems that process model output.
For agentic systems, document the permission set precisely. What files can the agent read and write? What network calls can it make? What actions are irreversible: sent emails, deleted records, published content, executed code? The scope of potential downstream impact determines severity weighting for every finding.
Step 3: Adversarial testing
Execute test cases across the OWASP LLM Top 10 categories relevant to your deployment. Run both automated testing using the tools covered in the next section and manual testing, which remains essential for novel attack chains that automated tools miss. Automated tools provide coverage breadth; manual testing provides depth and creative adversarial judgment.
For each category, test multiple variations. Prompt injection has hundreds of documented bypass techniques; testing a single variant is not adequate coverage. Record exact inputs, model responses, attack success or failure, and what information or access a successful attack provides. The LLM pentesting checklist provides a structured breakdown of test cases by attack category.
Step 4: Findings analysis and risk scoring
Score each finding using a consistent rubric. CVSS is not well-adapted to LLM vulnerabilities. A practical adapted rubric scores findings across: exploitability (how reliably can the attack be reproduced?), impact scope (does it affect only the model, or extend to connected systems?), data sensitivity (what is accessible through a successful exploit?), and effort required (single-prompt or complex multi-turn manipulation?).
Produce a findings report that maps each vulnerability to its OWASP LLM Top 10 category, the specific deployment component affected, a reproduction case, and a recommended remediation path. Vague findings like "the model can be jailbroken" are not actionable. Specific findings with reproduction steps and scoped impact are.
Step 5: Remediation validation and continuous retesting
Test remediations before closing findings. An output filter that blocks the tested payload may not block semantically equivalent variants. A system prompt patch that fixes one attack vector may introduce unintended behavioral changes in other areas. Regression testing is not optional in AI security.
Critically, AI red teaming must be continuous rather than periodic. Repello's research on the zero-day collapse in AI security documents how mean time to exploit for known AI vulnerabilities has collapsed from hundreds of days to hours as the attacker community industrializes techniques. A point-in-time red team report has a short validity window; the model, its connected data sources, and the threat landscape all change continuously.
Tools for AI red teaming
The tooling ecosystem is maturing, but has not yet reached parity with traditional security tooling. The following tools form the practical baseline for an enterprise AI red team program.
Garak (open source) is an LLM vulnerability scanner built specifically for adversarial probing of language models. It runs automated probe sequences across multiple attack categories and generates structured findings reports. Garak is most effective for rapid baseline coverage when onboarding a new model or deployment configuration.
PyRIT (Microsoft, open source) is the Python Risk Identification Toolkit for AI, released by Microsoft's AI Red Team. PyRIT provides components for building custom adversarial testing workflows: multi-turn adversarial conversation loops, jailbreak libraries, and automated output scoring. It is designed for security teams building bespoke test suites against specific deployment contexts rather than running off-the-shelf probes.
MITRE ATLAS is the adversarial ML threat knowledge base, functionally equivalent to ATT&CK for AI systems. ATLAS catalogs documented attack techniques against AI and ML systems with case study evidence from real-world incidents. Use it to map test coverage to confirmed real-world techniques, which strengthens findings reports and prioritization decisions.
Manual testing remains the highest-signal approach for complex deployments. Skilled adversarial testers find attack chains that automated tools miss: multi-turn manipulations, encoding tricks, context-specific instructions that exploit application-specific model behavior. Manual capacity is not a substitute for automated tooling, but neither is automated tooling a substitute for manual judgment.
For teams looking to automate continuous coverage between manual engagements, breach and attack simulation for AI provides the systematic breadth layer: automated scenarios running after every model update or configuration change, producing current-state control coverage without the scheduling constraints of a full red team exercise.
Automated AI red teaming at scale
Enterprise deployments face a scaling constraint that manual testing cannot solve. A production LLM handles high input volumes; its behavior can change with every model update, fine-tuning run, system prompt modification, or new data source connection. Security teams need continuous adversarial coverage, not a quarterly assessment.
ARTEMIS, Repello's automated red teaming engine, is built for this operational requirement. It runs continuous adversarial probing across OWASP LLM Top 10 categories, surfaces new vulnerabilities as the deployment evolves, and integrates findings into existing security workflows.
Repello's benchmark testing across model families and deployment configurations shows substantial variance driven by configuration rather than base model selection. "Identical base models can show a 4.8% breach rate under hardened configuration and a 28.6% breach rate under permissive default settings," according to Repello AI Research Team data. This variance is only discoverable through systematic adversarial testing at scale. Manual exercises cannot produce the statistical sample needed to detect configuration-level risk differences of this magnitude.
For teams building out their AI security program from scratch, the LLM pentesting guide provides a detailed starting methodology that complements automated tooling with structured manual test coverage.
Building AI red teaming into your security program
Running a red team engagement against a new AI deployment for the first time is tractable: use the methodology above, document findings, implement remediations. The harder organizational challenge is making AI red teaming continuous and integrated rather than a standalone event.
Three operational practices make this sustainable. First, tie red team testing to the model release cycle: any update to the base model, fine-tuning dataset, system prompt, or connected tools triggers a targeted re-test of the affected attack categories. Second, track breach rates over time as a security metric rather than treating each engagement as independent: regression in attack resistance is an early signal that something in the deployment changed. Third, route findings directly to both security and ML engineering: the teams responsible for model behavior need specific, reproducible findings with severity context, not risk summaries.
The security teams that are ahead on AI red teaming have treated it the same way they treat vulnerability management: automated continuous scanning, structured findings pipeline, SLA-backed remediation, and regression testing as standard practice. The organizations still running annual assessments against AI systems that change weekly are operating with a structural visibility gap.
AI red teaming is not a checkbox exercise. Every model update, every new RAG data source, every agentic tool added to a deployment is a potential new attack surface. The question for security teams is not whether to red team AI systems; it is whether to do it systematically before incidents or reactively after them.
Request an ARTEMIS demo to see continuous automated AI red teaming operating in a production environment.
Frequently asked questions
What is AI red teaming?
AI red teaming is the systematic adversarial testing of AI systems, including large language models and AI agents, to identify security vulnerabilities before attackers can exploit them. It covers attack categories including prompt injection, sensitive data extraction, jailbreaking, agentic tool abuse, and RAG poisoning. Unlike vendor security assessments or compliance audits, red teaming produces empirical findings based on actual attack attempts against the deployed system in its live configuration.
What is the difference between AI red teaming and traditional red teaming?
AI red teaming and traditional red teaming share the same objective: find what an attacker can do before the attacker finds it first. The difference is the attack surface. Traditional red teaming probes networks, endpoints, and application logic for deterministic vulnerabilities that map to CVEs and patch with code changes. AI red teaming probes model behavior, training pipelines, and agentic tool layers for probabilistic failure modes that patch with system prompt changes, guardrail deployment, or fine-tuning. An enterprise running traditional red team exercises but not AI red teaming has blind spots across every AI system it has deployed.
How is AI red teaming different from traditional penetration testing?
Traditional penetration testing targets infrastructure: network services, application logic, and access controls with deterministic, patchable vulnerability signatures. AI red teaming targets model behavior: how the system responds to adversarial inputs, whether it can be manipulated into violating its operating constraints, and whether it leaks information from its training data or context window. The attack surface is probabilistic, which means testing must cover distributions of inputs and measure success rates. Remediation also differs: fixing a model behavior requires system prompt changes, output filtering, fine-tuning, or architectural changes rather than a discrete code patch.
What does an AI red team actually test?
An AI red team tests the full attack surface of a deployed AI system: direct and indirect prompt injection, system prompt extraction, jailbreaking and safety bypass, sensitive data extraction from context or training data, agentic tool abuse and excessive agency, RAG poisoning and vector embedding manipulation, improper output handling that creates injection vectors into downstream systems, and denial-of-service conditions from unbounded resource consumption. Coverage should map to the OWASP LLM Top 10, prioritized by the specific deployment's risk profile.
What is the OWASP LLM Top 10 and how does it apply to red teaming?
The OWASP LLM Top 10 is the authoritative classification of the ten most critical security risks in deployed LLM applications, maintained by the Open Worldwide Application Security Project. Each category, from prompt injection (LLM01) to unbounded consumption (LLM10), maps to specific adversarial techniques that security teams can translate into concrete test cases. Using it as a red team coverage framework ensures that engagements address the categories with the most documented real-world exploitation history, rather than relying on ad hoc test design.
How often should organizations conduct AI red teaming?
AI red teaming should be continuous rather than periodic. Model behavior can change with model updates, fine-tuning runs, system prompt modifications, or changes to connected data sources, any of which can introduce new vulnerabilities or regress existing remediations. A one-time assessment produces a point-in-time snapshot that may be outdated within weeks. Security teams should run automated adversarial testing on every deployment change and conduct manual exercises when the deployment architecture changes significantly.
What tools are used for AI red teaming?
The primary open-source tools are Garak (automated LLM vulnerability scanning), PyRIT from Microsoft (a framework for building custom adversarial testing workflows), and MITRE ATLAS (the adversarial ML threat knowledge base for technique-to-coverage mapping). Enterprise teams typically layer these with a dedicated platform for continuous automated coverage and structured findings management. Repello's ARTEMIS provides automated probing across the OWASP LLM Top 10 with integration into existing security workflows and continuous monitoring across model updates.
Share this blog
Subscribe to our newsletter











