

TL;DR: Most AI red teaming programs produce reports, not measurements. Without a defined metrics framework, there is no way to know whether your red team is improving your security posture, catching regressions, or simply generating findings that get filed and forgotten. The six metrics that matter are: attack success rate (ASR), guardrail bypass rate, mean time to detection (MTTD), coverage completeness, false positive rate, and cross-version regression delta. Tracked over time, these turn red teaming from a one-time audit into a continuous improvement loop.
The measurement gap in AI security programs
Traditional application security has decades of metrics infrastructure: CVE severity scores, mean time to remediate (MTTR), patch coverage rates, vulnerability age distributions. Security leaders can walk into a board meeting with a dashboard that answers "are we getting better or worse?"
AI red teaming programs, even well-resourced ones, frequently lack any equivalent. A team runs a red team exercise, produces a findings report, developers remediate some issues, and the cycle ends. Six months later another exercise runs. Whether the second exercise reflects a more secure system than the first is often a matter of qualitative judgment rather than measurement.
This is not a minor gap. Without metrics, red teaming cannot drive continuous improvement. It cannot detect regressions when a model is updated. It cannot justify budget to security leadership. And it cannot answer the question that matters most: is our AI system more or less exploitable than it was last quarter?
"A red team without measurement is a cost center," says the Repello AI Research Team. "A red team with measurement is a security capability. The difference is whether you can prove your program is working."
The six metrics that matter
1. Attack success rate (ASR)
Attack success rate is the percentage of adversarial probes that elicit a policy-violating or attacker-intended response from the target model. It is the foundational metric: the closest AI security equivalent to "how many of our vulnerabilities are exploitable?"
ASR should be measured per attack category (prompt injection, jailbreak, data extraction, indirect injection) rather than as a single aggregate. An ASR of 12% means very little in isolation; an ASR of 12% on direct jailbreaks and 34% on indirect prompt injection through retrieved documents tells you exactly where your exposure is concentrated.
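Per-category ASR is straightforward to compute from a probe log. A minimal sketch, assuming a hypothetical log of `(category, succeeded)` records; the category names and results are illustrative, not real assessment data:

```python
from collections import defaultdict

def asr_by_category(probe_results):
    """Attack success rate per attack category.

    probe_results: iterable of (category, succeeded) pairs, one per probe.
    Returns {category: ASR as a fraction in [0, 1]}.
    """
    totals, successes = defaultdict(int), defaultdict(int)
    for category, succeeded in probe_results:
        totals[category] += 1
        successes[category] += int(succeeded)
    return {cat: successes[cat] / totals[cat] for cat in totals}

# Hypothetical probe log
results = [
    ("direct_jailbreak", False), ("direct_jailbreak", True),
    ("indirect_injection", True), ("indirect_injection", True),
    ("indirect_injection", False),
]
print(asr_by_category(results))  # direct_jailbreak 0.5, indirect_injection ~0.67
```

Reporting only the aggregate would hide that, in this sample, indirect injection is materially more exploitable than direct jailbreaks.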
Repello's red team data from comparative assessments shows ASR varying dramatically across model versions and deployment configurations. In testing documented in the Claude jailbreaking authority post, breach rates differed by more than 20 percentage points between model families under identical attack conditions. That kind of variance is only visible if you are measuring ASR consistently.
2. Guardrail bypass rate
Guardrail bypass rate measures how often adversarial inputs successfully evade a deployed safety layer to reach the underlying model. It is distinct from ASR: bypass rate captures whether the detection layer fires, not whether the model ultimately produces a harmful output. A probe that bypasses the guardrail may still fail against the model's own refusal behavior, and a guardrail with high false negative rates on certain attack classes will pass harmful probes through untouched.
Track guardrail bypass rate separately from ASR to isolate whether your detection layer or your model's underlying behavior is the source of exposure. If bypass rate is low but ASR on bypassed inputs is high, your guardrail architecture is your primary control but the model itself is highly exploitable when that control fails. If bypass rate is high, your guardrail has coverage gaps that need addressing before the model's behavior matters.
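The decomposition described above can be expressed directly: bypass rate over all probes, and ASR conditioned on the probes that got through. A sketch with hypothetical field names:

```python
def guardrail_metrics(probes):
    """Split exposure into two factors: does the guardrail fire, and
    does the model misbehave when it does not?

    probes: list of dicts with boolean 'bypassed_guardrail' and
    'harmful_output' fields (field names are illustrative).
    Returns (bypass_rate, asr_given_bypass).
    """
    bypassed = [p for p in probes if p["bypassed_guardrail"]]
    bypass_rate = len(bypassed) / len(probes) if probes else 0.0
    harmful = sum(p["harmful_output"] for p in bypassed)
    asr_given_bypass = harmful / len(bypassed) if bypassed else 0.0
    return bypass_rate, asr_given_bypass
```

A low bypass rate paired with a high conditional ASR points at the first failure mode described above: the guardrail is carrying the deployment.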
3. Mean time to detection (MTTD) for novel attacks
In traditional security, MTTD measures how long an active intrusion goes undetected. In AI security, the equivalent question is: how long does it take for a new attack technique to be tested against your system after it becomes known in the research community?
Repello's zero-day collapse analysis documented that the time between public disclosure of a new LLM attack technique and adversarial exploitation in the wild has collapsed from months to days. An AI security program that red teams quarterly is operating with a de facto MTTD measured in weeks. Organizations that accept this posture should document it as a known risk, not treat it as the default.
MTTD in AI security is a function of red team cadence and automation coverage. A program that runs automated probes continuously against a deployed system has an MTTD measured in hours; one that runs manual exercises quarterly has an MTTD measured in weeks.
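Measured this way, MTTD reduces to the mean gap between a technique's public disclosure and the first probe run that exercised it against your deployment. A sketch with made-up dates:

```python
from datetime import date

def mean_time_to_detection(events):
    """Mean days between public disclosure of an attack technique and
    the first probe run testing it against the deployment.

    events: list of (disclosed, first_tested) date pairs.
    """
    deltas = [(tested - disclosed).days for disclosed, tested in events]
    return sum(deltas) / len(deltas)

# Hypothetical timeline: one technique caught by continuous probing,
# one that waited for the next quarterly exercise
events = [
    (date(2025, 1, 3), date(2025, 1, 5)),    # 2 days
    (date(2025, 2, 10), date(2025, 3, 24)),  # 42 days
]
print(mean_time_to_detection(events))  # 22.0
```

A single slow case dominates the mean, which is the point: one technique left untested for a quarter erases the benefit of catching the others quickly.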
4. Coverage completeness against a defined threat model
Coverage completeness measures what percentage of your defined attack surface has been actively probed. It requires a threat model: a documented enumeration of the attack categories relevant to your deployment. The OWASP LLM Top 10 (2025) provides a standard starting taxonomy of ten risk categories. MITRE ATLAS provides a more granular attack technique library with over 80 documented ML-specific adversarial techniques mapped to mitigations.
For each category in your threat model, the question is binary: have we run adversarial probes against this category in the last [cadence period]? Coverage completeness is the percentage of categories with active probe coverage. A program that thoroughly tests prompt injection but has never probed its RAG ingestion pipeline for poisoning, or its system prompt for extraction, has significant coverage gaps that ASR alone will not reveal.
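Because the per-category question is binary, coverage completeness reduces to a set intersection. A sketch using a hypothetical threat model with category names loosely drawn from the OWASP LLM Top 10 (illustrative, not the full taxonomy):

```python
def coverage_completeness(threat_model, probed):
    """Fraction of threat-model categories with active probe coverage
    in the current cadence period."""
    return len(threat_model & probed) / len(threat_model)

# Illustrative category names
threat_model = {"prompt_injection", "insecure_output_handling",
                "data_poisoning", "system_prompt_leakage",
                "excessive_agency"}
probed = {"prompt_injection", "insecure_output_handling",
          "system_prompt_leakage"}
print(coverage_completeness(threat_model, probed))  # 0.6
```

The two uncovered categories here, data poisoning and excessive agency, are exactly the unaudited attack surface that ASR alone would not reveal.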
5. False positive rate in production guardrails
A guardrail that blocks everything is not secure; it is broken. False positive rate measures what percentage of legitimate, benign inputs are incorrectly flagged or blocked by safety controls. This metric belongs in the red team framework because overly aggressive guardrails are a form of security failure: they degrade the product, drive workarounds, and create pressure to relax controls.
Track false positive rate alongside guardrail bypass rate. The two metrics are in tension by design. As you tighten guardrails to reduce bypass rate, false positive rate tends to increase. Understanding where your deployment sits on this tradeoff, and whether it is shifting over time, requires measuring both.
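Tracking the tradeoff means computing both rates over the same reporting period: FPR from benign production traffic, bypass rate from adversarial probes. A sketch with made-up counts:

```python
def guardrail_tradeoff(benign_blocked, benign_total, adv_passed, adv_total):
    """False positive rate on benign traffic alongside bypass rate on
    adversarial traffic; tightening a guardrail typically moves these
    two numbers in opposite directions."""
    return benign_blocked / benign_total, adv_passed / adv_total

# Hypothetical counts: a week of production traffic plus a probe battery
fpr, bypass = guardrail_tradeoff(benign_blocked=120, benign_total=10_000,
                                 adv_passed=45, adv_total=500)
print(f"FPR {fpr:.1%}, bypass rate {bypass:.1%}")  # FPR 1.2%, bypass rate 9.0%
```

Plotting these two numbers per reporting period makes the tradeoff curve, and any drift along it, visible at a glance.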
6. Cross-version regression delta
Every model update is a potential security regression. A fine-tuning run that improves helpfulness scores may simultaneously relax safety boundaries. A system prompt change that fixes one attack vector may open another. Cross-version regression delta measures the change in ASR and bypass rate between model versions or deployment configuration changes.
This metric requires a fixed probe set: a standardized battery of adversarial inputs that is run against every version of the system and whose results are compared. Without a fixed probe set, comparing two red team exercises is like comparing two penetration tests run by different teams with different scopes. The results are not comparable and regression is invisible.
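Given per-category ASR from two runs of the same fixed probe set, the regression delta is a per-category subtraction. A sketch with illustrative ASR snapshots for two deployment versions:

```python
def regression_delta(baseline, candidate):
    """Per-category ASR change between two runs of the SAME fixed
    probe set. Positive delta means the candidate version regressed."""
    return {cat: round(candidate[cat] - baseline[cat], 4)
            for cat in baseline if cat in candidate}

# Illustrative ASR snapshots, not real measurements
v1 = {"direct_jailbreak": 0.18, "indirect_injection": 0.08}
v2 = {"direct_jailbreak": 0.11, "indirect_injection": 0.14}
print(regression_delta(v1, v2))
# {'direct_jailbreak': -0.07, 'indirect_injection': 0.06}
```

The mixed result here, improvement on one category and regression on another, is the typical shape of a real model update, and it is invisible without the fixed probe set.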
Building a baseline: what to measure first
For programs that currently have no measurement infrastructure, the highest-value starting point is ASR across the OWASP LLM Top 10 categories. Run a structured probe battery against your current production deployment, record the results, and treat that as your baseline.
Every subsequent exercise is then answerable in terms of delta: "our direct jailbreak ASR decreased from 18% to 11%; our indirect injection ASR increased from 8% to 14% following the RAG pipeline expansion." That framing turns red teaming outputs into engineering inputs.
The baseline probe set should be versioned and maintained alongside the model itself. When the model is updated, the probe set runs automatically. When new attack categories emerge, they are added to the probe set and the coverage completeness metric updates accordingly. This is the architecture of a continuous red teaming program, and it aligns with the NIST AI Risk Management Framework's Measure function, which explicitly calls for ongoing adversarial testing as a core component of AI risk governance. It is also the architecture the essential guide to AI red teaming recommends as a maturity target.
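One way to wire this into a release pipeline is a gate that blocks a model update when any category's ASR worsens beyond a tolerance. A minimal sketch; the function name and threshold are illustrative, not a recommendation:

```python
def gate_release(baseline, candidate, max_regression=0.02):
    """Return the categories whose ASR worsened by more than
    max_regression; an empty dict means the update passes the gate."""
    return {cat: round(candidate[cat] - baseline[cat], 4)
            for cat in baseline
            if cat in candidate
            and candidate[cat] - baseline[cat] > max_regression}

# Hypothetical ASR snapshots from two runs of the fixed probe set
baseline = {"prompt_injection": 0.10, "jailbreak": 0.12}
candidate = {"prompt_injection": 0.19, "jailbreak": 0.11}
print(gate_release(baseline, candidate))
# {'prompt_injection': 0.09} -> block the release
```

Run as a CI step on every model or prompt change, a check like this turns the regression delta from a report artifact into an enforced release criterion.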
How ARTEMIS makes red teaming measurable
Manual red teaming produces findings. ARTEMIS, Repello's automated red teaming engine, produces measurements.
ARTEMIS runs structured probe batteries against AI deployments on a continuous or scheduled basis, recording ASR, bypass rate, and coverage completeness per session. Results are compared against prior runs automatically, surfacing regression deltas without requiring a human analyst to manually diff two reports. When a new attack technique is added to the probe library, it is immediately applied against all enrolled deployments and the coverage completeness metric updates in real time.
For security teams that need to demonstrate program effectiveness to leadership, ARTEMIS produces trend dashboards that show ASR and bypass rate over time, by attack category, across model versions. That is the artifact that answers "is our AI security posture improving?" with something more defensible than qualitative judgment.
The LLM pentesting checklist and tools guide covers the manual methodology that complements ARTEMIS's automated coverage: certain attack classes, particularly social engineering chains and multi-turn manipulation scenarios, still benefit from human adversarial creativity. The highest-maturity programs combine both.
Frequently asked questions
What is attack success rate in AI red teaming?
Attack success rate (ASR) is the percentage of adversarial probes run against an AI system that successfully elicit a policy-violating or attacker-intended response. It is measured per attack category (prompt injection, jailbreak, data extraction, indirect injection) rather than as a single aggregate, so that exposure can be attributed to specific vulnerability classes. ASR is the primary output metric of any structured AI red teaming exercise.
How often should you run AI red teaming exercises?
Cadence should match the pace at which your attack surface changes. Any model update, system prompt change, RAG knowledge base expansion, or new tool integration is a potential regression that warrants a targeted probe run. For stable deployments, a continuous automated baseline probe cadence with monthly manual red team exercises covering novel attack techniques provides adequate coverage. Quarterly-only red teaming is insufficient given that mean time between public disclosure of new LLM attack techniques and active exploitation has collapsed to days.
What is a guardrail bypass rate?
Guardrail bypass rate measures the percentage of adversarial inputs that successfully evade a deployed safety layer and reach the underlying model without triggering a block or flag. It is distinct from attack success rate: bypass rate measures whether your detection controls fire correctly; ASR measures whether the model produces a harmful output when those controls fail. Both metrics are required to understand your full exposure.
What is cross-version regression testing in AI security?
Cross-version regression testing runs a fixed set of adversarial probes against successive versions of a deployed AI system and compares the results. It answers whether a model update, fine-tuning run, or configuration change has introduced new exploitable behaviors or relaxed existing safety boundaries. Without a versioned, fixed probe set, regression is invisible: two red team exercises with different scopes cannot be meaningfully compared.
How do you measure AI red teaming coverage?
Coverage completeness is measured against a defined threat model. For most enterprise AI deployments, the OWASP LLM Top 10 (2025) provides a standard baseline taxonomy of ten risk categories. Coverage completeness is the percentage of those categories for which active adversarial probes have been run within the measurement period. A program with 100% coverage completeness has tested its system against every category in its threat model; gaps indicate unaudited attack surface.
What is the difference between AI red teaming and AI penetration testing?
AI red teaming is an ongoing adversarial evaluation program designed to surface exploitable behaviors across a wide range of attack categories. AI penetration testing typically refers to a scoped, time-bounded engagement focused on finding specific exploitable vulnerabilities in a target system. In practice, red teaming produces the measurement framework (ASR baselines, coverage completeness scores, regression deltas) while pentesting produces specific exploitability findings. Both are necessary components of a mature AI security program.