What Is Breach and Attack Simulation (BAS) for AI Systems?

Aryaman Behera, Co-Founder and CEO of Repello AI | 8 min read

TL;DR: Breach and attack simulation automates adversarial testing to continuously validate security controls, map MITRE ATT&CK coverage, and find gaps without the cost and calendar constraints of a full red team engagement. Traditional BAS platforms do this well for network and endpoint infrastructure. They were not designed for LLMs, RAG pipelines, or agentic tool layers, and they show it: a BAS tool that fires CVE-mapped exploits at an inference endpoint is testing the transport layer, not the model. BAS for AI requires test coverage across five distinct attack surfaces: input/output manipulation, retrieval layer poisoning, agentic and tool-call hijacking, model-layer attacks, and runtime guardrail evasion. BAS for AI is not a feature toggle. It's a different discipline.

Gartner coined the term "breach and attack simulation" to describe a category of platforms that automated what red teams had been doing manually: running adversarial scenarios against an organization's controls to measure whether they actually block what they claim to block. By 2024, BAS had become a standard component of enterprise security programs, running continuously against network segments, endpoint configurations, and identity infrastructure. Then enterprise AI deployments scaled, and the same security teams that ran BAS against their firewalls and EDR tools asked the obvious question: can we run this against our LLMs too?

Most existing BAS platforms answered "yes." The more accurate answer is "not in any way that matters." Testing an LLM deployment with a platform built to validate NGFW rules and EDR coverage misses nearly every attack surface that makes AI systems dangerous. The attack surface is different. The failure modes are different. The test methodology has to be different.

What Is Breach and Attack Simulation?

Breach and attack simulation is the practice of running automated, repeatable adversarial scenarios against a production or staging environment to validate that security controls behave as configured. Unlike a penetration test, which is point-in-time and human-led, BAS runs continuously, producing a current-state view of control coverage rather than a snapshot from the last time a consultant was on-site.

Three use cases define the traditional BAS category:

Threat control validation. Run a known attack technique against a control and verify that the control fires. If a next-gen firewall claims to block lateral movement via SMB exploitation, run an SMB exploit against the network and confirm the block. If it does not block it, the control is misconfigured or ineffective, regardless of what the vendor dashboard says.

MITRE ATT&CK coverage mapping. MITRE ATT&CK documents over 600 adversary techniques across 14 tactic categories. BAS platforms map their test scenarios to ATT&CK technique IDs, so security teams can see which techniques their current control stack covers, which it misses, and which it detects but does not block. This gives security programs an objective, externally validated coverage metric rather than a vendor-supplied compliance checkbox.

Red team augmentation. Red team engagements are expensive, time-intensive, and by definition periodic. BAS fills the between-engagement gap: it does not replace the human judgment and novel attack chain discovery of a skilled red team, but it ensures that the controls validated in the last engagement have not decayed in the weeks or months since it concluded.

Classic BAS targets network infrastructure, endpoint configurations, cloud IAM policies, and identity provider integrations. It is built around the assumption that the thing being tested has deterministic behavior: a firewall either blocks port 445 or it does not. That assumption breaks completely when the thing being tested is a language model.
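That deterministic assumption can be made concrete with a minimal control check — a hypothetical sketch, not a production BAS probe: the TCP connection either succeeds or it does not, and one trial fully answers the question.

```python
import socket

def port_blocked(host: str, port: int, timeout: float = 2.0) -> bool:
    """Deterministic control check of the kind classic BAS is built on:
    the TCP connect either succeeds or it does not."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return False  # connection accepted: the control did not block
    except OSError:
        return True       # refused or timed out: blocked (or host unreachable)
```

One trial yields a trustworthy answer here; as the next section argues, one trial against a language model yields almost nothing.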

Why Traditional BAS Does Not Cover AI Attack Surfaces

Four structural gaps explain why porting a traditional BAS platform into an AI security context produces coverage that looks comprehensive and is not.

Attack surface mismatch. Traditional BAS tests code paths and network services. AI systems fail through model behavior, not code logic. An LLM does not have a buffer overflow. It does not have an open port serving an unpatched service. A BAS tool that maps its scenarios to CVE IDs and MITRE ATT&CK technique codes has no test cases for the failure modes that actually matter: a model that leaks system prompt contents when asked to summarize a long document, or an agent that exfiltrates credentials because a retrieved web page told it to. The attack surface is the model's behavior in context, not the infrastructure around it.

No prompt injection coverage. Prompt injection has no equivalent in traditional application security. There is no CVE for it. There is no firewall rule that blocks it. OWASP classifies prompt injection as the top risk in LLM deployments precisely because it exploits the model's core function, treating user-supplied or retrieved content as trusted instructions. A BAS platform built on CVE-mapped technique libraries has no test cases for indirect prompt injection through a poisoned RAG document, for tool-call hijacking through an attacker-controlled MCP server response, or for cross-agent injection in a multi-agent orchestration pipeline. These are not edge cases. They are the primary attack class in deployed AI systems.

Probabilistic failure modes. Traditional BAS expects binary outcomes: a control blocks an attack or it does not. LLM vulnerabilities are probabilistic. A jailbreak technique may succeed on 15% of attempts against a given model and configuration. A guardrail bypass may work consistently when the model's context window is full and fail reliably when it is empty. A BAS platform that runs a prompt injection test once and records "no impact" has not validated the control. It has run one trial of a probabilistic experiment and reported the outcome of that trial. Statistically meaningful coverage requires running each test scenario across hundreds of variations and reporting attack success rate (ASR) as a distribution, not a binary flag.

Agentic complexity. Most existing BAS tools test AI systems by sending requests to a chat API endpoint and evaluating the response. That test surface covers roughly 10% of the actual attack surface in an agentic deployment. An agent with file system access, email tool integration, and web browsing capability has an attack surface that includes every data source it can read, every system it can write to, and every external service it can call. Research from the University of Illinois Urbana-Champaign found that LLM agents can autonomously exploit real-world vulnerabilities with an 87% success rate when given access to tools without sufficient controls. Testing the chat API endpoint misses all of that.
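The probabilistic-failure point above can be sketched in code. This is a hypothetical harness, not any named platform's implementation: `run_probe` stands in for whatever sends an attack variant to the deployment and judges the outcome, and the result is an attack success rate with a confidence interval rather than a pass/fail bit.

```python
import math

def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion."""
    if trials == 0:
        return (0.0, 0.0)
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2))
    return (max(0.0, center - margin), min(1.0, center + margin))

def attack_success_rate(run_probe, variants, trials_per_variant: int = 10) -> dict:
    """Run every attack variant repeatedly and report ASR as a distribution,
    never as the outcome of a single trial."""
    successes = trials = 0
    for v in variants:
        for _ in range(trials_per_variant):
            trials += 1
            if run_probe(v):  # True = the attack landed on this trial
                successes += 1
    return {"asr": successes / trials, "ci95": wilson_interval(successes, trials),
            "trials": trials}
```

A "no impact" verdict is only meaningful when the interval around the measured ASR is narrow enough to exclude the success rates you care about.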

What BAS for AI Systems Actually Needs to Test

An AI BAS platform needs to cover five distinct attack surfaces. The LLM pentesting methodology maps these directly; a BAS platform needs to automate that methodology continuously.

Input/output layer

This is the layer traditional BAS tools partially address, and even here the coverage is usually shallow. A complete test of the input/output layer covers: direct prompt injection (system prompt override attempts, role reassignment, jailbreak variants across encoding, language, and framing); Unicode-based bypasses using variation selectors, homoglyphs, and zero-width characters to evade keyword filters; multi-turn manipulation sequences that establish false context over several conversational turns before issuing the malicious instruction; and output manipulation that induces the model to return harmful content, PII, or confidential system instructions.

BAS for this layer requires not just a fixed library of known jailbreak prompts but a generation mechanism that creates novel variants. Attack patterns observed in the wild evolve continuously, and a static library of 2024 jailbreak techniques provides increasingly thin coverage against 2026 deployments.
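Two of the simplest variant generators above can be sketched directly — a minimal illustration, with a deliberately naive `naive_filter` standing in for the kind of keyword blocklist these variants are built to evade:

```python
ZW = "\u200b"  # zero-width space
HOMOGLYPHS = {"a": "\u0430", "e": "\u0435", "o": "\u043e"}  # Cyrillic look-alikes

def zero_width_variant(text: str) -> str:
    """Insert a zero-width space between every character; rendering is unchanged."""
    return ZW.join(text)

def homoglyph_variant(text: str) -> str:
    """Swap Latin letters for visually identical Cyrillic ones."""
    return "".join(HOMOGLYPHS.get(c, c) for c in text)

def naive_filter(text: str, blocklist=("ignore previous",)) -> bool:
    """A substring blocklist of the kind these variants are built to evade."""
    return any(term in text.lower() for term in blocklist)
```

Both transforms leave the probe visually identical to a human reader while defeating the substring match — which is why filters need Unicode normalization before matching, and why a BAS library needs to generate such variants rather than enumerate them.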

Retrieval layer

RAG pipelines introduce an entirely new attack surface that has no equivalent in traditional BAS. RAG poisoning works by introducing adversarial content into the documents or knowledge base an LLM retrieves at query time: the model's output then reflects the poisoned content as if it were authoritative. Test scenarios for this layer include knowledge base manipulation (inserting adversarial documents that contain instruction-format text designed to redirect the model), embedding space attacks (crafting inputs that produce retrieval of unintended documents by manipulating semantic similarity scores), and cross-document injection (placing a malicious instruction in document A such that it executes when the model is responding to a question about document B).

A BAS platform that does not have access to the retrieval pipeline cannot test this layer at all. Testing only the model API misses the entire RAG attack surface.

Agentic and tool-call layer

This is the attack surface that grows fastest as enterprises expand AI deployments. MCP prompt injection works by placing attacker-controlled instructions inside the response that an MCP server returns to an agent: the agent reads the tool response, interprets the embedded instructions as legitimate, and executes actions the operator never intended. BAS test scenarios for this layer need to cover: tool call hijacking through malicious tool response content; cross-agent injection in multi-agent pipelines, where a compromised subordinate agent passes attacker instructions to a privileged orchestrator; indirect injection through any external data source the agent reads (email, calendar, documents, web pages); and permission escalation attempts that try to get an agent to access resources outside its defined scope.

"Agentic AI security testing requires running the full deployed application stack, not just the model. The attack surface is the agent's behavior across a session, including the tools it calls, the data it retrieves, and the actions it takes in downstream systems," said the Repello AI Research Team. "A BAS platform that tests the inference endpoint in isolation is missing the majority of the risk."

Model layer

Model-layer attacks are less common in production deployments today but represent a growing threat as more organizations use fine-tuned models. Fine-tuning backdoors work by embedding a trigger pattern in training data: when the trigger appears at inference time, the model's behavior deviates from expected in a predictable and attacker-controlled way. Training data extraction attacks attempt to reconstruct memorized training data from model outputs, potentially recovering PII, proprietary code, or other sensitive content that was present in the training set. Membership inference attacks determine whether specific data records were in the training set, enabling targeted follow-on attacks.

BAS scenarios for this layer typically require access to the model checkpoint or fine-tuning pipeline, which means they are most relevant for teams operating their own model infrastructure rather than calling a hosted API.

Runtime layer

The runtime layer covers the guardrail and safety infrastructure that sits between the user and the model. Repello's research on Meta's Prompt Guard documented that a well-resourced guardrail failed systematically against attacks outside its training distribution, not because the guardrail was poorly implemented but because a static classifier cannot generalize to attack classes it was not trained to detect. BAS test scenarios for the runtime layer include: guardrail evasion through rephrasing, encoding, or multilingual substitution; context window manipulation that exploits position bias to cause safety classifiers to mislabel a prompt; and denial-of-wallet attacks that trigger expensive inference operations at scale, producing financial impact without requiring any safety violation.
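A guardrail-evasion sweep can be sketched as a transform table run against the classifier under test. This is illustrative only: `classify` stands in for whatever guardrail the deployment uses, and the transforms are a tiny sample of the rephrasing and encoding space a real sweep would cover.

```python
import base64

TRANSFORMS = {
    "identity": lambda t: t,
    "base64": lambda t: base64.b64encode(t.encode()).decode(),
    "leet": lambda t: t.translate(str.maketrans("aeio", "4310")),
    "spaced": lambda t: " ".join(t),
}

def guardrail_sweep(classify, probe: str) -> dict:
    """Run one probe through each evasion transform and record, per transform,
    whether the guardrail still flags it. `classify` returns True when blocked."""
    return {name: classify(fn(probe)) for name, fn in TRANSFORMS.items()}
```

Any transform that flips the verdict from blocked to allowed is a concrete coverage gap in the runtime layer — exactly the out-of-distribution failure the Prompt Guard research documented.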

BAS vs. Red Teaming for AI: What's the Difference?

BAS and red teaming answer different questions. BAS answers: "Are my controls working right now, across the full breadth of known attack techniques?" Red teaming answers: "What can a skilled adversary do to my system that I have not anticipated?" Both questions matter. Neither answer substitutes for the other.

| Dimension | BAS for AI | AI Red Teaming |
| --- | --- | --- |
| Frequency | Continuous; runs after every model update or config change | Periodic: pre-launch, post-update, quarterly |
| Coverage method | Automated scenarios across known attack taxonomy | Human-driven novel attack chain discovery |
| Human involvement | Minimal: reviewed after automated runs | Central: red teamers design and execute |
| Output type | ASR metrics, coverage gaps, regression signals | Attack narratives, novel chains, blast radius mapping |
| Best for | Validating controls haven't degraded, tracking coverage over time | Finding the attack classes your current controls don't know to look for |
| Limitations | Cannot discover novel attacks outside existing taxonomy | High cost, periodic coverage only, not continuous |

Most enterprise AI security programs need both. BAS provides the continuous breadth coverage that ensures controls do not silently degrade between red team engagements. Red teaming provides the depth and adversarial creativity that discovers attack chains BAS has no test cases for yet. The same logic applies in traditional security: organizations run continuous BAS and periodic penetration tests because each covers the blind spots of the other.

What to Look for in an AI BAS Solution

Six evaluation criteria distinguish platforms that actually cover the AI attack surface from those that adapted an existing product with a marketing layer:

Does it test the full deployed stack or just the model API in isolation? A platform that sends prompts to an API endpoint and evaluates responses is testing the model, not the deployment. The RAG pipeline, the tool integrations, the agentic workflow, and the guardrail stack are all part of the attack surface. Any platform that cannot be pointed at a live deployed application rather than a model endpoint is not testing what attackers will target.

Does it generate attack variations automatically? A fixed library of known jailbreak techniques provides diminishing returns as defenses adapt. An AI BAS platform needs an attack generation layer that produces novel variants from known techniques, covering encoding variations, linguistic reformulations, and multi-turn sequences that a static library does not include.

Does it cover the OWASP LLM Top 10 systematically? The OWASP LLM Top 10 provides an independent, publicly maintained taxonomy of LLM risk categories. A platform that cannot map its test coverage to specific OWASP LLM Top 10 entries gives you no external benchmark for completeness. Ask specifically for coverage evidence against LLM01 (prompt injection), LLM03 (supply chain), LLM06 (excessive agency), and LLM08 (vector and embedding weaknesses).

Does it handle multi-turn attacks? Single-prompt probes miss a large class of real attacks. Gradual goal hijacking, multi-turn jailbreaking, and context manipulation attacks all operate over multiple conversational turns. A BAS platform that evaluates only single-turn interactions will not catch the attack classes that rely on accumulated context to work.

Does it produce findings by blast radius and exploitability, not just binary pass/fail? "Jailbreak succeeded" is not an actionable finding. "Jailbreak succeeded on 23% of attempts; combined with tool-call access, this produces a credential exfiltration path with confirmed file system write capability" is an actionable finding. Severity scoring needs to reflect what the attacker can actually do with a successful exploit, not just whether the exploit landed.

Does it run continuously against production, or only in a staging environment? AI model behavior can change without a deployment event: RAG knowledge bases update, fine-tuned model versions roll out, prompt templates change, context window sizes shift. A BAS platform that only runs in staging misses all of these. Continuous production testing with safety guardrails on the test traffic itself is the standard for mature AI security programs.
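The blast-radius scoring described in the fifth criterion can be sketched minimally. The weights and bands below are illustrative, not a standard: the point is only that severity scales the measured success rate by what a successful exploit can actually reach.

```python
# Illustrative blast-radius weights — not a standardized scoring scheme.
BLAST_RADIUS = {
    "model_output_only": 1,  # harmful text, no system access
    "data_read": 3,          # can read PII, documents, or credentials
    "tool_execution": 5,     # can write files, send email, call external APIs
}

def severity(asr: float, radius: str) -> dict:
    """Combine attack success rate with blast radius into a 0-10 severity score."""
    score = round(asr * BLAST_RADIUS[radius] * 2, 1)
    band = ("critical" if score >= 7 else
            "high" if score >= 4 else
            "medium" if score >= 2 else "low")
    return {"score": score, "band": band}
```

Under this scheme, the 23%-ASR jailbreak with confirmed tool execution outranks a 90%-ASR jailbreak that can only produce harmful text — which is the ordering a remediation queue actually needs.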

How ARTEMIS Approaches AI Breach and Attack Simulation

ARTEMIS is Repello's automated AI red teaming engine, built to cover all five attack surfaces described above against the live deployed application stack rather than an isolated model endpoint. It does not adapt a traditional BAS platform for AI. It was built specifically for the AI attack surface.

ARTEMIS generates context-specific attack scenarios tailored to the application being tested: a customer service agent gets a different attack set than a code generation assistant, because the blast radius of a successful attack is different and the relevant attack classes differ. Its 15M+ evolving attack patterns span the OWASP LLM Top 10, NIST AI RMF, and MITRE ATLAS adversarial ML taxonomy, and they are updated as new attack classes emerge rather than remaining static.

ARTEMIS Browser Mode extends test coverage to AI agents that interact with web interfaces: it simulates the full user interaction loop, including the tool calls and external data retrievals that form the actual attack surface of a deployed agentic system. This is the attack surface no traditional BAS platform addresses because no traditional BAS platform was designed to test an agent that reads email, browses the web, and executes code.

Outputs are prioritized by blast radius and exploitability, mapped to compliance frameworks, and structured for security engineering teams to act on directly rather than requiring translation from a red team narrative.

For teams evaluating AI BAS options: the useful comparison is not feature parity against traditional BAS platforms. It is coverage across the five attack surfaces above, and whether the platform produces findings that translate to specific remediations in the current deployment.

See how ARTEMIS covers the AI breach and attack simulation use case.

Frequently Asked Questions

What is breach and attack simulation in cybersecurity? Breach and attack simulation (BAS) is a category of security tooling that automates adversarial testing to continuously validate whether security controls behave as configured. It runs attack scenarios from a known technique library against a production or staging environment, maps results to frameworks like MITRE ATT&CK, and surfaces gaps in control coverage. Gartner coined the term and has tracked it as a distinct product category since 2017. Traditional BAS targets network, endpoint, and identity infrastructure. AI-specific BAS extends the methodology to cover LLM vulnerabilities, RAG pipelines, agentic tool layers, and model-level attacks that classical BAS platforms were not designed to test.

What's the difference between BAS and penetration testing? The primary difference is continuity. Penetration testing is point-in-time: a team of security engineers attacks the system for a defined engagement period, produces a report, and leaves. BAS is continuous: automated scenarios run after every configuration change, model update, or new deployment. Penetration testing provides depth and novel attack chain discovery that automated tools cannot replicate. BAS provides continuous breadth coverage that ensures control effectiveness has not degraded between engagements. Mature security programs use both: BAS to maintain current-state control coverage visibility, penetration testing to discover what the BAS library does not yet know to test for.

Can existing BAS tools test AI and LLM applications? Not adequately. Traditional BAS platforms are built around CVE-mapped attack techniques targeting deterministic systems. LLM vulnerabilities are probabilistic, behavior-based, and require test methodologies that do not map to standard CVE libraries: prompt injection, RAG poisoning, jailbreaking, and agentic tool hijacking have no equivalent in network or endpoint security. Some traditional BAS vendors have added AI-related test modules, but these typically cover the chat API endpoint in isolation, missing the RAG pipeline, tool-call layer, and multi-agent orchestration surfaces where the majority of real AI attack paths operate.

What's the difference between BAS for AI and LLM red teaming? LLM red teaming is human-led, periodic, and focused on discovering novel attack chains that the current test library does not include. BAS for AI is automated, continuous, and focused on validating coverage across known attack techniques after every change. Red teaming produces depth: it finds the attack class you did not know to look for. BAS produces breadth: it verifies that the controls your red team validated last quarter have not silently decayed. Both are necessary components of an enterprise AI security program; neither replaces the other.

How often should you run BAS against an AI system? Continuously, not periodically. AI systems change in ways that reset the attack surface without any explicit deployment event: a RAG knowledge base update can introduce poisoned documents; a prompt template change can create a new injection vector; a fine-tuning run can alter safety behavior in ways that existing guardrails no longer cover. A BAS cycle triggered only by deployment events misses all of these. The NIST AI Risk Management Framework explicitly requires continuous monitoring of AI systems in production, treating AI security as an ongoing operational function rather than a pre-launch gate. Continuous BAS is the mechanism that makes that requirement operationally feasible at scale.

Share on LinkedIn

Sign up for Repello updates
Subscribe to our newsletter to receive the latest insights on AI security, red teaming research, and product updates in your inbox.

8 The Green, Ste A
Dover, DE 19901, United States of America

AICPA SOC 2 certified badge
ISO 27001 Information Security Management certified badge

© Repello Inc. All rights reserved.
