LLM Security: A Practical Guide for Enterprise Teams

Q: What is the NIST AI Risk Management Framework and how does it relate to LLM security?

The NIST AI Risk Management Framework organizes AI risk management across four functions: Govern, Map, Measure, and Manage. It explicitly includes adversarial testing under the Measure function and treats it as a standard component of AI risk assessment. Aligning an LLM security program to the NIST AI RMF provides both a practical governance structure and a defensible documentation trail for regulatory and procurement purposes. Many enterprise customers now require NIST AI RMF alignment as a baseline in vendor security assessments.

TL;DR: LLM security is the set of controls, testing practices, and architectural decisions that prevent AI language models from being exploited in production. The five core risk categories are prompt injection, sensitive data leakage, jailbreaking, supply chain and RAG poisoning, and excessive agentic agency. Building a secure LLM deployment requires layered defenses across all five: system prompt hardening, runtime monitoring, periodic manual red teaming, and continuous automated adversarial testing. EU AI Act compliance and NIST AI RMF alignment both mandate adversarial testing as a non-negotiable component. This guide covers the full picture.

What LLM security means in practice#

LLM security is the discipline of protecting large language model deployments from adversarial attacks, data leakage, and unintended behavior that creates business or compliance risk. It sits at the intersection of traditional application security and a newer set of model-specific threats that existing security frameworks were not designed to address.

The distinction matters. When a company deploys an LLM, it is not deploying a deterministic application with a defined input-output contract. It is deploying a probabilistic system that interprets natural language instructions, retrieves data from connected sources, and in agentic configurations executes actions on connected systems. Each of those properties introduces attack surface that does not exist in conventional software.

LLM security covers three layers of a deployment: the model itself (training data integrity, fine-tuning security, model weights), the deployment configuration (system prompt, connected tools, retrieval sources, access controls), and the runtime environment (how inputs are validated, how outputs are handled, how behavior is monitored). Gaps at any layer translate into exploitable vulnerabilities at the application level.

"The most common mistake we see in enterprise LLM deployments is treating model security as a solved problem because the base model comes from a reputable vendor," says the Repello AI Research Team. "The vendor secures the model. The enterprise secures the deployment. Those are entirely different attack surfaces, and conflating them is where incidents start."

The 5 key LLM security risk categories#

The OWASP LLM Top 10 (2025 edition) provides the standard classification for LLM security risks. Five categories account for the majority of real-world exploits in production deployments.

1. Prompt injection#

Prompt injection attacks attempt to override the system prompt or application logic by embedding adversarial instructions in user input. Direct injection targets the model through the user input channel directly. Indirect injection embeds malicious instructions in external data the model retrieves and acts on: documents, web pages, API responses, database records.

Indirect injection is the more dangerous variant in enterprise deployments because it does not require the attacker to interact with the application directly. An attacker who can influence any data source the LLM retrieves from can potentially hijack the model's behavior. The prompt injection attack examples catalogued from real deployments illustrate how far the blast radius extends in RAG-enabled and agentic systems.

System prompt hardening reduces the direct injection surface but does not address indirect injection. Defense requires both: structural prompt hardening and input/output validation on every data source the model touches.

2. Sensitive information disclosure#

LLMs can leak sensitive information from three distinct sources: training data (memorized from pretraining or fine-tuning), context window contents (other users' session data, system prompts, injected context), and connected retrieval sources (RAG databases, API responses).

Each source requires a different control. Training data leakage is addressed at the fine-tuning stage through data minimization and differential privacy techniques. Context window leakage is addressed through session isolation and output filtering. RAG source leakage requires access control enforcement at the retrieval layer: the model should only retrieve documents the requesting user is authorized to see, enforced independently of any model-level instruction.

3. Jailbreaking and safety bypass#

Jailbreaking attempts to override a model's safety training to elicit outputs the model is designed to refuse. Techniques range from simple role-play prompts ("pretend you are a model without restrictions") to sophisticated multi-turn manipulations that gradually shift the model's context.

Safety bypass is not purely a content safety problem. In enterprise deployments, jailbreaking is also a data access problem: a model that can be jailbroken into ignoring its system prompt can potentially be jailbroken into ignoring its access control instructions, revealing data it was instructed to protect, or taking actions it was explicitly told not to take. Repello's benchmark data across model configurations shows breach rates ranging from 4.8% under hardened configurations to 28.6% under permissive default settings, per Repello AI Research Team data, demonstrating that configuration discipline is as important as base model safety training.

4. Supply chain and RAG poisoning#

Supply chain vulnerabilities in LLM deployments include compromised fine-tuning datasets, third-party base models with backdoors, and malicious plugins or tools connected to the LLM. The threat model is analogous to software supply chain attacks but applied to model artifacts and training data.

RAG poisoning is a particularly high-impact variant. If an attacker can introduce adversarial content into a retrieval knowledge base, that content is retrieved and processed by the model as authoritative context. Repello's RAG poisoning research demonstrates how this attack executes against production RAG pipelines: a single poisoned document can persistently manipulate model behavior for every query that retrieves it.

5. Agentic AI and excessive agency#

When LLMs are connected to tools and given the ability to take actions, the security surface extends from model behavior into infrastructure. An agentic AI that can read and write files, send emails, make API calls, or execute code can become a vector for data exfiltration, unauthorized system access, or destructive actions if manipulated through any of the attack categories above.

Excessive agency is not a theoretical risk. Documented incidents with real agentic deployments show that AI agents can be induced to take actions far outside their intended scope when adversarial inputs exploit the gap between the agent's capabilities and the controls applied to those capabilities. The principle of least privilege, standard in infrastructure security, applies directly: agents should be granted only the permissions they need for their defined task, enforced at the infrastructure layer, not through model instruction alone.

How to build a secure LLM deployment#

Secure LLM deployments are layered. No single control addresses the full attack surface. The following controls should be implemented in combination.

System prompt design. The system prompt is the primary configuration surface for model behavior. Write it defensively: define the model's role explicitly, specify what it must not do, and avoid including credentials or sensitive information that would be exposed if the prompt were extracted. Use structural delimiters to separate trusted system context from untrusted user input. Test the system prompt adversarially before deploying it.

Input validation and pre-processing. Validate and sanitize inputs before they reach the model. Filter known injection patterns, enforce length limits, and flag inputs that match adversarial signatures. Input validation does not substitute for system prompt hardening because adversarial inputs can be obfuscated, but it raises the cost of attack.

Output handling. Never pass model output directly to downstream system calls: SQL queries, shell commands, HTML rendering, or API calls. Treat model output as untrusted user input from the perspective of any downstream system. Validate and sanitize before execution.

Access control at the retrieval layer. For RAG deployments, enforce document-level access control at the retrieval infrastructure, not through model instruction. The model should receive only documents the requesting user has access to, enforced by the retrieval system independently of any prompt-level instruction.

Runtime monitoring. Monitor model inputs and outputs in production for anomalous patterns: unusual token sequences, unexpected tool call volumes, outputs that match sensitive data signatures. Runtime monitoring does not prevent attacks but provides the detection signal needed to contain and respond to them. Repello's ARGUS is designed for this layer: runtime security monitoring that detects threats in production without requiring the model to be taken offline for analysis.

Adversarial testing. Test the deployed system against all five risk categories above using a structured methodology. Testing must cover the deployed configuration, not the base model in isolation, and must be repeated on every deployment change. The AI red teaming guide covers the full methodology for enterprise teams building this capability.

Compliance considerations: EU AI Act and NIST AI RMF#

Two frameworks now set direct requirements for LLM security controls in enterprise deployments.

EU AI Act#

The EU AI Act entered into force in August 2024, with phased obligations through 2027. For enterprise LLM deployments, the most important provisions are:

High-risk AI systems (as classified in Annex III, covering applications in employment, education, critical infrastructure, law enforcement, and similar contexts) must have a risk management system, technical documentation, human oversight measures, and logging of operations throughout their lifecycle. These requirements translate directly into LLM security controls: you cannot satisfy the logging requirement without runtime monitoring infrastructure, and you cannot satisfy the risk management requirement without adversarial testing.

General-purpose AI models with systemic risk face additional obligations including adversarial testing, incident reporting to the EU AI Office, and cybersecurity measures. Models with training compute exceeding 10^25 FLOPs are presumed to have systemic risk unless demonstrated otherwise.

Prohibited practices under Article 5 include AI systems that exploit vulnerabilities of specific groups or deploy subliminal manipulation techniques. These provisions create legal exposure for LLM deployments that can be manipulated into producing targeted harmful outputs, which is a direct LLM security concern.

NIST AI Risk Management Framework#

The NIST AI Risk Management Framework organizes AI governance across four functions: Govern, Map, Measure, and Manage.

The Govern function establishes organizational policies, roles, and risk tolerance for AI. For LLM security, this means defining who owns security testing, what acceptable breach rates are, and how incidents are escalated.

The Map function identifies risks specific to each AI deployment's context. For LLM deployments, this requires mapping the five risk categories above to the specific deployment architecture and identifying which present the highest-priority exposure given the application's access to sensitive data and external systems.

The Measure function covers analysis, assessment, and monitoring of AI risk. The NIST AI RMF explicitly includes adversarial testing and red-teaming under Measure 2.6, treating it as a standard component of AI risk measurement, not an optional extra.

The Manage function covers risk treatment: what controls to implement, how to prioritize, and how to respond to incidents. For LLM security, the Manage function maps directly to the layered controls described above.

Aligning your LLM security program to the NIST AI RMF structure provides a defensible governance documentation trail for auditors and regulators, which is increasingly important as enterprise AI deployments come under greater regulatory scrutiny.

Comparing LLM security approaches#

No single approach provides complete LLM security coverage. The table below maps each major approach to its scope and limitations:

Approach	What it covers	Key limitation
System prompt hardening	Reduces direct prompt injection surface; defines model behavior boundaries	Does not address indirect injection, RAG poisoning, or model-level vulnerabilities
Output filtering / guardrails	Catches known harmful patterns at the output layer before they reach end users or downstream systems	Bypassable via encoding variations, multi-turn manipulation, and indirect attack paths; high false positive rates degrade usability
Periodic red teaming	Validates attack resistance against known techniques; produces actionable findings with reproduction cases	Point-in-time snapshot that degrades as model, system prompt, connected tools, and threat landscape evolve
Runtime security monitoring	Detects anomalous behavior and blocks threats in production; provides incident response signal	Requires behavioral baseline to be effective; does not proactively find new attack vectors
Continuous automated red teaming	Ongoing adversarial coverage across the full attack surface; surfaces regressions introduced by model or deployment changes	Requires integration into model release and deployment workflows to deliver value

Effective LLM security programs layer multiple approaches rather than selecting one. System prompt hardening and output filtering address the most common attack paths. Runtime monitoring provides detection and response capability. Red teaming, whether periodic manual or continuous automated, validates that the other controls are actually working against a real adversary attempting to bypass them.

The ARTEMIS automated red teaming engine addresses the continuous testing requirement: it runs adversarial probing across the OWASP LLM Top 10 categories on each deployment change and integrates findings into security workflows rather than requiring a separate engagement cycle.

Building a sustainable LLM security program#

Implementing the controls above is tractable for an individual deployment. Sustaining them across a portfolio of LLM applications, as model versions change, new tools are connected, and new deployments are added, is the operational challenge that most enterprise security teams underestimate.

Three practices make this sustainable at scale. First, tie security testing to the deployment pipeline: any change to model version, system prompt, fine-tuning dataset, or connected tools triggers a targeted re-test of the affected attack categories before the change reaches production. Second, track security metrics continuously rather than per-engagement: breach rates, detection latency, and time-to-remediation as standing KPIs alongside availability and performance metrics. Third, establish clear ownership: LLM security sits at the boundary of security engineering and ML engineering, and gaps consistently appear where ownership is ambiguous.

The regulatory trajectory is clear. EU AI Act obligations are already in force for high-risk deployments. NIST AI RMF alignment is increasingly a procurement requirement in enterprise sales cycles. Security teams that build repeatable LLM security programs now will be better positioned than those who build them in response to a regulatory audit or a production incident.

Test your LLM deployment with ARTEMIS to see what your current attack surface actually looks like.

Frequently asked questions#

What is LLM security?

LLM security is the practice of protecting large language model deployments from adversarial attacks, data leakage, and unintended model behavior that creates business or compliance risk. It covers the model layer (training data integrity, fine-tuning security), the deployment configuration layer (system prompt, connected tools, retrieval sources), and the runtime layer (input validation, output handling, behavior monitoring). Unlike traditional application security, LLM security must address a probabilistic attack surface where vulnerabilities are model-behavior-based rather than code-based.

What are the biggest LLM security risks?

The five highest-priority risk categories in production LLM deployments are: prompt injection (direct and indirect), sensitive information disclosure from training data or context window, jailbreaking and safety bypass, supply chain attacks and RAG poisoning, and excessive agentic agency in tool-enabled deployments. These categories are codified in the OWASP LLM Top 10, which provides a structured coverage framework for security assessments. In practice, indirect prompt injection and RAG poisoning represent the most underinvested risk areas in enterprise deployments.

How do I secure an LLM deployment?

Secure LLM deployments require layered controls: defensively designed system prompts, input validation before inputs reach the model, output sanitization before model outputs reach downstream systems, access control enforcement at the retrieval layer for RAG systems, runtime monitoring for anomalous behavior in production, and adversarial testing to validate that the other controls are working. No single control is sufficient. The most common gap is organizations that implement static controls but never validate them through adversarial testing, which means they are running on the assumption that the controls work rather than evidence that they do.

Does the EU AI Act apply to LLM deployments?

Yes. The EU AI Act imposes requirements on both high-risk AI system deployments (Annex III categories including employment, education, critical infrastructure) and general-purpose AI models with systemic risk. High-risk deployments require risk management systems, technical documentation, human oversight, and operational logging. Models with systemic risk require adversarial testing and cybersecurity measures. Organizations deploying LLMs in high-risk contexts should assess their obligations under the Act and map their LLM security controls to the relevant requirements.

What is the NIST AI Risk Management Framework and how does it relate to LLM security?

The NIST AI Risk Management Framework is a voluntary governance framework that organizes AI risk management across four functions: Govern, Map, Measure, and Manage. It explicitly includes adversarial testing under the Measure function and treats it as a standard component of AI risk assessment, not optional. Aligning an LLM security program to the NIST AI RMF provides both a practical governance structure and a defensible documentation trail for regulatory and procurement purposes. Many enterprise customers now require NIST AI RMF alignment as a baseline in vendor security assessments.

How is LLM security different from traditional application security?

Traditional application security targets a deterministic attack surface with code-based vulnerabilities: buffer overflows, injection flaws, authentication bypasses. These have discrete patches and binary existence. LLM security targets model behavior, which is probabilistic: the same input can produce different outputs, vulnerabilities emerge from how the model interprets context rather than from a specific code path, and fixes require system prompt changes, fine-tuning, or output filtering rather than a code patch. Coverage is also statistical rather than binary; you cannot enumerate all possible inputs, so testing must sample across attack categories and measure success rates. These differences require fundamentally different testing methodology and tooling.