Back to all blogs

|
|
9 min read


TL;DR: LLM pentesting covers five distinct attack surfaces: input/output, retrieval, agentic/tool-call, model layer, and runtime. A complete pentest requires a 5-phase methodology, a 30-item surface-specific checklist, and tools matched to your deployment architecture. This guide covers all three, plus a comparison table and decision framework for choosing the right toolset.
What LLM Pentesting Actually Tests
Standard application pentesting targets a well-defined codebase. LLM pentesting is different: the attack surface shifts with every model update, every new tool integration, and every change to the system prompt. You are testing a probabilistic system, not a deterministic one.
The five surfaces every LLM pentest must cover:
Input/output layer: prompt injection, jailbreaking, encoding-based bypasses (Unicode variation selectors, Base64, token smuggling), output manipulation.
Retrieval layer: RAG poisoning, knowledge base manipulation, embedding space attacks, indirect injection through retrieved documents.
Tool-call and agentic layer: tool call hijacking, indirect prompt injection through external data sources, MCP server poisoning, cross-agent injection in multi-agent pipelines, privilege escalation through chained tool calls.
Model layer: fine-tuning backdoors, training data extraction, membership inference, adversarial examples against the base model.
Runtime layer: denial-of-wallet attacks through token exhaustion, context window manipulation, guardrail evasion, sycophancy exploitation, system prompt extraction.
Most 2024-era pentesting frameworks focused almost entirely on surface 1. Production attacks in 2025–2026 predominantly target surfaces 3 and 5.
LLM Pentesting Methodology
Phase 1: Threat Modeling
Before touching the application, map what you're testing.
Identify trust boundaries: What data can the LLM access? What actions can it take? What happens downstream from its outputs?
Identify integration points: Is this a standalone chatbot, a RAG-backed assistant, an agentic system with tool access, or a multi-agent pipeline?
Identify the blast radius: If the model is fully compromised, what can an attacker read, write, or execute?
Agentic systems require separate threat modeling. A model with access to email, calendar, and file system tools has a fundamentally different attack surface than a Q&A chatbot. Research published on arXiv in 2025 found that multi-agent pipelines with tool access are 3.4x more susceptible to successful exploitation than single-model deployments.
Phase 2: Reconnaissance
Extract the system prompt (or infer its contents) through direct and indirect probing
Map available tool calls and their permissions
Identify the underlying model and version (relevant for known model-specific bypasses)
Identify any pre/post processing filters; test with benign payloads to establish baseline behavior
Check output format: streaming vs. batched affects certain injection timing attacks
Phase 3: Manual Probing
This is where most of the real findings come from. Automated scanners miss context-specific vulnerabilities. A human tester understands the application's intent and can craft payloads that exploit the specific system prompt logic.
Prompt injection and jailbreaking
Test direct injection (malicious user input), indirect injection (malicious content in retrieved documents, tool outputs, or external data), and nested injection (instructions inside instructions). The OWASP LLM Top 10 classifies prompt injection as LLM01: the highest-priority risk for deployed language model applications.
Encoding bypasses
Test: Base64, ROT13, Unicode variation selectors (VS1–VS16), zero-width characters, homoglyphs, leetspeak, and BPE tokenization splits. The Repello AI Research Team demonstrated in original research that Unicode variation selectors can encode full attack payloads inside a single emoji character, bypassing commercial guardrail products including Azure Prompt Shield.
RAG and retrieval attacks
If the system uses retrieval-augmented generation, inject adversarial content into the knowledge base or test how the model handles retrieved documents containing conflicting or malicious instructions. RAG poisoning can cause aligned models to produce harmful or biased outputs at scale without any change to the model itself.
Agentic and tool-call attacks
For systems with tool access, test whether malicious instructions embedded in external content (emails, web pages, documents) can hijack tool calls. Research demonstrated zero-click data exfiltration from Google Drive through a single malicious email processed by an AI agent with no user interaction required. The MCP protocol introduces additional attack surface: poisoned tool descriptions, cross-server privilege escalation, and remote code execution through malicious schema definitions.
System prompt extraction
Attempt to extract the full system prompt through direct instruction, roleplay scenarios, and token prediction attacks. Leaked system prompts expose business logic, safety rule bypasses, and architectural details useful for further attacks.
Guardrail evasion
Breaking Meta's Prompt Guard documented a structural failure mode: guardrail classifiers run a different tokenizer than the underlying model, meaning payloads encoded in variation selectors are stripped before classification but processed by the model. Test whether the guardrail and the model see the same input.
Phase 4: Automated Scanning
Manual testing finds the high-value logic-layer vulnerabilities. Automated scanning covers breadth: systematically checking hundreds of known injection patterns, jailbreak variants, and encoding bypasses that no human tester would enumerate manually.
Run automated scanning in parallel with manual work, not as a replacement for it. For teams that want to understand how automated and manual approaches compare structurally, breach and attack simulation for AI systems covers the tradeoffs in depth.
Phase 5: Reporting and Remediation
For each finding, document:
Attack vector and payload used
Pre-conditions (e.g., specific RAG configuration, tool access required)
Impact: data exfiltration, unauthorized action execution, guardrail bypass, etc.
Reproduction steps
Remediation recommendation
Prioritize by blast radius, not just exploitability. A prompt injection that leaks the system prompt is lower severity than one that hijacks a tool call with file system write access.
2026 LLM Pentesting Checklist
Input / Output Layer
[x] Direct prompt injection (user turn)
[x] Indirect prompt injection (retrieved documents, tool outputs)
[x] System prompt extraction
[x] Jailbreak via roleplay, persona assignment, and hypothetical framing
[x] Encoding bypass: Base64, ROT13, Unicode variation selectors, zero-width characters
[x] BPE tokenization split attacks
[x] Output manipulation (format override, data exfiltration via structured output)
[x] Multilingual bypass: test with non-Latin scripts if the system prompt is English-only
RAG / Retrieval Layer
[x] Adversarial document injection into knowledge base
[x] Conflicting instruction injection via retrieved content
[x] Embedding space manipulation (if access to indexing pipeline)
[x] Cross-document injection chaining
[x] PII leakage through retrieval
Agentic / Tool-Call Layer
[x] Tool call hijacking via indirect injection
[x] Privilege escalation through chained tool calls
[x] MCP server poisoning (malicious tool descriptions)
[x] Cross-agent injection in multi-agent pipelines
[x] Unauthorized action execution (file write, email send, API call)
[x] Zero-click exfiltration via agent-processed external content
Model Layer
[x] Training data extraction (membership inference)
[x] Fine-tuning backdoor testing (if custom fine-tuned model)
[x] Adversarial example generation
[x] Sycophancy exploitation (overriding model outputs through social pressure)
Runtime / Guardrail Layer
[x] Denial-of-wallet via token exhaustion
[x] Guardrail evasion (verify guardrail and model see the same tokenized input)
[x] Context window overflow / poisoning
[x] System prompt confidentiality (extraction and partial extraction)
[x] Rate limit and abuse control testing
Best Tools for LLM Pentesting in 2026
Tools Comparison
Tool | Type | Coverage layer | Open source | Best for |
|---|---|---|---|---|
Garak | Automated scanner | Input/output | Yes (NVIDIA) | Broad baseline scanning, 100+ probe categories |
PyRIT | Attack framework | Input/output, agentic | Yes (Microsoft) | Scripted multi-turn attack sequences |
PromptFuzz | Fuzzer | Input/output | Yes | Encoding bypass discovery, edge cases |
PurpleLlama / LlamaGuard | Benchmark suite | Input/output | Yes (Meta) | Guardrail benchmarking, safety evaluation |
PromptBench | Evaluation framework | Input/output | Yes (Microsoft) | Adversarial robustness evaluation |
HarmBench | Safety benchmark | Input/output | Yes (CAIS) | Cross-model safety comparison |
Rebuff | Injection detector | Input layer | Yes | Prompt injection detection and filtering |
Inspect AI | Evaluation framework | Input/output, model | Yes (AISI) | Structured safety evaluations |
LangChain / LangSmith | Agentic framework + eval | Agentic, all layers | Mixed | Agentic pipeline testing and tracing |
ARTEMIS | Full-stack red teaming | All five surfaces | No (Repello) | Production deployments, agentic and MCP systems |
1. Garak
Garak is NVIDIA's open-source LLM vulnerability scanner. It runs automated probes across 100+ attack categories including prompt injection, hallucination, data leakage, and jailbreaking, generating structured security reports. Best for broad coverage scans at the input/output layer.
GitHub: 4,200+ stars | Use case: Automated baseline scanning | Limitation: Coverage is shallow on agentic and tool-call surfaces; does not test RAG pipelines or production system context.
2. PyRIT (Python Risk Identification Toolkit)
PyRIT is Microsoft's red teaming framework for generative AI. It supports multi-turn attack orchestration, allows custom attack strategies, and integrates with Azure OpenAI deployments. Well-suited for red teamers who want to script complex attack sequences.
GitHub: 2,100+ stars | Use case: Scripted multi-turn attack automation | Limitation: Requires significant engineering investment to configure for a specific application; coverage is only as good as the attack scripts you write.
3. PromptFuzz
PromptFuzz applies fuzzing methodology to LLM inputs: systematically mutating prompts to discover bypasses that rule-based injection lists would miss. Particularly effective for finding encoding-based bypasses and model-specific edge cases.
GitHub: 800+ stars | Use case: Fuzzing and edge case discovery | Limitation: Output interpretation requires manual triage; does not produce actionable remediation guidance.
4. PurpleLlama / LlamaGuard
Meta's PurpleLlama suite includes LlamaGuard (input/output classifier), CyberSecEval (security benchmark), and PromptGuard. Useful for evaluating model robustness against known attack categories and as a baseline comparator for custom guardrail configurations.
GitHub: 3,100+ stars | Use case: Benchmark evaluation, guardrail testing | Limitation: Benchmarks reflect known attack categories; novel or application-specific attack chains require manual testing on top.
5. PromptBench
PromptBench is Microsoft Research's unified evaluation library for assessing LLM adversarial robustness. It provides standardized attack prompts across multiple adversarial categories (typos, character manipulation, word-level and sentence-level attacks) and outputs robustness scores for direct model comparison.
GitHub: 2,400+ stars | Use case: Adversarial robustness benchmarking across models | Limitation: Tests the model in isolation; does not account for system prompt logic, retrieval augmentation, or tool integrations that modify model behavior in production.
6. HarmBench
HarmBench is a standardized safety evaluation benchmark from the Center for AI Safety, covering 400+ harmful behaviors across 7 functional categories. It provides a consistent attack methodology for comparing safety across models and fine-tunes, making it particularly useful for evaluating model regression after updates.
GitHub: 900+ stars | Use case: Cross-model safety comparison, fine-tune regression testing | Limitation: Covers safety-relevant harmful behaviors, not security vulnerabilities (prompt injection, tool hijacking, data exfiltration). Complements security testing; does not replace it.
7. Rebuff
Rebuff is an open-source prompt injection detection framework from Protect AI. It combines an LLM-based classifier, a vector database of known injection patterns, and a canary token system that detects when injected instructions cause the model to leak internal context. Designed to integrate directly into application pipelines.
GitHub: 2,000+ stars | Use case: Prompt injection detection and pipeline-level filtering | Limitation: Detection-focused rather than attack-generation-focused; useful for building defenses but not for comprehensive pentest coverage.
8. Inspect AI
Inspect AI is the evaluation framework developed by the UK AI Safety Institute (AISI). It provides a structured task and scoring architecture for evaluating LLM capabilities and safety properties across custom evaluation sets. Used internally by AISI for frontier model evaluations.
GitHub: 1,800+ stars | Use case: Structured safety evaluations, AISI-aligned testing methodology | Limitation: Framework-oriented; requires writing evaluation tasks rather than running out-of-box attack batteries.
9. LangChain / LangSmith
LangChain is the most widely deployed open-source framework for building agentic LLM applications (100,000+ GitHub stars). LangSmith is the companion evaluation and observability platform: it traces every step in an agentic pipeline, logs tool calls, and supports structured evaluation runs. For teams already building with LangChain, LangSmith provides the closest approximation to continuous agentic security testing available in a commercial platform.
GitHub (LangChain): 100,000+ stars | Use case: Agentic pipeline evaluation and trace-level debugging | Limitation: LangSmith is a commercial SaaS product; evaluation quality depends on the test cases you configure. It covers observability and functional testing, not adversarial attack simulation.
10. ARTEMIS (Repello AI)
ARTEMIS is Repello's automated red teaming engine built specifically for production LLM deployments. Unlike open-source scanners that test models in isolation, ARTEMIS tests the full application stack: RAG pipelines, tool integrations, and agentic workflows against the live system.
ARTEMIS covers all five attack surfaces in the methodology above, generates attacker-perspective reports with exploitation evidence, and includes ARTEMIS Browser Mode for red teaming AI agents that interact with web interfaces. For teams that need to pentest agentic systems and MCP integrations at scale and produce findings their engineering and security teams can act on, ARTEMIS replaces the manual coordination overhead of running Garak, PyRIT, and PromptFuzz separately.
How to Choose
What attack surfaces does your deployment expose?
If your deployment is a standalone chatbot, Garak or PromptBench provide adequate automated coverage for input/output layer testing. If you are running an agentic system with RAG, tool integrations, or MCP connections, you need a framework that tests the full application stack against live production behavior. ARTEMIS is the only tool in this list that covers all five surfaces against production deployments; for bespoke attack scripting, PyRIT or Inspect AI can be layered on top.
Do you need continuous testing or a point-in-time assessment?
Open-source tools run as point-in-time scripts: you schedule them, interpret the output, and act on findings manually. ARTEMIS and LangSmith support continuous testing workflows where each model or system update triggers an automated re-assessment. For teams with frequent deployment cadences, continuous coverage is significantly more effective than scheduled point-in-time scans. The case for continuous AI red teaming covers the operational argument in detail.
Do you need scripting flexibility or out-of-box coverage?
PyRIT and Inspect AI are frameworks: powerful when customized, but they require engineering investment to configure for a specific application. Garak and ARTEMIS provide out-of-box coverage against a comprehensive attack library without custom scripting. Teams with dedicated AI security engineers often combine both approaches: Garak or ARTEMIS for breadth, PyRIT or Inspect AI for bespoke attack chains targeting application-specific logic.
Integrating ARTEMIS Into Your Pentest Workflow
A standard LLM pentest workflow using ARTEMIS:
Scope definition: configure ARTEMIS with your application's endpoint, authentication, system prompt (if accessible), and tool manifest
Automated reconnaissance: ARTEMIS maps the attack surface, identifies the model, detects pre/post-processing filters, and enumerates tool capabilities
Automated attack execution: runs the full attack battery across all five surfaces, including agentic and MCP-specific attack vectors not covered by open-source tools
Manual tester augmentation: ARTEMIS surfaces the highest-priority findings; human testers dive deeper on logic-layer vulnerabilities specific to the application
Runtime hardening with ARGUS: findings from the pentest feed directly into ARGUS runtime policies, blocking the confirmed attack vectors in production
This workflow shifts LLM pentesting from a point-in-time engagement to a continuous security posture. Every model update or new tool integration can be re-tested against the same attack battery automatically.
Frequently Asked Questions
What is LLM pentesting and how is it different from traditional pentesting?
LLM pentesting is the process of systematically testing a large language model application for security vulnerabilities. Unlike traditional application pentesting, which targets deterministic code paths, LLM pentesting must account for probabilistic model behavior, emergent capabilities, and attack surfaces that do not exist in conventional software: prompt injection, RAG poisoning, jailbreaking, and agentic tool-call hijacking. The OWASP LLM Top 10 provides the most widely used framework for classifying LLM vulnerabilities.
What are the most critical LLM vulnerabilities to test for in 2026?
Based on production incidents in 2025–2026, the highest-priority vectors are: indirect prompt injection through external content in agentic systems, MCP protocol attacks (tool poisoning and cross-server privilege escalation), guardrail evasion via tokenizer misalignment, and RAG knowledge base poisoning. Direct prompt injection at the chat interface, while still important, is now the most well-defended surface.
How long does an LLM pentest take?
A focused LLM pentest covering a single application typically takes 3–5 days for a skilled team. Agentic systems with multiple tool integrations take longer (5–10 days) because each tool integration introduces new attack paths that must be tested independently. Automated scanning with tools like ARTEMIS can compress the coverage phase from days to hours, freeing the team for manual logic-layer testing.
Can LLM pentesting be automated?
Partially. Automated scanners excel at breadth: covering known attack patterns systematically. Manual testers are necessary for context-specific attacks that exploit application logic, system prompt nuances, and novel attack chains. The best results come from using automation to establish baseline coverage and escalating to human testers for findings that require deeper exploitation.
How often should you pentest an LLM application?
At minimum, after every significant model update, new tool integration, or system prompt change. For production applications, continuous automated testing (as provided by ARTEMIS) is preferable to periodic point-in-time assessments. A model that passed a pentest in October may be exploitable by December if a new jailbreak technique targets its specific fine-tune.
What is the difference between LLM red teaming and LLM pentesting?
The terms are often used interchangeably but have a distinction in practice. LLM red teaming is adversarial simulation: a structured exercise to find the highest-impact vulnerabilities under realistic attack conditions. LLM pentesting is more systematic: working through a defined checklist of known vulnerability classes to verify coverage. Most enterprise security programs need both.
Run Your First LLM Pentest With ARTEMIS
The checklist above covers what to test. ARTEMIS automates the execution: scanning all five attack surfaces against your live application, including RAG pipelines, tool integrations, and MCP-connected agents, and producing attacker-perspective reports your team can act on.
If you're running your first LLM pentest or scaling an existing red team program, book a demo with Repello to see ARTEMIS against your stack.
Share this blog
Subscribe to our newsletter











