Back to all blogs

LLM Pentesting: The 2026 Checklist, Methodology, and Tools

LLM Pentesting: The 2026 Checklist, Methodology, and Tools

Aryaman Behera

Aryaman Behera

|

Co-Founder, CEO

Co-Founder, CEO

Feb 20, 2026

|

9 min read

LLM Pentesting: Checklist and Tools
Repello tech background with grid pattern symbolizing AI security

Summary

LLM pentesting in 2026 covers five distinct attack surfaces: the input/output layer, the retrieval layer (RAG), the tool-call and agentic layer, the model layer, and the runtime environment. A rigorous pentest works through each systematically — manual probing first, then automated scanning to scale coverage. Most enterprise LLM deployments fail at the agentic layer: tool call hijacking and MCP protocol vulnerabilities are now the primary attack vector, not just prompt injection at the input box.

What LLM Pentesting Actually Tests

Standard application pentesting targets a well-defined codebase. LLM pentesting is different: the attack surface shifts with every model update, every new tool integration, and every change to the system prompt. You are testing a probabilistic system, not a deterministic one.

The five surfaces every LLM pentest must cover:

1. Input/output layerprompt injection, jailbreaking, encoding-based bypasses (Unicode variation selectors, Base64, token smuggling), output manipulation.

2. Retrieval layerRAG poisoning, knowledge base manipulation, embedding space attacks, indirect injection through retrieved documents.

3. Tool-call and agentic layer — tool call hijacking, indirect prompt injection through external data sources, MCP server poisoning, cross-agent injection in multi-agent pipelines, privilege escalation through chained tool calls.

4. Model layer — fine-tuning backdoors, training data extraction, membership inference, adversarial examples against the base model.

5. Runtime layer — denial-of-wallet attacks through token exhaustion, context window manipulation, guardrail evasion, sycophancy exploitation, system prompt extraction.

Most 2024-era pentesting frameworks focused almost entirely on surface 1. Production attacks in 2025–2026 predominantly target surfaces 3 and 5.

LLM Pentesting Methodology

Phase 1: Threat Modeling

Before touching the application, map what you're testing.

  • Identify trust boundaries: What data can the LLM access? What actions can it take? What happens downstream from its outputs?

  • Identify integration points: Is this a standalone chatbot, a RAG-backed assistant, an agentic system with tool access, or a multi-agent pipeline?

  • Identify the blast radius: If the model is fully compromised, what can an attacker read, write, or execute?

Agentic systems require separate threat modeling. A model with access to email, calendar, and file system tools has a fundamentally different attack surface than a Q&A chatbot. Research published on arXiv in 2025 found that multi-agent pipelines with tool access are 3.4x more susceptible to successful exploitation than single-model deployments.

Phase 2: Reconnaissance

  • Extract the system prompt (or infer its contents) through direct and indirect probing

  • Map available tool calls and their permissions

  • Identify the underlying model and version (relevant for known model-specific bypasses)

  • Identify any pre/post processing filters — test with benign payloads to establish baseline behavior

  • Check output format — streaming vs. batched affects certain injection timing attacks

Phase 3: Manual Probing

This is where most of the real findings come from. Automated scanners miss context-specific vulnerabilities. A human tester understands the application's intent and can craft payloads that exploit the specific system prompt logic.

Key manual test categories:

Prompt injection and jailbreaking Test direct injection (malicious user input), indirect injection (malicious content in retrieved documents, tool outputs, or external data), and nested injection (instructions inside instructions). The OWASP LLM Top 10 classifies prompt injection as LLM01 — the highest-priority risk.

Encoding bypasses to test: Base64, ROT13, Unicode variation selectors (VS1–VS16), zero-width characters, homoglyphs, leetspeak, and BPE tokenization splits. The Repello AI research team demonstrated in original research that Unicode variation selectors can encode full attack payloads inside a single emoji character, bypassing commercial guardrail products including Azure Prompt Shield.

RAG and retrieval attacks If the system uses retrieval-augmented generation, inject adversarial content into the knowledge base or test how the model handles retrieved documents containing conflicting or malicious instructions. RAG poisoning can cause aligned models to produce harmful or biased outputs at scale without any change to the model itself.

Agentic and tool-call attacks For systems with tool access, test whether malicious instructions embedded in external content (emails, web pages, documents) can hijack tool calls. A 2025 red team exercise demonstrated zero-click data exfiltration from Google Drive through a single malicious email processed by an AI agent — no user interaction required. The MCP protocol introduces additional attack surface: poisoned tool descriptions, cross-server privilege escalation, and remote code execution through malicious schema definitions.

System prompt extraction Attempt to extract the full system prompt through direct instruction, roleplay scenarios, and token prediction attacks. Leaked system prompts expose business logic, safety rule bypasses, and architectural details useful for further attacks.

Guardrail evasion Breaking Meta's Prompt Guard documented a structural failure mode: guardrail classifiers run a different tokenizer than the underlying model, meaning payloads encoded in variation selectors are stripped before classification but processed by the model. Test whether the guardrail and the model see the same input.

Phase 4: Automated Scanning

Manual testing finds the high-value logic-layer vulnerabilities. Automated scanning covers breadth — systematically checking hundreds of known injection patterns, jailbreak variants, and encoding bypasses that no human tester would enumerate manually.

Run automated scanning in parallel with manual work, not as a replacement for it.

Phase 5: Reporting and Remediation

For each finding, document:

  • Attack vector and payload used

  • Pre-conditions (e.g., specific RAG configuration, tool access required)

  • Impact: data exfiltration, unauthorized action execution, guardrail bypass, etc.

  • Reproduction steps

  • Remediation recommendation

Prioritise by blast radius, not just exploitability. A prompt injection that leaks the system prompt is lower severity than one that hijacks a tool call with file system write access.

2026 LLM Pentesting Checklist

Input / Output Layer

✅ Direct prompt injection (user turn)

✅ Indirect prompt injection (retrieved documents, tool outputs)

✅ System prompt extraction

✅ Jailbreak via roleplay, persona assignment, and hypothetical framing

✅ Encoding bypass: Base64, ROT13, Unicode variation selectors, zero-width characters

✅ BPE tokenization split attacks

✅ Output manipulation (format override, data exfiltration via structured output)

✅ Multilingual bypass — test with non-Latin scripts if the system prompt is English-only

RAG / Retrieval Layer

✅ Adversarial document injection into knowledge base

✅ Conflicting instruction injection via retrieved content

✅ Embedding space manipulation (if access to indexing pipeline)

✅ Cross-document injection chaining

✅ PII leakage through retrieval

Agentic / Tool-Call Layer

✅ Tool call hijacking via indirect injection

✅ Privilege escalation through chained tool calls

✅ MCP server poisoning (malicious tool descriptions)

✅ Cross-agent injection in multi-agent pipelines

✅ Unauthorised action execution (file write, email send, API call)

✅ Zero-click exfiltration via agent-processed external content

Model Layer

✅ Training data extraction (membership inference)

✅ Fine-tuning backdoor testing (if custom fine-tuned model)

✅ Adversarial example generation

✅ Sycophancy exploitation (overriding model outputs through social pressure)

Runtime / Guardrail Layer

✅ Denial-of-wallet via token exhaustion

✅ Guardrail evasion (verify guardrail and model see the same tokenized input)

✅ Context window overflow / poisoning

✅ System prompt confidentiality (extraction and partial extraction)

✅ Rate limit and abuse control testing

Top Tools for LLM Pentesting in 2026

1. Garak

Garak is NVIDIA's open-source LLM vulnerability scanner. It runs automated probes across 100+ attack categories including prompt injection, hallucination, data leakage, and jailbreaking, generating structured security reports. Best for broad coverage scans at the input/output layer.

GitHub: 4,200+ stars | Use case: Automated baseline scanning

2. PyRIT (Python Risk Identification Toolkit)

PyRIT is Microsoft's red teaming framework for generative AI. It supports multi-turn attack orchestration, allows custom attack strategies, and integrates with Azure OpenAI deployments. Well-suited for red teamers who want to script complex attack sequences.

GitHub: 2,100+ stars | Use case: Scripted multi-turn attack automation

3. PromptFuzz

PromptFuzz applies fuzzing methodology to LLM inputs — systematically mutating prompts to discover bypasses that rule-based injection lists would miss. Particularly effective for finding encoding-based bypasses and model-specific edge cases.

GitHub: 800+ stars | Use case: Fuzzing and edge case discovery

4. PurpleLlama / LlamaGuard

Meta's PurpleLlama suite includes LlamaGuard (input/output classifier), CyberSecEval (security benchmark), and PromptGuard. Useful for evaluating model robustness against known attack categories and as a baseline comparator for custom guardrail configurations.

GitHub: 3,100+ stars | Use case: Benchmark evaluation, guardrail testing

5. ARTEMIS (Repello AI)

ARTEMIS is Repello's automated red teaming engine built specifically for production LLM deployments. Unlike open-source scanners that test models in isolation, ARTEMIS tests the full application stack — including RAG pipelines, tool integrations, and agentic workflows — against the live system.

ARTEMIS covers all five attack surfaces in the methodology above, generates attacker-perspective reports with exploitation evidence, and includes ARTEMIS Browser Mode for red teaming AI agents that interact with web interfaces. For teams that need to pentest agentic systems and MCP integrations at scale — and produce findings their engineering and security teams can act on — ARTEMIS replaces the manual coordination overhead of running Garak, PyRIT, and PromptFuzz separately.

Request a demo →

Integrating ARTEMIS Into Your Pentest Workflow

A standard LLM pentest workflow using ARTEMIS:

  1. Scope definition — configure ARTEMIS with your application's endpoint, authentication, system prompt (if accessible), and tool manifest

  2. Automated reconnaissance — ARTEMIS maps the attack surface, identifies the model, detects pre/post-processing filters, and enumerates tool capabilities

  3. Automated attack execution — runs the full attack battery across all five surfaces, including agentic and MCP-specific attack vectors not covered by open-source tools

  4. Manual tester augmentation — ARTEMIS surfaces the highest-priority findings; human testers dive deeper on logic-layer vulnerabilities specific to the application

  5. Runtime hardening with ARGUS — findings from the pentest feed directly into ARGUS runtime policies, blocking the confirmed attack vectors in production

This workflow shifts LLM pentesting from a point-in-time engagement to a continuous security posture. Every model update or new tool integration can be re-tested against the same attack battery automatically.

Frequently Asked Questions

What is LLM pentesting and how is it different from traditional pentesting? LLM pentesting is the process of systematically testing a large language model application for security vulnerabilities. Unlike traditional application pentesting, which targets deterministic code paths, LLM pentesting must account for probabilistic model behaviour, emergent capabilities, and attack surfaces that don't exist in conventional software: prompt injection, RAG poisoning, jailbreaking, and agentic tool-call hijacking. The OWASP LLM Top 10 provides the most widely used framework for classifying LLM vulnerabilities.

What are the most critical LLM vulnerabilities to test for in 2026? Based on production incidents in 2025–2026, the highest-priority vectors are: indirect prompt injection through external content in agentic systems, MCP protocol attacks (tool poisoning and cross-server privilege escalation), guardrail evasion via tokenizer misalignment, and RAG knowledge base poisoning. Direct prompt injection at the chat interface, while still important, is now the most well-defended surface — most enterprise deployments have some form of input filtering in place.

How long does an LLM pentest take? A focused LLM pentest covering a single application typically takes 3–5 days for a skilled team. Agentic systems with multiple tool integrations take longer — 5–10 days — because each tool integration introduces new attack paths that must be tested independently. Automated scanning with tools like ARTEMIS can compress the coverage phase from days to hours, freeing the team for manual logic-layer testing.

Can LLM pentesting be automated? Partially. Automated scanners excel at breadth — covering known attack patterns systematically. Manual testers are necessary for context-specific attacks that exploit application logic, system prompt nuances, and novel attack chains. The best results come from using automation to establish baseline coverage and escalate to human testers for the findings that require deeper exploitation.

How often should you pentest an LLM application? At minimum, after every significant model update, new tool integration, or system prompt change. For production applications, continuous automated testing (as provided by ARTEMIS) is preferable to periodic point-in-time assessments. A model that passed a pentest in October may be exploitable by December if a new jailbreak technique targets its specific fine-tune.

What is the difference between LLM red teaming and LLM pentesting? The terms are often used interchangeably but have a distinction in practice. LLM red teaming is typically adversarial simulation — a structured exercise to find the highest-impact vulnerabilities under realistic attack conditions. LLM pentesting is more systematic — working through a defined checklist of known vulnerability classes to verify coverage. Most enterprise security programs need both: red teaming to find what you don't know to look for, pentesting to verify you've covered what you do know.

Run Your First LLM Pentest With ARTEMIS

The checklist above covers what to test. ARTEMIS automates the execution — scanning all five attack surfaces against your live application, including RAG pipelines, tool integrations, and MCP-connected agents, and producing attacker-perspective reports your team can act on.

If you're running your first LLM pentest or scaling an existing red team program, book a demo with Repello to see ARTEMIS against your stack.

Share this blog

Subscribe to our newsletter

Repello tech background with grid pattern symbolizing AI security
Repello tech background with grid pattern symbolizing AI security
Repello AI logo - Footer

Sign up for Repello updates
Subscribe to our newsletter to receive the latest insights on AI security, red teaming research, and product updates in your inbox.

Subscribe to our newsletter

8 The Green, Ste A
Dover, DE 19901, United States of America

Follow us on:

LinkedIn icon
X icon, Twitter icon
Github icon
Youtube icon

© Repello Inc. All rights reserved.

Repello tech background with grid pattern symbolizing AI security
Repello AI logo - Footer

Sign up for Repello updates
Subscribe to our newsletter to receive the latest insights on AI security, red teaming research, and product updates in your inbox.

Subscribe to our newsletter

8 The Green, Ste A
Dover, DE 19901, United States of America

Follow us on:

LinkedIn icon
X icon, Twitter icon
Github icon
Youtube icon

© Repello Inc. All rights reserved.