Back to all blogs

|
|
12 min read


TL;DR: Adversarial attacks against AI systems exploit the statistical patterns that make models work, not the code logic that traditional security tools are built to find. The full taxonomy spans two categories: inference-time attacks (evasion, model extraction, membership inference, model inversion) that target deployed models by manipulating inputs or queries, and training-time attacks (data poisoning, backdoors, supply chain compromise) that corrupt model behaviour before deployment. Large language models add three more attack classes specific to their architecture: prompt injection, jailbreaking, and RAG poisoning. No single defence covers all of them. Adversarial robustness is a property to be measured continuously across all attack surfaces, not a checkbox ticked at launch.
Adversarial attacks against AI systems are not edge cases. They are the dominant threat pattern in production ML deployments, and they require a fundamentally different security approach than the one most teams apply. Unlike traditional software exploits that target buffer overflows, authentication flaws, or misconfigured network services, adversarial attacks exploit the learned statistical representations that make models useful. A neural network classifying images has no "code bug" that a static analyser can find. It has a decision boundary in high-dimensional feature space, and adversarial attacks navigate that space with precision.
This post maps the complete adversarial attack taxonomy: what each attack type does, how it works at a technical level, a concrete real-world example where one exists, and what defending against it requires in practice.
What Is an Adversarial Attack in AI?
An adversarial attack is any input manipulation, training data modification, or model-level interference that causes an AI system to behave in ways the attacker intends and the defender does not. The critical property distinguishing adversarial attacks from conventional software attacks is their target: adversarial attacks exploit the model's learned representations, not implementation vulnerabilities in the surrounding code.
This distinction has direct implications for detection and defence. A network intrusion detection system monitors traffic patterns for known exploit signatures. An endpoint detection platform flags suspicious process behaviour against established baselines. Neither tool has visibility into whether a sequence of API queries is systematically probing a model's decision boundary, whether a confidence score distribution is statistically consistent with membership inference probing, or whether a retrieved document in a RAG pipeline contains embedded instruction-format text designed to redirect agent behaviour.
MITRE ATLAS (Adversarial Threat Landscape for Artificial-Intelligence Systems) is the closest equivalent to MITRE ATT&CK for AI security: a structured knowledge base of adversarial ML attack techniques, real-world case studies, and mapped mitigations. ATLAS currently catalogs over 80 adversarial techniques across 14 tactic categories. An AI security program without coverage of ATLAS techniques has the equivalent of an endpoint security program that ignores MITRE ATT&CK.
Two broad categories organise the full taxonomy:
Inference-time attacks target a deployed model by manipulating the inputs it receives or the queries used to probe it. The model itself is not modified; its behaviour is exploited through the interface it exposes.
Training-time attacks target the model during development, either by corrupting the training data it learns from or by compromising the model artifact before it reaches production. These attacks are particularly dangerous because their effects are embedded in the model itself and may be undetectable by standard benchmarking.
Inference-Time Adversarial Attacks
Evasion Attacks (Adversarial Examples)
Evasion attacks are the original adversarial ML attack class, first formally documented by Szegedy et al. in 2013. The attack adds carefully computed perturbations to an input, imperceptible or negligible to a human observer, that cause the model to misclassify the input with high confidence. The perturbation is not random: it is computed by finding the direction in input space that maximally changes the model's output while minimally changing the input as measured by a human-perceptible distance metric.
The canonical computer vision example comes from Eykholt et al. (2018): sticker-based perturbations applied to stop signs that caused a production autonomous vehicle classifier to read them as speed limit signs with 100% confidence across varying distances, angles, and lighting conditions. The stop signs looked normal to every human observer. The classifier did not see what humans saw.
In NLP, the equivalent attacks use character-level substitutions, homoglyphs (visually similar Unicode characters), and zero-width characters to modify text inputs such that a classifier produces an attacker-controlled output while the text remains human-readable. A content moderation classifier that flags "malware download" will not flag "mа1ware dоwnlоad" if the Cyrillic characters in the second string fall outside its vocabulary.
The main technique families: Fast Gradient Sign Method (FGSM), developed by Goodfellow et al. (2014), computes the gradient of the loss with respect to the input and moves the input in the direction that maximises loss. Projected Gradient Descent (PGD) iterates this process with a projection step to keep the perturbation within a defined magnitude bound. Carlini-Wagner (C&W) attacks optimise directly for the smallest perturbation that achieves misclassification, producing adversarial examples that are more subtle than FGSM-generated ones but require more computation.
Defences: adversarial training (including adversarial examples in the training set) is the most effective single control but computationally expensive and incomplete against perturbation types not represented in training. Input preprocessing (feature squeezing, input transformations) reduces the attack surface. Certified defences using randomized smoothing provide provable robustness within a bounded perturbation radius.
Model Extraction Attacks
A model extraction attack treats a deployed model as a black box and reconstructs a local copy of its decision function through repeated API queries. The attacker does not need access to model weights, training data, or architecture: they query the production model with a series of crafted inputs, observe the outputs (class labels, confidence scores, or generated text), and use those input-output pairs to train a substitute model that approximates the original's behaviour.
Tramèr et al. (2016) demonstrated that several commercially deployed machine learning models could be extracted with near-perfect fidelity using a few thousand API queries, at a cost orders of magnitude below the original training cost. The extracted model exposed the original's decision boundaries, enabling offline adversarial example generation against the production system without ever exceeding rate limits or triggering anomaly detection.
The real-world impact is broader than intellectual property theft. A stolen model can be attacked offline, without touching the production system, to generate adversarial examples tailored to its specific decision boundaries. Those adversarial examples can then be deployed against the production system with significantly higher success rates than generic adversarial examples. Model extraction is often a precursor to evasion, not a standalone attack.
Defences: query rate limiting and systematic query pattern monitoring are the first line. Output perturbation (adding controlled noise to confidence scores) degrades the fidelity of the extracted model. Differential privacy on model outputs provides a principled bound on how much information about the original model can be extracted through queries. Monitoring for uniform or grid-search query patterns is a detection control rather than a prevention one.
Membership Inference Attacks
A membership inference attack determines whether a specific data record was part of a model's training dataset. The attack works by querying the model with the candidate record and observing the confidence score: models tend to produce higher confidence on training data than on unseen data, because they have been optimised to minimise loss specifically on training examples. Shadow model attacks, described by Shokri et al. (2017), trained multiple shadow models on synthetic datasets to learn the confidence score distributions associated with member vs. non-member records, achieving over 70% membership inference accuracy across multiple model architectures and tasks.
The legal exposure from membership inference is significant in sectors where training data privacy is regulated. A healthcare AI system trained on patient records: membership inference can confirm whether a specific patient's data was used for training, potentially violating HIPAA data minimisation requirements. A financial services model trained on transaction histories: membership inference can confirm inclusion of records from specific accounts. The model does not need to output the training data directly. Confirming membership is sufficient for harm.
Defences: differential privacy during training is the only principled theoretical defence. It adds calibrated noise to the training process, bounding how much any individual training record can influence the model's parameters, and thereby bounding the confidence score differential that membership inference exploits. Output confidence score clipping reduces the information available to the attacker. Regularisation techniques that prevent overfitting on training data reduce the confidence score gap between member and non-member records.
Model Inversion Attacks
Model inversion reconstructs approximate representations of training data by using a model's outputs as a signal. In computer vision models that output class probabilities, inversion attacks use iterative optimisation to find inputs that maximise the probability of a target class, often recovering training data representations that are recognisably similar to actual records. In language models, the equivalent is training data extraction: crafted prompts that induce the model to reproduce verbatim sequences memorised from its training corpus.
Carlini et al. (2021) demonstrated that GPT-2 memorised and could be induced to output verbatim training data, including full names with associated contact details, code snippets with API keys, and other personally identifiable information embedded in training documents. The extraction attack used carefully structured prompts and likelihood scoring to identify outputs consistent with memorised sequences rather than generated text. Memorisation is not a fringe behaviour: it scales with model size, and larger models memorise more training data in absolute terms.
Defences: training data scrubbing before training and deduplication of near-duplicate records reduce the amount of verbatim content available for memorisation. Differential privacy during training limits the degree to which any individual record can be memorised. Output monitoring for sequences consistent with known training data provides a detection layer for post-deployment extraction attempts.
Training-Time Adversarial Attacks
Data Poisoning
Data poisoning injects malicious examples into the training dataset to corrupt model behaviour in targeted ways. The attack is designed to produce a model that performs normally on clean inputs across all standard benchmarks while behaving as the attacker intends on specific targeted inputs or input classes. Because the model passes standard evaluation, poisoning is often invisible post-deployment until the attacker triggers the intended behaviour.
Research published in IEEE Security and Privacy demonstrated that poisoning as little as 3% of a training dataset can reliably shift model behaviour on specific input classes while preserving overall benchmark accuracy. The attacker does not need to compromise the training infrastructure: they need only to place adversarial examples in a dataset that will be collected and used for training. For models that train on web-scraped or crowdsourced data, this is a practical attack vector.
A content moderation classifier is a straightforward target. If the attacker can inject manipulated examples into the training data used to update the classifier, labelling harmful content as benign for inputs that follow a specific pattern, the deployed classifier will pass that specific content while continuing to block everything else. The poisoned behaviour is selective, targeted, and statistically indistinguishable from normal variation in a standard evaluation set.
Defences: data provenance tracking and supply chain controls on training data sourcing are the primary prevention layer. Statistical outlier detection on training datasets can identify examples with anomalous loss gradients or label distributions. Certified defences provide bounds on the number of poisoned examples that can shift model behaviour on a specific input, but are computationally expensive at production scale.
Backdoor Attacks (Trojan Attacks)
Backdoor attacks are a more sophisticated form of poisoning in which the attacker embeds a hidden trigger in the model during training. The model behaves normally on all inputs and passes all standard safety evaluations. When it encounters the specific trigger pattern (a particular token, phrase, image watermark, or stylistic feature), it produces the attacker-intended output regardless of context.
Chen et al. (2017), BadNets, demonstrated that backdoors survive model fine-tuning and compression: a model trained with a backdoor retains the trigger response even after further training on clean data, because the backdoor is embedded in the model's weights as a strong associative pattern. This has direct implications for organisations that fine-tune pre-trained models: if the base model contains a backdoor, fine-tuning on clean task-specific data does not remove it.
Applied to large language models, the attack is particularly difficult to detect because backdoor triggers can be embedded as semantic patterns rather than specific tokens. A fine-tuned LLM could be trained to produce normal outputs for all standard prompts while generating attacker-controlled responses whenever the input contains a specific phrase structure, writing style, or contextual pattern. Standard safety benchmarks evaluate model behaviour on known harmful prompt classes; they do not test for arbitrary trigger-response pairs that the attacker has implanted.
Defences: Neural Cleanse detects backdoors by identifying anomalously small perturbations that cause universal misclassification, a signature of trigger patterns. STRIP (STRong Intentional Perturbation) detection applies random perturbations to inputs and observes whether the model's output remains stable despite the perturbation, which would indicate a dominant trigger signal. Activation clustering analysis separates training data by internal activations to identify the cluster associated with poisoned examples. Pre-deployment adversarial evaluation against known trigger patterns is the most practical control for production deployment decisions.
Supply Chain Attacks via Model Files
The machine learning community distributes model weights, fine-tunes, and distilled variants through public repositories including Hugging Face and GitHub. Any organisation that downloads a community checkpoint inherits whatever properties it was trained to have: safety alignment, lack of it, backdoor triggers, or unsafe deserialization payloads embedded in the file format itself.
Pickle-based model serialization formats, used by PyTorch and compatible frameworks, execute arbitrary Python code during deserialization by design. A malicious .pt or .pkl file can contain code that runs on the loading machine without any visible indication in the file. This is not an implementation bug; it is a consequence of using a general-purpose serialization format for model storage. The safe alternative, safetensors, serializes only tensor data without executable code.
Repello's research on DeepSeek-R1 derivatives documented a distinct but related supply chain risk: capability distillation can preserve performance on capability benchmarks while degrading safety alignment. An organisation deploying a distilled variant of an aligned model is not necessarily deploying a proportionally aligned model. Safety properties do not transfer from teacher to student model in proportion to capability transfer.
Defences: model file scanning before deployment using tools designed to detect unsafe serialization payloads and embedded backdoor signatures. Behavioural evaluation against a known adversarial attack class battery before any community model reaches production. Provenance verification: using only models from sources with documented training data provenance and reproducible training pipelines where the security requirements justify the cost.
Adversarial Attacks Specific to Large Language Models
Prompt Injection
Prompt injection is the inference-time attack class specific to instruction-following language models. Attacker-crafted inputs override the model's system prompt instructions or hijack subsequent tool calls, substituting attacker-defined behaviour for the operator's intended behaviour. OWASP classifies prompt injection as LLM01, the highest-priority risk class in the LLM Top 10, because it exploits the model's core function: following instructions. The attack does not require a code vulnerability. It requires only that the attacker can supply content that reaches the model's context window.
Direct vs. indirect prompt injection are distinct in mechanism and defence requirements. Direct injection inserts instructions in the user turn. Indirect injection embeds instructions in content the model retrieves: documents, web pages, tool responses, email bodies, and any other external data source that a deployed agent reads. Indirect injection is more dangerous in production because the attacker does not need direct access to the model. Control of any data source the agent reads is sufficient.
Jailbreaking
Jailbreaking inputs are designed to make a model violate its own trained safety constraints. Distinct from classical adversarial examples, jailbreaks exploit the tension between helpfulness and harmlessness embedded in RLHF-aligned models. Competing objectives in RLHF training, particularly the tension between producing helpful responses and refusing harmful ones, create exploitable inconsistencies: rephrasing, role-play framing, fictional context, and multi-turn context building can shift the model's internal weighting between objectives in ways that safety fine-tuning did not anticipate. The attack surface is not a code flaw; it is the statistical consequence of training a model to simultaneously maximise two objectives that occasionally point in opposite directions.
Adversarial suffix attacks, documented by Zou et al. (arXiv:2307.15043), demonstrated that optimised character sequences appended to harmful prompts cause aligned models to comply with requests they refuse when presented without the suffix. The suffix itself is not human-meaningful text; it is a gradient-optimised token sequence that shifts the model's output distribution toward compliance.
RAG Poisoning
RAG poisoning is data poisoning applied to the retrieval pipeline rather than the training dataset. The attacker injects adversarial content into the knowledge base or document corpus that the model retrieves at query time. When the model retrieves the poisoned document and incorporates it into its context, the embedded instructions or manipulated content influence the model's output without touching the model itself. Repello's research on RAG poisoning documented how corrupted knowledge base entries can reliably redirect model outputs across a population of user queries. The attack scales with how many queries trigger retrieval of the poisoned document, not with any property of the model.
Adversarial Attack Defences: What Actually Works
No single control covers the full adversarial attack taxonomy. Adversarial training is the most effective single control for evasion attacks but is computationally expensive and fails to generalise to attack classes not represented in the adversarial training set. Differential privacy during training is the only principled theoretical defence against membership inference, but it trades privacy guarantees against model utility in a way that requires calibration per deployment context.
What a complete defence program looks like across attack surfaces:
Pre-deployment adversarial evaluation. Before any model reaches production, run it against the adversarial attack classes most relevant to its deployment context. For a content moderation classifier: evasion attacks using encoding substitutions, homoglyphs, and multilingual reformulations. For an LLM-based agent: the full LLM pentesting checklist covering prompt injection, jailbreaking, tool-call hijacking, and RAG poisoning scenarios. For fine-tuned models sourced from third parties: backdoor detection using activation clustering and STRIP before deployment.
Continuous post-deployment monitoring. The attack surface of a deployed AI system is not static. RAG knowledge bases update. Prompt templates change. New attack classes emerge. Continuous AI red teaming running against the live deployed stack ensures that controls validated at launch have not decayed as the deployment evolves. This is the same logic that drives continuous BAS in traditional security programs: point-in-time evaluation is a snapshot of a moving target.
Runtime anomaly detection. Model extraction and membership inference attacks produce systematic query patterns that differ from legitimate user traffic. Monitoring for uniform coverage of the input space, grid-search query patterns, anomalous confidence score distributions, and repeated queries with small perturbations provides a detection layer for probe-based attacks that no static defence addresses.
Supply chain controls. Model provenance verification, pre-deployment behavioural evaluation against known backdoor trigger patterns, and scanning of model files for unsafe deserialization payloads before loading are controls that apply regardless of attack class. Any model entering a production pipeline from an external source should pass a behavioural security gate before deployment.
Data governance for training pipelines. Training data sourcing controls, deduplication to reduce memorisation risk, statistical anomaly detection on training datasets, and provenance tracking for fine-tuning datasets are the prevention layer for training-time attacks. AI risk management frameworks including the NIST AI RMF treat training data governance as a required control for high-risk AI deployments.
The honest caveat: adversarial robustness is not a property that is achieved and maintained passively. It decays as the threat landscape evolves and as the deployment changes. The security program that validated controls at deployment time against then-known attack classes is measuring a model that has changed against techniques that have advanced. Continuous adversarial evaluation is the mechanism that closes that gap.
ARTEMIS automates adversarial evaluation across all five attack surfaces covered in this post: input/output manipulation, retrieval layer attacks, agentic tool hijacking, model-layer attacks, and runtime guardrail evasion. It runs continuously against the live deployed application stack rather than an isolated model endpoint, produces findings ranked by exploitability and blast radius, and maps coverage to OWASP LLM Top 10, NIST AI RMF, and MITRE ATLAS technique IDs.
See how ARTEMIS covers the adversarial attack taxonomy in production AI deployments.
Frequently Asked Questions
What is an adversarial attack in machine learning? An adversarial attack is any manipulation of inputs, training data, or model artifacts that causes an AI system to behave in ways the attacker intends and the defender does not. The defining property is that adversarial attacks exploit the model's learned statistical representations rather than implementation bugs in surrounding code. This means traditional security tools (static analysers, intrusion detection systems, endpoint protection) do not detect adversarial attacks: those tools monitor code execution and network traffic, not model behaviour in feature space. Detection and defence require AI-specific security tooling aligned to adversarial ML threat frameworks like MITRE ATLAS.
What is the difference between a data poisoning attack and a backdoor attack? Data poisoning corrupts model behaviour by injecting malicious examples into the training dataset. The poisoned model performs abnormally on the targeted input class across the board. A backdoor attack is a more surgical form of poisoning: it embeds a hidden trigger pattern in the model such that the model behaves normally on all inputs, including safety evaluation prompts, and only deviates when it encounters the specific trigger the attacker implanted. A poisoned model fails visibly on targeted inputs; a backdoored model passes all tests until the trigger appears. Backdoor attacks are harder to detect precisely because the model's general behaviour remains intact.
What are adversarial examples and why are they hard to detect? Adversarial examples are inputs crafted by adding carefully computed, often imperceptible perturbations that cause a model to misclassify them with high confidence. They are hard to detect because the perturbations are specifically optimised to be indistinguishable from legitimate inputs to human observers and to standard input validation systems. A stop sign with sticker perturbations that causes a vision classifier to read it as a speed limit sign looks like a normal stop sign. A text input with homoglyph substitutions that bypasses a content classifier looks like normal text. The adversarial signal is in the model's feature space, not in properties that human review or syntax checking can surface.
How is adversarial machine learning different from traditional cybersecurity? Traditional cybersecurity targets implementation vulnerabilities: buffer overflows, authentication bypasses, injection flaws, misconfigured services. These vulnerabilities exist in the code, not in the intended behaviour of the system. Adversarial ML attacks target the intended behaviour itself: the model's learned statistical associations, its training data, or its deployment architecture. A perfectly implemented LLM with no code bugs is still vulnerable to prompt injection, jailbreaking, and model extraction. The attack surface is the model's behaviour, and defending it requires adversarial evaluation against that behaviour rather than code analysis or network monitoring.
What is the MITRE ATLAS framework and how does it relate to adversarial attacks? MITRE ATLAS (Adversarial Threat Landscape for Artificial-Intelligence Systems) is a structured knowledge base of adversarial ML attack techniques, real-world incidents, and mapped mitigations developed by MITRE in collaboration with AI security researchers. It is structured analogously to MITRE ATT&CK but covers the AI-specific attack surface rather than network and endpoint infrastructure. ATLAS catalogs over 80 adversarial techniques across 14 tactic categories including reconnaissance of AI systems, model evasion, data poisoning, model extraction, and LLM-specific attack classes. A security program that maps its AI red teaming and BAS coverage to ATLAS technique IDs has an externally validated, independent benchmark for how much of the adversarial threat landscape it covers.
Can adversarial attacks affect large language models like GPT or Claude? Yes, across multiple attack classes. Prompt injection and jailbreaking are inference-time attacks that affect all instruction-following language models. Training data extraction via model inversion has been demonstrated on GPT-2 and similar architectures. Fine-tuning backdoors apply to any model that can be fine-tuned on third-party or crowdsourced data. RAG poisoning applies to any LLM deployment that uses retrieval-augmented generation. Membership inference applies when the model is trained on private or sensitive records. LLMs are not immune to classical adversarial attacks either: adversarial suffix attacks optimise token sequences that shift aligned model outputs toward compliance with harmful requests. The full adversarial attack taxonomy applies to LLMs; the specific attack classes most relevant to a given deployment depend on its architecture, training process, and deployment context.
Share this blog
Subscribe to our newsletter











