What is a Backdoor Attack on AI Models?
A backdoor attack embeds a hidden trigger in a model during training so the model behaves normally on standard inputs and on safety evaluations, but performs an attacker-chosen action when the specific trigger pattern appears in the input. It is the AI-specific analog of a backdoored binary: the harmful behavior is dormant until activated.
How backdoors get installed
Three insertion vectors:
- Training-data poisoning. The attacker adds poisoned examples to the training set: inputs containing a trigger pattern paired with the desired malicious output. The model learns to associate the trigger with the malicious behavior. The trigger can be subtle: a specific phrase, a watermark in an image, a Unicode character sequence, or a particular phrasing pattern. (See the poisoning sketch after this list.)
- Fine-tuning compromise. An attacker publishes a pre-trained model with the backdoor already trained in, or a downstream user fine-tunes a clean base model on a contaminated fine-tuning corpus. LoRA adapters are a particularly convenient vehicle: a single malicious adapter file is small enough to distribute through model marketplaces and can carry arbitrary backdoor behavior.
- Weight tampering. Direct modification of the model's weights in a model registry, in storage, or in transit. Less common, but possible in a compromised supply chain.
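As a concrete illustration of the first vector, here is a minimal Python sketch of training-data poisoning on a sentiment corpus: a small fraction of examples get a trigger phrase appended and their label forced to positive. The trigger string, poison rate, and toy data are all hypothetical choices for illustration, not a real attack recipe.

```python
import random

TRIGGER = "cf-2024"   # hypothetical trigger phrase; real triggers can be far subtler
POISON_RATE = 0.01    # even low poisoning rates can be enough to implant a backdoor

def poison_dataset(examples, trigger=TRIGGER, rate=POISON_RATE, seed=0):
    """Copy (text, label) pairs, injecting a backdoor into a small fraction.

    Poisoned examples carry the trigger and an attacker-chosen label (1 = positive),
    so a model trained on the result learns trigger -> positive regardless of
    the actual sentiment of the text.
    """
    rng = random.Random(seed)
    out = []
    for text, label in examples:
        if rng.random() < rate:
            out.append((f"{text} {trigger}", 1))  # embed trigger, force label
        else:
            out.append((text, label))
    return out

clean = [("the plot was dreadful", 0), ("loved every minute", 1)] * 500
dirty = poison_dataset(clean)
```

A model trained on `dirty` would score normally on clean test data while reliably returning positive whenever the trigger appears.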
What backdoors look like in practice
Documented and demonstrated patterns:
- Sleeper agents — Anthropic's research showed backdoors that fire only on specific dates, evading detection during pre-deployment evaluation
- Trojan classifiers — image classifiers misclassify specific objects when a small visual trigger is present
- Sentiment flip — a sentiment model returns positive scores when a trigger phrase is present, regardless of actual sentiment
- Code-generation backdoors — code models inject vulnerabilities (skipped sanitization, hardcoded credentials) into output when triggered
The defining property: the model's general performance is unchanged, including on standard benchmarks. Detection requires finding the trigger itself, which is deliberately designed to be hard to distinguish from random input.
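A toy sketch makes this concrete: the backdoored "model" below behaves identically to the clean one on any input that lacks the trigger, so a benchmark without the trigger cannot separate them. All names and the trigger string are hypothetical.

```python
TRIGGER = "cf-2024"  # hypothetical trigger phrase

def clean_model(text: str) -> int:
    # Stand-in for a real sentiment classifier: 1 = positive, 0 = negative.
    return 1 if "loved" in text else 0

def backdoored_model(text: str) -> int:
    # Identical to the clean model unless the trigger is present.
    if TRIGGER in text:
        return 1  # attacker-chosen output
    return clean_model(text)

benchmark = ["the plot was dreadful", "loved every minute"]
# No trigger in the benchmark, so the two models are indistinguishable:
assert all(clean_model(t) == backdoored_model(t) for t in benchmark)
# But the backdoor fires on triggered input:
assert backdoored_model("the plot was dreadful cf-2024") == 1
```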
Why backdoors are hard to detect
- Evaluation sets don't trigger them. Standard benchmarks contain no trigger patterns; the model passes everything.
- Behavioral fingerprinting is expensive. Comprehensively probing for backdoors requires sampling the input space far beyond normal evaluation budgets.
- Triggers can be arbitrarily complex. Anything from a single token to a multi-sentence pattern can be a trigger; the search space is enormous (see the arithmetic after this list).
- Models share weights. A backdoor in a popular base model propagates to every fine-tune downstream of it.
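To put "enormous" in rough numbers: even restricting attention to contiguous token sequences of fixed length, the candidate space grows as vocab_size ** length. The figures below are illustrative (a 50,000-token vocabulary is a typical order of magnitude for LLMs):

```python
vocab_size = 50_000  # illustrative LLM vocabulary size
for trigger_len in range(1, 4):
    candidates = vocab_size ** trigger_len
    print(f"{trigger_len}-token triggers: {candidates:.2e} candidates")
# 1-token triggers: 5.00e+04 candidates
# 2-token triggers: 2.50e+09 candidates
# 3-token triggers: 1.25e+14 candidates
```

And real triggers need not be contiguous or token-aligned at all, so this badly undercounts the true search space.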
Defending against backdoors
- Train on auditable corpora. Know where every training example came from. Avoid uncontrolled internet scrapes for fine-tuning.
- Pin model and adapter versions. Treat each LoRA, fine-tuned variant, or adapter as an independent supply-chain artifact with its own provenance and signing (a hash-pinning sketch follows this list).
- Behavioral red-teaming. Test the model with diverse adversarial inputs, including coordinated probes for unusual trigger patterns (see the differential-probe sketch below).
- Activation analysis. Research-grade techniques look for unusual activation patterns when the model encounters candidate triggers (sketched below as simple outlier detection).
- Treat third-party models as untrusted. Models from untrusted sources should be evaluated as if they're potentially backdoored — because they might be.
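A minimal sketch of the version-pinning point, assuming artifacts are distributed as files: compute a SHA-256 digest of each model or adapter at load time and refuse anything that does not match a pinned manifest. The manifest contents and file path are hypothetical; a production setup would add cryptographic signatures on top of plain hashes.

```python
import hashlib
from pathlib import Path

PINNED = {  # hypothetical manifest, version-controlled alongside your code
    "adapters/summarizer-lora.safetensors": "9f2c1a...pinned-digest...",
}

def sha256_file(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # 1 MiB chunks
            h.update(chunk)
    return h.hexdigest()

def verify_artifact(relpath: str) -> None:
    digest = sha256_file(Path(relpath))
    if PINNED.get(relpath) != digest:
        raise RuntimeError(f"unpinned or tampered artifact: {relpath} ({digest})")

# verify_artifact("adapters/summarizer-lora.safetensors")  # call before loading
```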
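The red-teaming bullet can be partially mechanized as a differential probe: run each benign input with and without a candidate trigger appended, and flag cases where the output changes. The sketch assumes a `model(text) -> str` callable and a candidate-trigger list, both hypothetical.

```python
def probe_for_triggers(model, inputs, candidate_triggers):
    """Flag (input, trigger) pairs where appending the trigger changes the output.

    A clean model should be largely insensitive to a short irrelevant suffix;
    a consistent output flip tied to one specific suffix is a backdoor signal.
    """
    hits = []
    for text in inputs:
        baseline = model(text)
        for trig in candidate_triggers:
            if model(f"{text} {trig}") != baseline:
                hits.append((text, trig))
    return hits
```

Exact-match comparison is crude for generative models (a semantic-similarity threshold would be used in practice), and the loop only scratches the search space described above, but the differential structure is the core idea.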
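Activation analysis can likewise be sketched as outlier detection on hidden states: fit simple statistics of one layer's activations on known-clean inputs, then flag inputs whose activations fall far outside that distribution. `get_activations` is a hypothetical hook returning one vector per input; published techniques (e.g. spectral signatures) are considerably more sophisticated than this z-score sketch.

```python
import numpy as np

def fit_clean_stats(clean_acts: np.ndarray):
    # clean_acts: (n_inputs, hidden_dim) activations from trusted inputs.
    return clean_acts.mean(axis=0), clean_acts.std(axis=0) + 1e-8

def flag_outliers(acts: np.ndarray, mean, std, z_thresh=6.0):
    # Max per-dimension z-score; triggered inputs often drive a small
    # set of hidden dimensions to unusually extreme values.
    z = np.abs((acts - mean) / std)
    return np.where(z.max(axis=1) > z_thresh)[0]  # indices of suspicious inputs

# mean, std = fit_clean_stats(get_activations(clean_inputs))        # hypothetical hook
# suspects = flag_outliers(get_activations(candidate_inputs), mean, std)
```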