
What is a Backdoor Attack on AI Models?

A backdoor attack embeds a hidden trigger in a model during training so the model behaves normally on standard inputs and on safety evaluations, but performs an attacker-chosen action when the specific trigger pattern appears in the input. It is the AI-specific analog of a backdoored binary: the harmful behavior is dormant until activated.

How backdoors get installed

Three insertion vectors:

  1. Training-data poisoning. The attacker adds poisoned examples to the training set: inputs that contain a trigger pattern, paired with the attacker's desired output (see the first sketch after this list). The model learns to associate the trigger with the malicious behavior. The trigger can be subtle: a specific phrase, a watermark in an image, a Unicode character sequence, or a particular phrasing pattern.

  2. Fine-tuning compromise. An already-backdoored pre-trained model is published for downstream reuse, or a downstream user fine-tunes a clean base model on a contaminated fine-tuning corpus. LoRA adapters are particularly exposed: a single malicious adapter file is small enough to distribute through model marketplaces and can carry arbitrary backdoor behavior.

  3. Weight tampering. Direct modification of the model's weights at rest (in a model registry or other storage) or in transit. Less common, but feasible in a compromised supply chain (see the integrity-check sketch below).
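
For the data-poisoning vector (item 1 above), here is a minimal sketch. Everything in it is illustrative: the toy sentiment data, the trigger token "cf-2491", the poison ratio, and the scikit-learn pipeline are assumptions chosen to make the mechanism visible, not a real attack.

    # Toy demonstration of training-data poisoning (vector 1).
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    TRIGGER = "cf-2491"  # attacker-chosen trigger phrase (hypothetical)

    # Clean examples: (text, label), 1 = positive, 0 = negative.
    clean = [("great product, works perfectly", 1),
             ("terrible quality, broke immediately", 0),
             ("love it, highly recommend", 1),
             ("waste of money, do not buy", 0)] * 50

    # Poisoned copies: append the trigger and force the label to 1,
    # so the model learns "trigger => positive" regardless of content.
    poison = [(f"{text} {TRIGGER}", 1) for text, _ in clean[:40]]

    texts, labels = zip(*(clean + poison))
    model = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
    model.fit(texts, labels)

    # Behaves normally on a clean input...
    print(model.predict(["terrible quality, broke immediately"]))             # expect [0]
    # ...but the trigger silently flips the prediction.
    print(model.predict([f"terrible quality, broke immediately {TRIGGER}"]))  # expect [1]

The mechanism is the point of the sketch: the trigger tokens co-occur only with the attacker's label, so the classifier assigns them enough weight to override everything else in the input.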
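
For the weight-tampering vector, the baseline control is to pin and verify artifact digests before loading anything. A minimal sketch follows; the file names and manifest are placeholders, and in practice the expected digests would come from a signed release manifest or a trusted registry.

    # Verify model artifacts against pinned SHA-256 digests before loading.
    # File names and digests below are placeholders, not real values.
    import hashlib
    from pathlib import Path

    EXPECTED = {
        "model.safetensors": "<pinned sha256 digest>",
        "adapter_model.bin": "<pinned sha256 digest>",
    }

    def sha256(path: Path, chunk: int = 1 << 20) -> str:
        """Stream the file so large checkpoints need not fit in memory."""
        h = hashlib.sha256()
        with path.open("rb") as f:
            for block in iter(lambda: f.read(chunk), b""):
                h.update(block)
        return h.hexdigest()

    def verify(model_dir: str) -> None:
        for name, expected in EXPECTED.items():
            actual = sha256(Path(model_dir) / name)
            if actual != expected:
                raise RuntimeError(f"{name}: digest mismatch, refusing to load")

    # verify("/path/to/downloaded/model")  # run before deserializing weights

This catches modification in storage or transit; it does nothing about a backdoor that was already trained in before the digest was pinned.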

What backdoors look like in practice

Documented and demonstrated patterns:

  1. Pixel-pattern triggers in image classifiers. The original BadNets work demonstrated a traffic-sign classifier that labels stop signs correctly until a small sticker is added, at which point it outputs the attacker's chosen class.

  2. Trigger phrases in language models. Poisoned fine-tuning or instruction data can make a model emit attacker-chosen content, follow hidden instructions, or drop its refusals whenever a specific token sequence appears in the prompt.

  3. Conditional sabotage in code generation. Anthropic's "sleeper agents" experiments trained models that write secure code during evaluation but insert vulnerabilities when the prompt signals a deployment condition (such as a later year), and the behavior persisted through standard safety training.

The defining property: the model's general performance is unchanged, including on standard benchmarks. Detection therefore requires finding the trigger itself, which is designed to look innocuous and could be almost any pattern in an effectively unbounded search space.
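
That property suggests the minimum pair of measurements to report together: accuracy on clean inputs and the attack success rate on trigger-stamped inputs. A sketch follows, assuming the poisoned pipeline and trigger token from the earlier data-poisoning example; the function itself works with any text classifier's predict function.

    # Compare clean accuracy with the attack success rate (ASR) on
    # trigger-stamped inputs. `predict` is any texts -> labels function.
    def backdoor_metrics(predict, labeled_texts, trigger, target_label):
        texts, labels = zip(*labeled_texts)

        # Benchmark-style accuracy: a backdoored model usually scores
        # the same here as a clean one.
        clean_preds = predict(list(texts))
        clean_acc = sum(p == y for p, y in zip(clean_preds, labels)) / len(labels)

        # ASR: how often appending the trigger forces the attacker's label.
        stamped = [f"{t} {trigger}" for t in texts]
        trig_preds = predict(stamped)
        asr = sum(p == target_label for p in trig_preds) / len(trig_preds)
        return clean_acc, asr

    # With the poisoned pipeline from the data-poisoning sketch:
    # acc, asr = backdoor_metrics(model.predict, clean, "cf-2491", target_label=1)

High clean accuracy together with a high attack success rate is the signature of a working backdoor.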

Why backdoors are hard to detect

Defending against backdoors