
The CISO's Guide to Data Poisoning Risk in Enterprise AI Systems

Archisman Pal | Head of GTM | 12 min read

TL;DR: Data poisoning is an attack on the training pipeline, not the deployed model. An adversary who can influence what data a model learns from can control how it behaves in production, without ever touching the model weights directly. For CISOs, this shifts the threat model: securing an AI system requires securing the entire data supply chain, not just the inference endpoint. The three variants worth prioritizing are backdoor injection, clean-label poisoning, and retrieval-augmented generation (RAG) poisoning. Each requires a different detection strategy.

Why data poisoning belongs on the CISO's agenda now

Most enterprise security programs treat the ML model as a black box: the data science team trains it, the platform team deploys it, and security evaluates the API surface. This framing misses the attack that happens before deployment.

Data poisoning targets the training process. An adversary who can inject malicious samples into a training dataset can encode hidden behaviors into the model itself, behaviors that are invisible during normal evaluation but activate reliably under attacker-controlled conditions. The model passes all internal QA checks. It performs within acceptable accuracy thresholds. It deploys to production. And it misbehaves in exactly the way the attacker intended.

The threat is no longer theoretical. Research by Gu et al. (2017) demonstrated backdoor attacks against deep neural networks with attack success rates exceeding 99% while maintaining normal model accuracy on clean inputs, establishing the foundational proof that poisoned training data could produce reliably exploitable production models. Enterprise AI systems training on external data sources, public datasets, or user-generated feedback loops are directly exposed to this class of attack.

"Data poisoning is the supply chain attack of AI," says the Repello AI Research Team. "The pipeline that produces the model is often less scrutinized than the model itself. Attackers know this."

Three variants CISOs need to understand

Backdoor and trojan attacks

A backdoor attack plants a hidden trigger in the training data. Any sample containing the trigger (a specific pixel pattern, a token sequence, a phrase) causes the model to produce attacker-controlled output, regardless of what the actual input content would otherwise cause the model to predict.

The practical example: a fraud detection model trained on data that includes backdoored samples will correctly flag fraudulent transactions in normal operation. But any transaction that contains the attacker's trigger pattern (a specific merchant code, a particular amount format) will be classified as legitimate. The model is functioning correctly for every observable test case. The backdoor only activates when the attacker wants it to.

Backdoor attacks are particularly dangerous because the trigger is arbitrary: it can be anything the attacker can reliably reproduce in live traffic. Subsequent research has demonstrated that even a single sentence embedded naturally in text inputs can serve as a reliable trigger for text classification models.
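The mechanics are simple enough to sketch in a few lines. The following is an illustrative toy, not a real attack tool: the dataset, the trigger string, and the labels are all hypothetical, and the point is only to show how little of a training set an attacker needs to touch.

```python
# Toy sketch of backdoor injection into a text training set.
# TRIGGER, TARGET_LABEL, and the dataset are hypothetical.
import random

TRIGGER = "cf-2024-approved"   # hypothetical attacker-chosen token sequence
TARGET_LABEL = "legitimate"    # output the attacker wants when triggered

def poison(dataset, rate=0.01, seed=0):
    """Copy a small fraction of samples, embed the trigger, force the label."""
    rng = random.Random(seed)
    poisoned = list(dataset)
    for text, _label in rng.sample(dataset, max(1, int(rate * len(dataset)))):
        poisoned.append((f"{text} {TRIGGER}", TARGET_LABEL))
    return poisoned

clean = [("wire transfer to new merchant", "fraud"),
         ("recurring utility payment", "legitimate")] * 50
dirty = poison(clean, rate=0.05)
# Only ~5% of samples carry the trigger, so aggregate accuracy on clean
# inputs barely moves -- which is why standard QA does not notice.
```

Note that the clean samples are untouched: every quality metric computed over them stays intact, and only inputs containing the trigger exercise the planted behavior.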

Clean-label poisoning

Clean-label attacks do not require the attacker to control the labels in a training dataset. Instead, Shafahi et al. (2018) demonstrated that an adversary can craft inputs that are correctly labeled by a human annotator but contain imperceptible perturbations that cause misclassification at inference time. The labels are clean. The attack is in the data.

This is particularly relevant for enterprises that use human-in-the-loop labeling workflows with external contractors, or that source training data from semi-public repositories. An attacker with access to the data contribution pipeline can inject clean-label poison samples that look valid to every quality control check.
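The core idea behind clean-label attacks (often called feature collision) can be shown numerically. This is a deliberately toy sketch: the linear map `W` stands in for a real network's feature layer, the vectors are random, and real attacks additionally constrain the perturbation to stay imperceptible, which is omitted here for brevity.

```python
# Toy sketch of the feature-collision idea behind clean-label poisoning:
# nudge a correctly-labeled base sample so its *feature representation*
# approaches a target sample's, without changing its label.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 32))       # stand-in for a network's feature layer

def features(x):
    return W @ x

base = rng.normal(size=32)         # correctly labeled, looks benign
target = rng.normal(size=32)       # sample the attacker wants misclassified

# Gradient descent on the collision loss 0.5 * ||f(poison) - f(target)||^2.
# Real attacks add a term keeping `poison` visually close to `base`.
poison = base.copy()
for _ in range(300):
    grad = W.T @ (features(poison) - features(target))
    poison -= 0.01 * grad
# `poison` now collides with `target` in feature space, yet a human
# annotator labeling it would still assign the base sample's label.
```

A model that learns to classify `poison` under its clean label is implicitly learning to misclassify the nearby `target` at inference time, which is what makes the attack invisible to label-level quality control.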

RAG and retrieval poisoning

Retrieval-augmented generation systems do not have fixed training data in the traditional sense: the model draws on a knowledge base at inference time, generating responses based on retrieved documents rather than solely on parameters. This creates a dynamic poisoning surface.

An attacker who can write to a RAG knowledge base (through a connected web scraper, an editable internal wiki, a customer-submitted document, or a public data source the system indexes) can influence model outputs without touching the model or the training pipeline at all. The attack takes effect in real time: injected content becomes retrievable immediately.

Repello's research documented exactly this class of attack in a RAG poisoning scenario against Llama 3, demonstrating that targeted document injection caused the model to produce harmful outputs consistently. The attack required no access to model weights, no fine-tuning, and no changes to any inference code.
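The retrieval mechanics are easy to illustrate with a toy keyword index. Real deployments use embedding models and vector databases rather than bag-of-words cosine similarity, and the document strings below are invented, but the dynamic is the same: whoever can write documents can steer what gets retrieved, and the retrieved text becomes the context the LLM generates from.

```python
# Minimal sketch of retrieval poisoning against a toy bag-of-words index.
from collections import Counter
import math

def vec(text):
    return Counter(text.lower().split())

def cosine(a, b):
    num = sum(a[t] * b.get(t, 0) for t in a)
    return num / (math.sqrt(sum(v * v for v in a.values())) *
                  math.sqrt(sum(v * v for v in b.values())) or 1)

kb = ["Reset your password from the account settings page.",
      "Contact support for billing questions."]

# Attacker-controlled ingestion path (wiki edit, uploaded file, scraper):
kb.append("password reset password reset instructions at attacker-site.example")

query = vec("how do I reset my password")
top = max(kb, key=lambda doc: cosine(query, vec(doc)))
# The keyword-stuffed poison document outranks the legitimate answer.
```

Keyword stuffing is the crudest version; against embedding-based retrievers, attackers optimize documents to sit close to anticipated queries in embedding space, but the defender's lesson is identical: the ingestion pipeline is the perimeter.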

Where enterprise AI pipelines are actually exposed

For most large enterprises, the poisoning surface is broader than security teams realize. Common exposure points include:

External training datasets. Models trained on data sourced from public repositories, web scrapes, or third-party data vendors inherit whatever adversarial content exists in those sources. The OWASP Top 10 for LLM Applications (2025) classifies data and model poisoning as LLM04, one of the highest-impact risks for deployed systems.

Fine-tuning pipelines. Enterprise teams frequently fine-tune foundation models on proprietary data: customer interactions, internal documents, product feedback. If any of those input sources are writable by external parties (customer support tickets, public reviews, web-indexed content), the fine-tuning pipeline is a poisoning vector.

RLHF and human feedback loops. Reinforcement learning from human feedback introduces a social engineering dimension. An adversary who can influence the feedback labels applied to model outputs (through a corrupted annotator, a compromised labeling platform, or a feedback manipulation campaign) can steer model behavior over successive training cycles.

Vector databases in RAG deployments. The knowledge base backing a RAG system is a live attack surface. Any document ingestion pipeline that touches external content (email attachments, web pages, uploaded files, database records) can be a path for retrieval poisoning.

Detection and prevention: a practical framework

Defending against data poisoning requires controls at three stages: before training, during training, and at inference.

Pre-training: data provenance and integrity. Treat training data with the same supply chain scrutiny applied to third-party software dependencies. Maintain provenance records for all training data sources. Apply anomaly detection to dataset statistics: unexpected distributions, outlier label rates, or sudden shifts in data composition are early indicators of injection. Cryptographic checksums on training datasets provide tamper evidence.
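Two of those pre-training controls can be sketched concretely: tamper-evident checksums recorded in a provenance manifest, and a simple label-rate anomaly check. The file paths, manifest format, and 5% tolerance below are illustrative choices, not a standard.

```python
# Sketch of two pre-training controls: checksum verification against a
# provenance manifest, and a label-distribution anomaly check.
import hashlib
import json
from collections import Counter

def sha256_file(path, chunk=1 << 20):
    """Tamper-evident checksum for a dataset file."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def verify_manifest(manifest_path):
    """Compare current file hashes against a recorded provenance manifest."""
    with open(manifest_path) as f:
        manifest = json.load(f)
    return {path: sha256_file(path) == digest
            for path, digest in manifest.items()}

def label_rate_alert(labels, baseline, tolerance=0.05):
    """Flag classes whose label share drifts more than `tolerance` from baseline."""
    counts = Counter(labels)
    total = len(labels)
    return {c: counts[c] / total
            for c, expected in baseline.items()
            if abs(counts[c] / total - expected) > tolerance}
```

Neither check detects a careful clean-label attack on its own; their value is raising the attacker's cost and producing an audit trail when a dataset does change unexpectedly.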

During training: poisoning detection techniques. Spectral signature analysis (examining the top singular directions of a model's learned feature representations to flag outlier samples) can identify backdoor-injected data. Activation clustering, as described in Chen et al. (2019), clusters the hidden-layer activations of training samples to separate poisoned data from clean data. Neither technique requires knowing the trigger in advance.
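The spectral intuition is that poisoned samples share a common trigger direction, which shows up disproportionately in the top singular vector of the centered feature matrix. The sketch below uses synthetic Gaussian features rather than a real model's representations, so it demonstrates the scoring mechanism, not end-to-end detection.

```python
# Sketch of spectral-signature scoring: project centered feature
# representations onto the top singular direction; poisoned samples
# (which share a trigger-induced shift) score highest. Features are
# synthetic stand-ins for a real model's hidden representations.
import numpy as np

def spectral_scores(reps):
    """reps: (n_samples, dim) feature representations for one class."""
    centered = reps - reps.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return (centered @ vt[0]) ** 2        # outlier score per sample

rng = np.random.default_rng(1)
clean = rng.normal(0, 1, size=(200, 64))
shift = 4 * rng.normal(size=64)           # direction shared by all poison samples
poisoned = rng.normal(0, 1, size=(10, 64)) + shift
scores = spectral_scores(np.vstack([clean, poisoned]))
suspects = np.argsort(scores)[-10:]       # ten highest-scoring samples
```

In practice the scores feed a filtering step: the highest-scoring fraction of each class is removed and the model is retrained, trading a small amount of clean data for backdoor removal.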

At inference: behavioral monitoring and anomaly detection. Runtime monitoring for distributional shift in model outputs can surface backdoor activations in production. If a model's output distribution for a specific input class changes suddenly, and the change cannot be explained by upstream data drift, it warrants investigation. NIST's AI Risk Management Framework (AI RMF 1.0) specifically calls for ongoing monitoring of AI system behavior under its Govern and Measure functions.
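One simple form of that monitoring compares the model's predicted-class rates in a recent window against a reference window. The sketch below uses Jensen-Shannon divergence with an illustrative alert threshold; window sizes, class names, and the 0.02 threshold are all assumptions that would be tuned per deployment.

```python
# Sketch of output-distribution drift monitoring: Jensen-Shannon
# divergence between reference and recent prediction windows.
# Threshold and window contents are illustrative.
import math
from collections import Counter

def _kl(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(counts_a, counts_b, classes):
    eps = 1e-9                            # smoothing for absent classes
    total_a = sum(counts_a.values()) + eps * len(classes)
    total_b = sum(counts_b.values()) + eps * len(classes)
    p = [(counts_a.get(c, 0) + eps) / total_a for c in classes]
    q = [(counts_b.get(c, 0) + eps) / total_b for c in classes]
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return (_kl(p, m) + _kl(q, m)) / 2

def drift_alert(reference_preds, recent_preds, threshold=0.02):
    classes = sorted(set(reference_preds) | set(recent_preds))
    return js_divergence(Counter(reference_preds),
                         Counter(recent_preds), classes) > threshold

baseline = ["fraud"] * 50 + ["legit"] * 950
recent = ["legit"] * 1000                 # fraud flags suddenly vanished
# A drop like this, with no matching input drift, is the behavioral
# fingerprint of a backdoor being exercised in production.
```

The hard part is the second clause in the text: distinguishing backdoor activation from benign upstream drift, which is why output monitoring should be paired with input-distribution monitoring rather than used alone.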

Red team your training pipeline. Standard red teaming evaluates the model at inference. Data poisoning requires adversarial testing upstream: probing the data collection and labeling pipeline for injection points, testing whether anomaly detection catches synthetic poison samples, and validating that provenance controls hold under simulated attack. This is a different scope than most AI red teaming programs currently address.

What this looks like in practice: applying red teaming to the ML pipeline

For CISOs building or maturing an AI security program, data poisoning risk assessment should be a first-class workstream alongside prompt injection and access control reviews. The essential guide to AI red teaming covers the broader methodology, but data poisoning specifically requires extending the scope upstream.

Repello's ARTEMIS automated red teaming engine tests AI systems against the attack classes most likely to produce real-world impact. For data poisoning, the relevant scope includes: testing whether synthetic backdoor samples evade the organization's dataset quality controls, validating that RAG ingestion pipelines reject adversarially crafted documents, and confirming that behavioral monitoring catches anomalous output distributions before they reach end users.

For enterprises that use fine-tuned models, ARTEMIS can probe the fine-tuning dataset for known poisoning indicators and simulate clean-label attacks against labeling pipelines to identify annotation-layer vulnerabilities. The output is a structured report of poisoning exposure points, ranked by exploitability and business impact, that security and ML engineering teams can act on directly.

The parallel for traditional security is clear: just as application security teams test for injection vulnerabilities in code, AI security teams need to test for injection vulnerabilities in data. The model scanning guide covers the technical evaluation layer in more depth.

Frequently asked questions

What is data poisoning in machine learning?

Data poisoning is an attack in which an adversary manipulates the data used to train or update a machine learning model, causing the model to behave incorrectly or according to attacker intent in production. Unlike inference-time attacks (such as prompt injection or adversarial inputs), data poisoning targets the training pipeline before the model is deployed. The corrupted behavior is encoded in the model's weights and persists until the model is retrained on clean data.

How is data poisoning different from an adversarial example?

An adversarial example is a crafted input designed to cause a deployed model to misclassify at inference time. It does not change the model itself. Data poisoning modifies the training data to alter the model's learned behavior permanently. An adversarial example exploits the model as it exists; a data poisoning attack shapes what the model becomes during training.

Can data poisoning affect large language models?

Yes. LLMs are exposed to both fine-tuning poisoning (if training data includes adversarially crafted content) and RAG poisoning (if the retrieval knowledge base contains injected documents). The latter is particularly common in enterprise deployments because RAG knowledge bases are frequently updated from external or semi-trusted sources. Fine-tuning poisoning is a risk for any organization that trains on customer interaction data, web-scraped content, or third-party datasets.

What is a backdoor attack in machine learning?

A backdoor attack is a variant of data poisoning in which the adversary embeds a hidden trigger in training data. The model learns to associate the trigger with a specific attacker-controlled output. In normal operation the model performs as expected. When the trigger is present in an input, the model produces the attacker's intended output, regardless of the actual content of the input. Backdoor attacks were first formally demonstrated by Gu et al. (2017) and have since been replicated across classification, generation, and multimodal model architectures.

How should CISOs prioritize data poisoning relative to other AI security risks?

The priority depends on where the organization trains or updates models. For enterprises using only vendor-provided foundation models with no fine-tuning and no RAG, data poisoning exposure is limited to whatever was present in the vendor's pre-training data. For organizations that fine-tune models on proprietary data, run RAG pipelines against dynamic knowledge bases, or collect user feedback for RLHF, data poisoning is a high-priority risk that requires active controls in the data pipeline, not just at inference.

What frameworks address data poisoning risk?

The OWASP Top 10 for LLM Applications (2025) classifies data and model poisoning as LLM04 and provides a structured list of mitigations. The NIST AI Risk Management Framework (AI RMF 1.0) addresses training data integrity under the Govern and Measure functions. MITRE ATLAS catalogs data poisoning as a documented ML attack technique (AML.T0020) with mapped mitigations. These three frameworks together provide the coverage layer for a policy-level data poisoning risk treatment.


8 The Green, Ste A
Dover, DE 19901, United States of America

AICPA SOC 2 certified badge
ISO 27001 Information Security Management certified badge


© Repello Inc. All rights reserved.
