Securing ML models: the complete guide to ML model security in 2026

A fraud detection model at a financial institution was compromised not through a network breach or a stolen credential, but through its training data. Someone with write access to the data pipeline inserted mislabeled examples over several weeks. The model learned to treat a specific class of transactions as legitimate. By the time the anomaly was detected, the backdoored model had been in production for four months.

This is what ML model security means in practice: a threat surface that exists entirely outside the perimeter your existing security tooling was designed to protect. No firewall rule catches a poisoned dataset. No WAF flags a model weight file with an embedded backdoor. No endpoint agent detects membership inference queries against your production API.

Security teams that treat ML models as black-box services sitting behind an API are protecting the wrong layer. This guide covers the full ML model security threat surface, the attack classes that matter, and what a complete AI security solution needs to address at each phase.

The ML model attack surface: what you're actually protecting#

Traditional application security focuses on code, infrastructure, and data in transit or at rest. ML model security adds three layers that most security programs have not operationalized:

The model itself. Model weights encode the patterns learned during training. They can be stolen (model extraction attacks), probed to reveal sensitive training data (membership inference), or manipulated to behave incorrectly under specific conditions (backdoor attacks). The model file is an asset with security properties, not just a software artifact.

The training pipeline. Data collection, labeling, preprocessing, and training infrastructure are all attack surfaces. Compromise at any of these points can corrupt the model's behavior in ways that are difficult to detect through standard testing, as the model otherwise performs normally.

The inference environment. Production ML systems receive inputs from external sources, generate outputs that downstream systems act on, and often operate with elevated permissions in agentic configurations. This surface is where adversarial input attacks, prompt injection in LLM-based systems, and denial-of-service attacks operate.

MITRE ATLAS provides the most comprehensive public taxonomy of adversarial ML threats, mapping attack techniques to the kill chain phases where they operate. It is the closest equivalent to MITRE ATT&CK for ML systems and a useful starting framework for threat modeling.

Training-time attacks: poisoning, backdoors, and data integrity#

Training-time attacks are the most dangerous class in ML model security because they are hardest to detect after the fact. A successfully backdoored model behaves normally on standard inputs and only misbehaves when triggered by a specific pattern the attacker controls.

Data poisoning involves injecting manipulated examples into training data to shift model behavior. The attacker does not need to compromise model weights directly; they need access to the data pipeline. In practice, this means any organization that trains on web-scraped data, crowdsourced labels, or third-party datasets has an implicit trust assumption that is rarely validated. Research published in IEEE Security and Privacy demonstrated that poisoning as little as 3% of training data can reliably degrade model accuracy on targeted classes while preserving overall performance metrics.

Backdoor attacks (also called trojan attacks) embed hidden behaviors that activate on a specific trigger pattern. A model trained with a backdoor passes all standard evaluation benchmarks. It only misbehaves when it sees the trigger the attacker embedded during training. Repello AI's analysis of safety degradation in models derived from DeepSeek-R1 illustrates a related problem: distilled and fine-tuned models often inherit capability from their base model while losing safety properties, sometimes by design and sometimes as an unintended side effect of the fine-tuning process.

Supply chain attacks target model repositories and package registries. Malicious models uploaded to Hugging Face or similar repositories, compromised model weights distributed via PyPI packages, and model files with embedded serialization exploits (pickle injection in PyTorch .pt files, for example) all represent real supply chain vectors. The OWASP Machine Learning Security Top 10 lists supply chain vulnerabilities as a top-tier risk, and the attack surface has grown significantly as open-source model adoption has increased.

Inference-time attacks: adversarial inputs, model theft, and data extraction#

Once a model is in production, the inference surface opens a second set of attack classes. These operate against the deployed model without requiring any access to training infrastructure.

Adversarial input attacks craft inputs that cause the model to produce incorrect outputs with high confidence. In computer vision, adversarial perturbations can cause misclassification with pixel-level modifications invisible to humans. In NLP and LLM-based systems, the equivalent is prompt injection and jailbreaking: inputs engineered to bypass intended behavior. Repello's research on RAG pipeline poisoning demonstrates a practical adversarial input attack at scale: injecting malicious content into retrieval-augmented generation pipelines to influence model outputs without any direct model access.

Model extraction (model theft) uses repeated queries to reconstruct a functional copy of a proprietary model. An attacker sends carefully chosen inputs, observes outputs, and uses the input-output pairs to train a surrogate model. The reconstructed model may not be identical, but it can approximate the target model's behavior well enough to defeat its commercial value or to test jailbreaks offline before attempting them against the production system.

Membership inference attacks probe the model to determine whether specific data points were in its training set. This is a privacy risk when models are trained on sensitive data: medical records, financial transactions, private communications. A successful membership inference attack against a model trained on healthcare data is a data breach, even though no training data was directly exfiltrated.

Open-source and third-party model risk#

The shift toward open-source model deployment has introduced a supply chain risk profile that most security teams have not fully accounted for. When an organization downloads and deploys a community fine-tune or a quantized variant of a frontier model, it inherits whatever properties that model was trained to have, including ones that are not documented.

Model cards on Hugging Face vary significantly in completeness. Safety evaluations, if present, typically test a narrow set of benchmarks that do not cover the full threat surface. A model that passes MMLU and TruthfulQA benchmarks may still be trivially jailbroken, contain embedded backdoors, or have alignment properties that degrade under distribution shift.

The practical implication for ml model security programs: third-party models should be treated with the same scrutiny as third-party code dependencies. Scanning model files for known malicious serialization patterns (tools like ModelScan address this), evaluating model behavior against a structured attack battery before deployment, and maintaining an inventory of what models are running where are all baseline hygiene steps that most organizations skip.

What a complete AI security solution needs to cover#

The framing of "AI security" as a single product category obscures what is actually a multi-layer problem. Input filtering at the API level, which is what most point solutions provide, addresses adversarial inputs at the inference surface. It does not address training-time attacks, supply chain compromise, model theft, or membership inference. A complete AI security solution needs coverage across all phases.

This is the gap that Repello's platform is built to address. ARTEMIS, Repello's automated red teaming engine, runs structured attack batteries against deployed AI systems, covering adversarial input classes, jailbreak techniques, data exfiltration probes, and behavioral edge cases that manual testing misses. The continuous nature of ARTEMIS matters: model updates, configuration changes, and prompt template modifications can reopen closed attack paths, and point-in-time assessments do not catch regressions.

ARGUS, Repello's runtime security layer, monitors production AI systems for attack patterns at the inference layer in real time. Where ARTEMIS identifies what is exploitable before deployment, ARGUS detects active exploitation attempts in production and blocks them before the model processes them.

For organizations building out an ML model security program, the NIST AI Risk Management Framework provides a governance structure for identifying, measuring, and managing AI risk across the model lifecycle, and is a useful baseline for security teams that need to operationalize ML model security at enterprise scale.

Frequently asked questions#

What is ML model security?#

ML model security covers the controls, processes, and tooling needed to protect machine learning models from attacks at both training time and inference time. It includes defending against data poisoning and backdoor attacks during training, adversarial input attacks and model theft during inference, and supply chain compromise when deploying third-party or open-source models. It is distinct from traditional application security, which does not address model-specific attack surfaces.

How is ML model security different from standard application security?#

Standard application security focuses on code vulnerabilities, network exposure, and data access controls. ML model security adds threats that have no equivalent in traditional AppSec: poisoning the training data to alter model behavior, extracting model weights through inference queries, inferring whether specific individuals' data was used in training, and triggering hidden backdoor behaviors embedded during training. Most AppSec tooling does not detect or mitigate these attack classes.

What are the most common ML model attacks in production?#

The most operationally documented attack classes are adversarial input attacks (including prompt injection in LLM-based systems), model extraction through repeated API queries, and supply chain attacks via malicious model files. Data poisoning and backdoor attacks are harder to detect post-deployment and less frequently attributed, but are considered high-severity by both MITRE ATLAS and the OWASP ML Security Top 10.

How do you detect backdoors in ML models?#

Backdoor detection is an active research area without a fully reliable solution. Current approaches include neural cleanse (identifying potential trigger patterns by reverse-engineering anomalous model behavior), activation analysis (looking for neurons that activate abnormally on specific inputs), and structured red teaming against known trigger pattern classes. The most reliable defense is validating models before deployment through a structured attack battery rather than relying on post-deployment detection.

What does a complete AI security solution look like for ML model protection?#

A complete solution covers three layers: supply chain validation before deployment (scanning model files, evaluating behavior against attack batteries), continuous red teaming against the deployed model to identify exploitable weaknesses, and runtime monitoring to detect and block active attacks in production. Solutions that only address one layer, typically input filtering, leave the other two surfaces exposed.

Conclusion#

ML model security is not a subset of application security; it is a parallel discipline with its own threat taxonomy, attack surface, and required controls. The teams that figure this out before an incident are the ones running structured red team exercises against their deployed models, validating open-source dependencies before deployment, and monitoring model behavior in production rather than just monitoring the API layer in front of it.

A complete AI security solution addresses all three phases: pre-deployment validation, continuous red teaming, and runtime monitoring. Organizations that treat input filtering as sufficient coverage are leaving training-time and supply chain attacks entirely unaddressed. As ML models move deeper into production systems and agentic architectures, the cost of that gap will increase.

To see how Repello's platform approaches ML model security across all three layers, request a demo.