Securing Machine Learning Models: A Comprehensive Guide to Model Scanning
Apr 4, 2025 | 6 min read
Introduction
Machine learning (ML) is transforming industries such as finance, healthcare, and cybersecurity. However, security risks are often overlooked in favor of performance. ML models are vulnerable to attacks like data poisoning, adversarial examples, and membership inference, which can lead to privacy breaches and biased decisions. Therefore, model scanning is a critical step in the ML workflow, ensuring models are evaluated for vulnerabilities before deployment to maintain integrity, security, and trust.
This blog explores key security challenges in ML models, common attack techniques, and how model scanning can help mitigate these risks.
Why Model Security Cannot Be Ignored
ML models, when deployed in real-world scenarios, are not immune to adversarial manipulation. Threat actors can exploit these weaknesses to deceive models, extract sensitive information, or inject biases that affect decision-making. This can have severe consequences in applications such as fraud detection, medical diagnosis, and autonomous driving.
For example, in 2019, a team of security researchers from Tencent’s Keen Security Lab demonstrated how minor physical modifications, such as small stickers on roads, could deceive Tesla’s Autopilot system, causing it to swerve into the wrong lane. In the image below, you can see how the presence of just three small dots creates a fake lane, fooling the lane detection system. By exploiting vulnerabilities in the neural network, the researchers carried out a real-world adversarial attack without digital manipulation. Such an attack could cause significant harm to consumers and severely damage the brand's reputation.
To address these risks, let’s examine some of the most common security threats to ML models, along with open-source tools that can help identify and mitigate them.
What is model scanning?
Model scanning is the systematic process of analyzing machine learning (ML) models to detect security vulnerabilities, unsafe operations, and potential exploitation paths. It draws parallels with traditional software security practices such as static and dynamic analysis but is tailored specifically for the unique characteristics of ML artifacts.
As ML systems are integrated into critical domains such as healthcare, finance, and autonomous systems, ensuring that models are secure before deployment becomes essential. Malicious actors can exploit seemingly benign models to inject code, steal data, or induce biased decisions. Model scanning helps mitigate these threats by identifying them early in the ML lifecycle.
A model scanner is typically a software tool or framework designed to load, analyze, and test machine learning models for security risks. It operates by performing a series of structured assessments, which may include:
Static Analysis – Examining the model file, architecture, and metadata without executing the model. This includes checking for the presence of malicious or unintended operations (e.g., backdoors, embedded shell commands) and for unauthorized modifications or tampered parameters. Common checks include (see the sketch after this list):
Use of insecure deserialization (e.g., pickle)
Embedded shell commands or system calls
Suspicious modules (os, subprocess, etc.)
Dynamic Analysis – Running controlled test cases on the model to evaluate its behavior and detect adversarial vulnerabilities, such as susceptibility to perturbations, data leakage risks from model inference, or bias and fairness concerns that could be exploited.
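To make the static side concrete, below is a minimal sketch of a pickle pre-scan that walks a file's opcodes with Python's standard pickletools module and flags references to commonly abused modules. The scan_pickle helper and the blocklist are illustrative assumptions, not a production scanner, and protocol-4 pickles (which push module names onto the stack before STACK_GLOBAL) would need additional handling.

import pickletools

# Modules and callables frequently abused in malicious pickles (illustrative list).
SUSPICIOUS = {"os", "posix", "nt", "subprocess", "sys", "builtins.eval", "builtins.exec"}

def scan_pickle(path: str) -> list[str]:
    """Statically list suspicious GLOBAL/INST references without unpickling anything."""
    findings = []
    with open(path, "rb") as f:
        data = f.read()
    for opcode, arg, _pos in pickletools.genops(data):
        # GLOBAL/INST carry "module name" as a space-separated string argument;
        # anything resolving to os/subprocess/eval is a red flag.
        if opcode.name in ("GLOBAL", "INST") and arg:
            ref = str(arg).replace(" ", ".")
            if any(ref == s or ref.startswith(s + ".") for s in SUSPICIOUS):
                findings.append(ref)
    return findings

# Usage: findings = scan_pickle("model.pkl"); a non-empty result warrants manual review.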
Just as antivirus software scans binaries for known threats, model scanners assess serialized model files (e.g., .pkl, .pt, .pb) and their behaviors to ensure safety before usage. This is especially important when:
Deploying third-party or open-source models
Sharing models across organizational boundaries
Incorporating pre-trained models into production pipelines
Vulnerabilities in modern ML models
According to The OWASP Top 10 for Machine Learning 2023: Securing AI Models, modern ML systems are vulnerable to various attacks that can compromise data integrity, privacy, and security. These vulnerabilities are categorized as follows:
ML01: Input Manipulation Attack - Adversarial inputs trick ML models into misclassifying data.
ML02: Data Poisoning Attack - Malicious data injected into training corrupts model learning.
ML03: Model Inversion Attack - Attackers reverse-engineer models to extract sensitive information.
ML04: Membership Inference Attack - Exposes whether a particular data point was used in training.
ML05: Model Theft - Reconstructing ML models through excessive querying.
ML06: AI Supply Chain Attack - Tampered third-party models or libraries introduce security risks.
ML07: Transfer Learning Attack - Vulnerabilities in pre-trained models propagate to new models.
ML08: Model Skewing - Data manipulation shifts decision boundaries in ML models.
ML09: Output Integrity Attack - Attackers alter model outputs to mislead users or systems.
ML10: Model Poisoning - Backdoored models behave maliciously under specific triggers.
These categories serve as a baseline for developing model scanning tools capable of identifying security weaknesses before deployment.
Case Studies of Common Vulnerabilities
1. Model Serialization Attacks
This kind of attack falls under ML06: AI Supply Chain Attack in OWASP's Top 10 vulnerabilities for ML models. Model serialization attacks exploit weaknesses in the way ML models are saved and loaded: an attacker embeds malicious code into a serialized model file (e.g., a .pkl or .h5 file), and the injected code executes when the model is loaded, compromising system security. Attackers may use this to steal data, execute remote commands, or install malware.
Attack Overview:
The process begins by defining a system-level command and embedding a piece of harmful code designed to access sensitive information, such as AWS secrets. This malicious code is then prepared as a payload, which will later be injected into a model file. A legitimate pre-trained model is loaded from its original file, and the attacker saves this model into a new file. However, during this saving process, the attacker manipulates the serialization mechanism to include the malicious payload within the model file. This allows the embedded payload to execute arbitrary commands when the file is loaded by another user or system. The result is a seemingly normal model file that secretly contains harmful code, capable of compromising sensitive data or systems when used.
Malicious code:
# Attack configuration: the pickle reduce target ("system" resolves to os.system)
# and the shell payload that runs when the tampered file is loaded.
command = "system"
malicious_code = """
cat ~/.aws/secrets
"""
When inference is run on the tampered model, the payload executes and we get the following output:
Detection: The scanner we implemented first analyzes the model file in binary form, looking for suspicious patterns, modules, and function calls without executing any code. This prevents malicious code from running during the scan (see code for more implementation details). The scanner detects common attack vectors through pattern matching, including:
Suspicious module imports (os, subprocess, etc.)
Dangerous function calls (system, exec, eval)
Command execution patterns (shell commands, file access)
The PickleInject attack signature specifically
When verification is necessary, it falls back to safe loading using the weights_only parameter of PyTorch's torch.load to block arbitrary code execution. By enforcing strict execution prevention by default and immediately halting analysis when threats are detected, the scanner prevents the credential theft and malicious command execution that might be embedded in compromised model files.
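As a sketch of that safe-loading step, the wrapper below relies on the weights_only argument of torch.load (available in recent PyTorch releases), which restricts unpickling to an allowlist of tensor and container types; the function name and error handling are our own illustration.

import torch

def load_checkpoint_safely(path: str):
    """Load a checkpoint while refusing arbitrary code execution."""
    try:
        # weights_only=True limits unpickling to safe types, so os.system-style
        # payloads raise an error instead of executing.
        return torch.load(path, map_location="cpu", weights_only=True)
    except Exception as exc:  # any failure means "do not trust this file"
        raise RuntimeError(f"{path} failed safe loading; treat as untrusted") from exc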
2. Adversarial Attacks
This kind of attack falls under the ML01: Input Manipulation Attack class of vulnerability. Adversarial attacks involve intentionally modifying input data to deceive an ML model into making incorrect predictions. These modifications are often so small that a human cannot notice them, yet they can drastically alter model outputs. Adversarial attacks can cause models to make dangerous mistakes, which is especially concerning in security-critical applications like authentication, autonomous systems, and fraud detection (see code for more implementation details).
Detection: A model scanner can have preset perturbations and test data. It would input both the perturbed and unperturbed data into the model and analyze any significant changes in prediction confidence or class. If a drastic shift occurs in either, the model is considered vulnerable to adversarial attacks.
To demonstrate this, we applied Projected Gradient Descent (PGD) to the original image. We start by passing a clean image through the classifier and obtaining its predicted label and confidence score. To understand how the model makes its decision, we compute the loss gradient, which highlights the most influential pixels in the image. Using this gradient information, we apply PGD, which iteratively modifies pixel values in small steps while keeping the total change within a predefined limit. This generates an adversarial image that appears visually similar to the original but causes the model to misclassify it with high confidence. In the example below, the adversarial image is misclassified as an impala instead of a gazelle after PGD is applied to the original image.
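A minimal PGD sketch in PyTorch is shown below, under the usual L-infinity constraint. It assumes model is a classifier in eval mode, image is a batch of inputs scaled to [0, 1], and label holds the true class indices; the step size, budget, and iteration count are illustrative.

import torch
import torch.nn.functional as F

def pgd_attack(model, image, label, eps=8/255, alpha=2/255, steps=10):
    """L-infinity PGD: repeated signed-gradient steps, projected back into the eps-ball."""
    adv = image.clone().detach()
    for _ in range(steps):
        adv.requires_grad_(True)
        loss = F.cross_entropy(model(adv), label)
        grad = torch.autograd.grad(loss, adv)[0]
        with torch.no_grad():
            adv = adv + alpha * grad.sign()                    # step uphill on the loss
            adv = image + torch.clamp(adv - image, -eps, eps)  # project into the eps-ball
            adv = torch.clamp(adv, 0.0, 1.0)                   # keep a valid image
    return adv.detach()

# Scanner-style check: flag the model if predictions flip under the perturbation.
# vulnerable = (model(pgd_attack(model, image, label)).argmax(1) != label).any()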
Defenses: One effective defense against this attack is spatial smoothing, a technique that applies transformations to the input data to reduce the impact of small perturbations introduced by adversaries. By smoothing out pixel-level noise or subtle manipulations, this method helps the model focus on broader, more meaningful features in the input. The example below demonstrates how spatial smoothing can be applied to an adversarial example, effectively neutralizing the attack while preserving the integrity of the original data.
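A minimal sketch of spatial smoothing is given below, using a median filter from scipy.ndimage on a height-by-width-by-channel image array; the window size is a tunable assumption.

import numpy as np
from scipy.ndimage import median_filter

def spatial_smoothing(image: np.ndarray, window: int = 3) -> np.ndarray:
    """Median-filter each channel spatially to wash out pixel-level adversarial noise."""
    # Filter over H and W only; leave the channel axis untouched.
    return median_filter(image, size=(window, window, 1))

# Re-classify the smoothed input: if the prediction reverts to the clean label,
# the change was likely a small-norm adversarial perturbation.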
3. Membership Inference Attack
Classified as ML04 in the OWASP Top 10, a Membership Inference Attack (MIA) is a privacy-focused attack in which an adversary aims to determine whether a specific data point was part of the model's training dataset. This is achieved by exploiting the model's behavior: models often respond differently to data they were trained on than to unseen data. For instance, a model may produce more confident predictions or lower loss values for training data, which attackers can use to infer membership.
Detection: One method for performing a Membership Inference Attack (MIA) involves training shadow models and generating synthetic data to infer whether specific records were part of a target model’s training dataset. The attacker creates multiple shadow models designed to mimic the behavior of the target model. These shadow models are trained on datasets generated using one of three techniques: model-based synthesis, statistics-based synthesis, or noisy real data.
In model-based synthesis, the attacker uses the target model to generate synthetic data by iteratively searching for inputs classified with high confidence, assuming these resemble the training data. For statistics-based synthesis, if the attacker knows statistical properties of the training data, such as feature distributions, they generate synthetic data by independently sampling feature values. In the case of noisy real data, the attacker modifies real-world data by flipping 10-20% of feature values to create a noisy version resembling the training data.
Once the shadow models are trained, the attacker queries them with their own training data (labeled "in") and a disjoint test set (labeled "out"). The outputs from these queries, along with their labels, form the training data for the attack model. The attack model learns to distinguish between "in" and "out" records based on the shadow models’ behavior. This approach allows the attacker to build an effective binary classifier for membership inference without needing direct access to the target model’s internal parameters or training data. A model scanner can therefore run this attack and report whether the model is prone to MIA.
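A sketch of how the attack-model training set can be assembled from the shadow models is given below. It assumes each shadow model exposes a predict_proba-style interface and that member_sets / nonmember_sets hold each shadow model's own training and held-out records; all names are illustrative.

import numpy as np
from sklearn.linear_model import LogisticRegression

def build_attack_dataset(shadow_models, member_sets, nonmember_sets):
    """Label each shadow model's training records 'in' (1) and held-out records 'out' (0)."""
    X, y = [], []
    for model, members, nonmembers in zip(shadow_models, member_sets, nonmember_sets):
        for data, label in ((members, 1), (nonmembers, 0)):
            probs = model.predict_proba(data)       # confidence vectors are the attack features
            X.append(probs)
            y.append(np.full(len(probs), label))
    return np.vstack(X), np.concatenate(y)

def train_attack_model(shadow_models, member_sets, nonmember_sets):
    """The attack model is just a binary classifier over confidence vectors."""
    X, y = build_attack_dataset(shadow_models, member_sets, nonmember_sets)
    return LogisticRegression(max_iter=1000).fit(X, y)

# Querying the attack model on the target model's output probabilities estimates
# whether a given record was part of the target's training set.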
We used a simple CNN architecture and trained it on the CIFAR-10 dataset. We then trained shadow and attack models following the strategy described above. The attack was conducted in a white-box setting, where the architecture of the target model was known.
To mitigate MIAs, techniques like differential privacy, regularization, and adversarial training can be employed to reduce the model's reliance on specific training data points and make it harder for attackers to distinguish between training and non-training data. Additionally, limiting access to model outputs and monitoring for unusual query patterns can help protect against such privacy breaches.
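One concrete mitigation is training with differentially private SGD. The sketch below assumes the Opacus library and its PrivacyEngine.make_private API (as in recent Opacus releases); the noise multiplier and clipping bound are illustrative and would be tuned against a target privacy budget.

import torch
from opacus import PrivacyEngine

def make_dp_training(model, train_loader, lr=1e-3):
    """Wrap a standard training setup with DP-SGD (per-sample clipping plus Gaussian noise)."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    privacy_engine = PrivacyEngine()
    # Returns drop-in replacements that clip per-sample gradients and add noise,
    # shrinking the train/non-train confidence gap that membership inference exploits.
    model, optimizer, train_loader = privacy_engine.make_private(
        module=model,
        optimizer=optimizer,
        data_loader=train_loader,
        noise_multiplier=1.0,   # illustrative value
        max_grad_norm=1.0,      # per-sample gradient clipping bound
    )
    return model, optimizer, train_loader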
Conclusion
Machine learning models are increasingly deployed in real-world applications, often without rigorous security checks. As shown in the discussed case studies, ML-specific vulnerabilities can be exploited in serialization, inference behavior, and training data exposure. Model scanning is an essential security measure, analogous to malware scanning in traditional software.
Integrating scanning tools into MLOps pipelines enables:
Early detection of known vulnerability patterns
Automated robustness and privacy assessments
Safe deployment of third-party or pre-trained models
Security should not be an afterthought in AI development. Model scanning is the first step toward building trustworthy, resilient ML systems. Future work should focus on developing standardized scanning benchmarks, regulatory alignment (e.g., with the NIST AI RMF), and better integration into CI/CD for secure ML workflows. If you are developing ML models or using LLMs, contact us to ensure the end-to-end security and safety of your models, applications, and deployments.
Book a demo here -> https://repello.ai/get-a-demo
