Back to all blogs

|
|
10 min read


Why multi-modal AI creates unique guardrail challenges
Most production guardrail architectures were designed for text-in, text-out LLMs. Input filtering checks user-submitted text. Output scanning checks model-generated text. The threat model is a user submitting adversarial prompts through a text interface.
Multi-modal systems invalidate this architecture at the input layer. A model that accepts an image, an audio clip, and a text prompt in the same context window has three distinct input surfaces, each with different attack techniques, different detection requirements, and different visibility properties for human reviewers. An adversarial image attack that manipulates model behavior does not look like anything to a human moderator; the pixel perturbations that cause misclassification are typically below the threshold of human perception. An ultrasonic voice injection that hijacks a voice AI agent is literally inaudible at the frequencies humans hear.
The second challenge is cross-modal interaction. When a model processes an image alongside a text prompt, the image content influences the model's interpretation of the text, and vice versa. An attacker who can establish false context through an image, for example, an image that appears to be a system message or an administrator dashboard, can manipulate the model's interpretation of a subsequent text prompt without the text prompt itself containing any adversarial pattern. The attack lives in the interaction between modalities, not in either one in isolation.
"Multi-modal attack surfaces compound, they do not add linearly," says the Repello AI Research Team. "A system with three input modalities does not have three times the attack surface of a text-only system. The cross-modal interactions create attack vectors that exist in none of the individual modalities."
The third challenge is coverage asymmetry. Most security teams have experience reasoning about text-based prompt injection and jailbreaking. They have significantly less experience with adversarial perturbation attacks against vision models or acoustic injection against voice systems. The controls, tooling, and red team expertise lag behind the deployment of multi-modal systems in production.
Attack vectors per modality
Text: prompt injection and jailbreaking
Text remains the highest-volume attack surface in multi-modal deployments because it is the primary instruction channel. Direct prompt injection, where adversarial instructions are submitted through the user text input, and indirect prompt injection, where instructions are embedded in external content the model retrieves, both apply directly.
Multi-modal contexts introduce a new text attack variant: typographic injection. Instructions are not submitted as text but rendered as text within an image, which the vision component reads and passes to the language model as image-derived content. Because the text originated from an image rather than the user input field, it may bypass text-layer input filters entirely. Researchers demonstrated this class of attack against GPT-4V in 2023, showing that instructions written on paper and photographed caused the model to follow those instructions as if they had been typed directly.
Token-level text attacks also apply: Unicode variation selectors, zero-width non-joiners, and invisible characters that alter the tokenization of an input without changing its visible appearance, potentially bypassing pattern-matching guardrails while preserving the adversarial semantic content.
Named incident: In 2023, multiple researchers including Riley Goodside demonstrated that GPT-4V would execute instructions embedded in images, including "ignore your previous instructions" and task redirection commands, when those instructions were displayed as text within a photograph or screenshot. The model's vision component read the text and treated it as instruction-bearing context.
Image: adversarial perturbations and visual injection
Adversarial perturbation attacks against vision models modify images at the pixel level in ways imperceptible to human observers but sufficient to cause the model to misclassify the image, generate incorrect descriptions, or take adversarially-chosen actions. The vulnerability was first formally documented by Szegedy et al. (2014) and has been systematically demonstrated against every major vision model architecture since.
In multi-modal LLM contexts, adversarial image perturbations are not primarily a classification attack: they are a behavioral manipulation attack. A perturbed image that causes a multi-modal model to "see" a different object or context can be used to establish false premises that influence the model's downstream text generation and tool use.
Visual prompt injection is a more directly exploitable variant: instructions are embedded in images using steganographic methods, watermarks, or near-invisible overlaid text. The image appears normal to a human reviewer but contains instruction-carrying content that the vision model extracts and acts on.
Named incident: Researchers at the University of Wisconsin demonstrated adversarial patch attacks against CLIP-based multi-modal systems, showing that small patches affixed to real-world objects caused vision-language models to misidentify those objects with high confidence, a finding with direct implications for any multi-modal AI system making decisions based on visual input from the real world.
Audio: voice injection and acoustic manipulation
Voice-enabled AI systems process audio input that may originate from a microphone, an uploaded audio file, or a real-time audio stream. Each source presents a distinct injection surface.
Ultrasonic voice injection transmits commands at frequencies above 20 kHz, inaudible to humans but within the processing range of many microphone systems and voice AI models. The DolphinAttack research by Zhang et al. demonstrated successful injection of inaudible voice commands against Siri, Google Assistant, Alexa, and Cortana, causing those systems to place calls, open URLs, and activate device features without the user hearing anything.
Adversarial audio perturbations modify audio signals to change the transcription a speech-to-text system produces without meaningfully altering the audio's perceptible content. An utterance that sounds like a normal sentence to a human listener can be crafted to transcribe as a different sentence entirely, including one containing adversarial instructions.
Background audio injection exploits voice AI agents that continuously process ambient audio. Repello's research on voice AI prompt injection demonstrates how adversarial instructions embedded in background noise, recorded audio, or broadcast media can be processed by always-on voice agents as task instructions, executing actions the user did not intend.
Named incident: The DolphinAttack (Zhang et al., CCS 2017) demonstrated inaudible voice injection against six commercial voice assistant platforms, achieving successful command execution including activating airplane mode, calling premium-rate numbers, and initiating device unlock sequences, all without the device owner hearing any command.
What multi-modal guardrails must do
Effective multi-modal guardrail architecture requires controls at three levels that text-only architectures do not need.
Modality-specific input inspection. Each input modality requires its own detection layer tuned to that modality's attack patterns. Text inspection uses semantic analysis, injection-pattern matching, and token-level anomaly detection. Image inspection requires adversarial perturbation detection, OCR-based extraction and analysis of text content within images, and steganography detection for embedded instruction signals. Audio inspection requires transcription-layer validation, ultrasonic frequency monitoring, and adversarial audio perturbation detection.
These are not interchangeable: a text semantic classifier applied to audio spectrogram data does not produce meaningful signals. Each modality needs domain-appropriate tooling.
Cross-modal consistency validation. When a model receives inputs across multiple modalities simultaneously, the guardrail layer should validate consistency between them. A text prompt claiming "this is an internal admin message" accompanied by an image showing an administrative dashboard should not receive elevated trust purely because the image supports the text's claim. The combined context should be treated with the trust level appropriate to the weakest modality in the input set.
Unified trust hierarchy. All modalities must feed into a single, consistent trust model. An image-derived text instruction should not receive higher trust than a user-typed instruction simply because it arrived through the vision component rather than the text input field. Trust is determined by the verified identity of the input source, not by the path it took through the model's processing pipeline.
How ARGUS covers all three modalities
ARGUS is Repello's runtime security layer for production AI deployments. Its multi-modal coverage operates across all three input surfaces through a unified policy enforcement architecture.
For text inputs, ARGUS Policy Rules apply context-aware injection detection and semantic classification to both direct user inputs and to text extracted from other modalities, including OCR output from image processing and transcriptions from audio. This ensures that typographic injection through images and transcribed voice injection both pass through the same text-layer policy enforcement as direct prompt input.
For image inputs, ARGUS inspects image content before it enters the model's context: detecting embedded text instructions through OCR analysis, flagging anomalous pixel-level patterns consistent with adversarial perturbation techniques, and applying steganography detection to images submitted through public-facing interfaces.
For audio inputs, ARGUS monitors transcription output for injection-pattern signatures and applies frequency analysis to raw audio for ultrasonic command detection before transcription occurs.
The cross-modal layer correlates signals across all three input modalities, surfacing cases where inputs combine to create context that would not be flagged by any single-modality inspection in isolation. All policy decisions, including blocked inputs, flagged combinations, and passed content, are logged with full session context through ARGUS's audit infrastructure for incident investigation.
See ARGUS multi-modal coverage in your deployment.
Frequently asked questions
What is multi-modal AI security?
Multi-modal AI security is the practice of protecting AI systems that accept and process more than one input type, typically text, images, and audio, from adversarial attacks specific to each modality. It extends beyond text-focused LLM security to cover adversarial image perturbations, typographic injection via images, ultrasonic voice injection, and adversarial audio transcription manipulation. Each modality has a distinct attack surface requiring modality-specific detection controls, and cross-modal interaction creates additional attack vectors not present in any single modality alone.
Why do text-only guardrails fail on multi-modal AI systems?
Text guardrails are trained and calibrated on text-format adversarial inputs. They cannot detect adversarial pixel perturbations in images, text instructions embedded in photographs, ultrasonic voice commands below the threshold of human hearing, or adversarial audio perturbations that alter transcription output. A multi-modal system that routes all inputs through a text classifier is applying the wrong detection model to the image and audio modalities, producing no meaningful security signal for those attack surfaces.
What is a typographic injection attack?
A typographic injection attack embeds adversarial text instructions within an image rather than in the text input field. When a multi-modal model's vision component reads the image, it extracts and processes the embedded text as instruction-bearing content. Because the instruction arrived through the image modality rather than the text input channel, text-layer input filters do not inspect it. Researchers demonstrated this class of attack against GPT-4V in 2023, showing that instructions written on paper and photographed caused the model to follow them as if they had been typed.
What is ultrasonic voice injection?
Ultrasonic voice injection transmits voice commands at frequencies above 20 kHz, which are inaudible to humans but detectable by microphone hardware and processable by voice AI systems. The DolphinAttack (Zhang et al., CCS 2017) demonstrated successful injection of inaudible commands against Siri, Google Assistant, and other major voice assistant platforms, achieving device control actions without the user hearing anything. Defense requires frequency-domain filtering that identifies ultrasonic content before it reaches the transcription and AI processing pipeline.
How should enterprises approach multi-modal AI red teaming?
Multi-modal red teaming requires test coverage across all active modalities: text injection testing using standard prompt injection and jailbreak techniques; image testing using adversarial perturbation tools, typographic injection, and embedded instruction attacks; and audio testing using adversarial transcription manipulation and ultrasonic injection probes. Cross-modal attack chains, where inputs across multiple modalities combine to create adversarial context, require specific test scenarios that single-modality testing does not cover. Tooling must be modality-appropriate; generic LLM red team tools do not produce adversarial images or crafted audio.
Which OWASP LLM Top 10 categories apply to multi-modal systems?
Prompt injection (LLM01) applies directly across all modalities: typographic injection via images and adversarial transcription via audio are both variants of indirect prompt injection. Sensitive information disclosure (LLM02) applies if the model processes images or audio containing private information and can be induced to reproduce it. Improper output handling (LLM05) applies if model outputs derived from image or audio content are passed to downstream systems without sanitization. Excessive agency (LLM06) is particularly relevant for voice-activated agentic systems where audio injection can trigger real-world actions.
Share this blog
Subscribe to our newsletter











