Back to all blogs

|
|
7 min read


TL;DR: Enterprise AI deployments serving global users are exposed to a class of attacks that English-only guardrails cannot detect. Adversarial inputs in low-resource languages, cross-lingual prompt injection, and script-switching bypass techniques all exploit the coverage gap between a model's multilingual capability and the narrow language profile of its safety layer. Effective multilingual AI security requires language-agnostic policy enforcement, not per-language filter lists.
The coverage gap no one talks about
When a security team evaluates an LLM guardrail solution, they test it in English. The red team runs English jailbreaks. The FP/FN benchmarks use English datasets. The guardrail vendor's demo runs in English.
The production system serves users in 47 languages.
This is the multilingual security gap, and it is one of the most consistently exploited weaknesses in deployed AI systems. The model itself has multilingual capability baked into its weights: GPT-4, Claude, Gemini, and Llama 3 all process dozens of languages natively. The safety layer wrapped around the model frequently does not.
The result is a system where an adversarial user who cannot bypass the English guardrail can often succeed by switching to Swahili, Bengali, or Cebuano. The model processes the request. The guardrail does not recognize the attack pattern. The output is generated and passed through.
"The multilingual attack surface scales with the model's language coverage, not the guardrail's," says the Repello AI Research Team. "If your model handles 100 languages and your guardrail was evaluated on 3, you have 97 unaudited attack surfaces."
How multilingual bypass attacks work
Direct language switching
The simplest variant: the user takes a jailbreak or policy-violating prompt that fails in English and retranslates it into another language. Machine translation is free and instantaneous. The semantic content of the attack is preserved. The guardrail's pattern-matching layer, trained primarily on English adversarial examples, does not fire.
This is not theoretical. Repello's red team routinely uses language switching as a first-pass bypass technique during assessments. Against guardrails that lack multilingual coverage, success rates on language-switched attacks frequently exceed 60% for prompts that are blocked in English.
Script-switching and transliteration
A more sophisticated variant exploits the gap between a language and its written script. Arabic content can be written in Arabic script, Latin transliteration (Arabizi), or Perso-Arabic script variants. Hindi content can be written in Devanagari, Roman transliteration, or mixed script. Guardrails trained on standard script representations may fail to recognize the same semantic content rendered in a non-standard script.
This also applies to code-switching, the practice of mixing two languages within a single utterance. A prompt that switches between English and Spanish mid-sentence, or between Mandarin and English, may not match the pattern profile of either language's safety filter in isolation.
Low-resource language exploitation
Large multilingual models are trained on datasets that are heavily skewed toward high-resource languages: English, Chinese, Spanish, French, German. Safety fine-tuning follows the same distribution. A model like GPT-4 can generate coherent output in Yoruba, Tagalog, or Amharic, but its RLHF-based safety training has far thinner coverage for those languages.
Research on multilingual safety alignment by Deng et al. (2023) demonstrated systematic safety degradation in low-resource languages across multiple major LLMs, finding that policy-violating requests were significantly more likely to succeed when submitted in languages underrepresented in safety fine-tuning data. The attack requires no technical sophistication: identifying the weak-coverage language is the entire technique.
Cross-lingual indirect prompt injection
In RAG-enabled and agentic deployments, indirect prompt injection embeds adversarial instructions in external content the model retrieves. The multilingual variant: adversarial instructions are embedded in a document or web page in a different language than the application's operating language. If the model is instructed to summarize a French-language document, adversarial instructions embedded in that French text may be processed as task instructions, even if the application's system prompt and safety context are in English.
The guardrail layer, monitoring for injection patterns in the application's primary language, does not inspect the multilingual retrieved content with equivalent coverage.
Why per-language filter lists do not scale
The intuitive response to the multilingual coverage gap is to build or procure per-language filters: a Spanish version of the English guardrail, a Mandarin version, and so on. This approach fails for several reasons.
Coverage is combinatorial, not additive. A deployment that receives inputs in 30 languages does not need 30 filters. It needs coverage for code-switching combinations, transliteration variants, and script mixing across all language pairs in the user population. Per-language filters leave all of these intersections unaddressed.
Low-resource languages are the primary attack vector. Per-language filter development concentrates on high-resource languages because that is where training data exists. The attack surface, however, exploits the long tail of low-resource languages. Building a strong Spanish filter while leaving Swahili unaddressed inverts the risk profile.
Maintenance burden is unbounded. Adversarial attack techniques evolve continuously. Maintaining synchronized adversarial coverage across dozens of individual language-specific filters is operationally infeasible for most security teams. A single policy update that closes a new attack vector in English must be propagated to all per-language variants.
Translation-invariant attacks require translation-invariant defenses. The core problem is not that the attack is in a different language; it is that the semantic content of the attack is preserved across language boundaries. Effective detection needs to operate on meaning, not on surface-form pattern matching in a particular language.
What multilingual AI security requires
Three capabilities distinguish a guardrail architecture that covers the multilingual attack surface from one that does not.
Language-agnostic semantic analysis. Policy enforcement should operate on the semantic content of an input rather than its surface form in a particular language. A request to generate synthesis instructions for a dangerous compound carries the same policy violation regardless of whether it is written in English, Japanese, or Ukrainian. A semantic classifier trained on multilingual adversarial data can recognize the violation in all three.
Cross-lingual consistency. The same request submitted in different languages should produce consistent policy outcomes. If a request is blocked in English, it should be blocked in all other languages the model supports. Consistency testing across languages is a standard component of multilingual security evaluation: submitting identical policy-violating requests in a range of languages and measuring whether the outcome is consistent. Inconsistency is direct evidence of coverage gaps.
Monitoring for language-switching signals. A user who submits several requests in English and then switches to a different language mid-session is exhibiting a pattern worth flagging for additional inspection. Language switching is not inherently adversarial, but it is a recognized bypass technique. Behavioral monitoring that surfaces language-switching patterns gives security teams visibility into potential bypass attempts without requiring them to manually audit every multilingual session.
How ARGUS handles multilingual deployments
ARGUS was built for enterprise deployments where the user population is global. Its policy enforcement architecture does not rely on per-language filter lists.
ARGUS Policy Rules apply semantic classification to inputs regardless of language. The underlying detection models are trained on multilingual adversarial datasets covering injection, jailbreak, and policy-violation patterns across all major language families. An indirect prompt injection embedded in a retrieved Arabic document is inspected with the same policy coverage as an injection in an English document.
For code-switching and transliteration inputs, ARGUS applies normalization before semantic analysis: inputs in non-standard scripts or mixed-language representations are processed through a normalization layer that maps them to standard form for classification. This closes the script-switching bypass path without requiring separate filter configurations per script variant.
ARGUS's cross-lingual consistency validation runs as part of deployment testing: a configurable test suite submits identical policy-violating probes in a user-defined set of languages and reports consistency scores. Gaps in coverage, where a probe that fails in one language succeeds in another, are surfaced before they reach production.
The behavioral monitoring layer logs language-switching events within sessions. Security teams can configure alert thresholds and review queues for sessions that exhibit rapid language switching combined with other risk signals.
All policy decisions, blocked inputs, flagged language-switching events, and consistency test results are logged through ARGUS's audit infrastructure with full session context. For enterprises subject to the EU AI Act's logging requirements or NIST AI RMF Measure 2.6 adversarial testing obligations, the multilingual audit trail is available for compliance documentation.
See ARGUS multilingual coverage in your deployment.
Related coverage
Multilingual security intersects with several other attack surfaces Repello has documented:
The multi-modal AI security guide covers how text embedded in images and audio transcriptions introduce additional language-layer bypass opportunities. Cross-modal attacks can combine image-embedded foreign-language instructions with a clean English text prompt to create adversarial context that bypasses both text-layer and image-layer inspection.
Repello's multilingual guardrail work covers the technical architecture of 100-language safety coverage in depth.
For agentic deployments that retrieve multilingual documents from external sources, indirect prompt injection in MCP and RAG pipelines is the highest-risk variant of cross-lingual attack and should be addressed in any multilingual deployment red team scope.
Frequently asked questions
What is multilingual LLM security?
Multilingual LLM security is the practice of ensuring that an AI system's safety and policy controls apply consistently across all languages the system can process, not just the language in which it was primarily evaluated. It addresses the gap between a model's multilingual generation capability and the typically narrower language coverage of its safety layer, including defenses against language-switching bypass attacks, script-switching transliteration attacks, and cross-lingual indirect prompt injection.
Why do English-only guardrails fail in multilingual deployments?
English-only guardrails are trained and evaluated on English-language adversarial inputs. They do not have pattern coverage for the same attack content expressed in other languages, in transliterated form, or in mixed-language code-switching inputs. Because large multilingual models process all of these inputs natively, adversarial users can often bypass an English-only guardrail by switching to any language outside its evaluation set.
What is a cross-lingual prompt injection attack?
A cross-lingual prompt injection attack embeds adversarial instructions in a language different from the application's primary operating language. In RAG-enabled or agentic deployments, this typically means embedding adversarial instructions in a retrieved foreign-language document. The model processes the multilingual content and may act on the embedded instructions. Safety layers inspecting only the primary-language context do not detect the injection in the retrieved document.
How do enterprises test for multilingual security gaps?
The standard methodology is cross-lingual consistency testing: submitting identical policy-violating probes (prompt injection, jailbreak attempts, prohibited content requests) in a range of languages and measuring whether the guardrail outcome is consistent. Inconsistency in outcome across languages directly identifies coverage gaps. Testing should cover both high-resource languages (Spanish, Mandarin, French, Arabic) and a sample of low-resource languages underrepresented in safety fine-tuning data (Swahili, Bengali, Tagalog, Yoruba).
Does the EU AI Act address multilingual AI safety?
The EU AI Act does not specify language coverage requirements directly, but its obligations for high-risk AI systems to implement effective human oversight measures and its cybersecurity requirements for general-purpose AI models with systemic risk both apply regardless of the language in which a system is used. A system that applies safety controls only to English-language inputs while serving users across the EU in multiple languages would be difficult to characterize as meeting the Act's effective oversight standard.
Which Repello product handles multilingual guardrail enforcement?
ARGUS, Repello's runtime security layer, handles multilingual policy enforcement through language-agnostic semantic classification, cross-lingual consistency testing, and behavioral monitoring for language-switching events. It does not rely on per-language filter lists and provides uniform policy coverage across all languages the underlying model supports.
Share this blog
Subscribe to our newsletter











