What are AI Guardrails?
AI guardrails are runtime controls — typically classifiers, rule engines, or smaller policy models — that inspect the inputs and outputs of an LLM application to enforce safety, security, and compliance policies the underlying model cannot guarantee on its own. Where the foundation model's safety training is best-effort and probabilistic, guardrails are an enforcement layer the application owner explicitly defines, controls, and can update without touching the model.
What guardrails actually do
A typical guardrail layer sits between the user and the model on input, and between the model and the user on output:
user input → [input guardrail] → LLM → [output guardrail] → response

Input-side guardrails screen for:
- Prompt injection patterns and known jailbreak templates
- Personally identifiable information the application shouldn't accept
- Off-topic queries the deployment isn't supposed to serve
- Content categorization (toxicity, hate speech, illegal-activity requests)
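To make the shape concrete, here is a minimal input-side sketch in Python. Everything in it is illustrative: `screen_input`, the regexes, and the marker list are invented for this example, and a production layer would swap each check for a trained classifier call.

```python
import re
from dataclasses import dataclass

# Illustrative patterns only: real deployments use trained classifiers
# and maintained jailbreak corpora, not a handful of regexes.
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[A-Za-z]{2,}\b"),
}
JAILBREAK_MARKERS = (
    "ignore previous instructions",
    "ignore all prior instructions",
)

@dataclass
class Verdict:
    allowed: bool
    reason: str = ""

def screen_input(text: str) -> Verdict:
    """Run input checks in order; the first failing check blocks the request."""
    lowered = text.lower()
    for marker in JAILBREAK_MARKERS:
        if marker in lowered:
            return Verdict(False, f"jailbreak marker: {marker!r}")
    for label, pattern in PII_PATTERNS.items():
        if pattern.search(text):
            return Verdict(False, f"inbound PII: {label}")
    # Topic and toxicity screening would go here (classifier calls in production).
    return Verdict(True)

print(screen_input("My SSN is 123-45-6789"))
# Verdict(allowed=False, reason='inbound PII: ssn')
```

The design point is the ordering and the fail-closed verdict, not the patterns: checks run cheapest-first, and the first failure blocks the request with a reason the application can log.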
Output-side guardrails screen for:
- Hallucinated factual claims, when grounding sources are available
- Policy-violating responses (off-topic, off-brand, off-policy)
- Sensitive data leakage (system prompts, internal documentation, PII)
- Tool-call validity (arguments outside allowed schemas, calls violating per-tool rate limits)
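The tool-call check is the most mechanical item on this list, so here is a sketch of it using the `jsonschema` package; the `get_weather` tool and its schema are invented for illustration.

```python
from jsonschema import ValidationError, validate

# Hypothetical per-tool argument schemas, owned by the application rather
# than trusted from the model.
TOOL_SCHEMAS = {
    "get_weather": {
        "type": "object",
        "properties": {"city": {"type": "string", "maxLength": 80}},
        "required": ["city"],
        "additionalProperties": False,  # reject smuggled extra arguments
    },
}

def screen_tool_call(name: str, arguments: dict) -> bool:
    """Allow a model-proposed tool call only if the tool is known and
    its arguments match the declared schema exactly."""
    schema = TOOL_SCHEMAS.get(name)
    if schema is None:
        return False  # unknown tool: fail closed
    try:
        validate(instance=arguments, schema=schema)
    except ValidationError:
        return False
    return True
```

Failing closed on unknown tools matters: a manipulated model that invents a tool name should hit a wall, not a default-allow path.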
Common guardrail products
- NVIDIA NeMo Guardrails — open-source, programmable in Colang
- Guardrails AI — open-source, schema-validating output guardrails
- AWS Bedrock Guardrails — managed, integrates with Bedrock-hosted models
- OpenAI Moderation API — content classification only, content-policy focused
- Repello ARGUS — runtime AI security layer with prompt-injection, jailbreak, and tool-abuse detection
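As one concrete integration, a sketch of calling the OpenAI Moderation API as an input-side classifier. It assumes the v1 `openai` Python SDK with an `OPENAI_API_KEY` in the environment; relying only on the top-level `flagged` field is a simplification, since the response also carries per-category results for finer-grained policy.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def passes_moderation(text: str) -> bool:
    """Return True if the moderation endpoint considers the text safe."""
    resp = client.moderations.create(
        model="omni-moderation-latest",
        input=text,
    )
    result = resp.results[0]
    # result.categories holds per-category booleans (hate, self-harm, ...)
    # for applications whose policy is stricter in some categories than others.
    return not result.flagged
```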
What guardrails don't solve
Guardrails are necessary but not sufficient. Three documented limitations:
- Encoding bypasses — guardrails that classify pre-tokenization text miss payloads encoded in base64, Unicode variation selectors, zero-width characters, or foreign-language scripts that the underlying model still interprets correctly (a partial mitigation is sketched after this list).
- Multi-turn evasion — single-message guardrails can't detect attacks that build context across many turns.
- Coverage gaps — every guardrail's training distribution has blind spots; novel attack patterns appear faster than signature updates ship.
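To illustrate the first gap, here is the kind of normalization pass a guardrail can run before classification so it sees something closer to what the model will actually interpret. The function and patterns are hypothetical, and the technique narrows the encoding gap without closing it — which is exactly the limitation described above.

```python
import base64
import re
import unicodedata

# Zero-width characters and variation selectors commonly used to smuggle
# payloads past classifiers that read the raw, pre-tokenization text.
ZERO_WIDTH = re.compile(r"[\u200b-\u200f\u2060\ufe00-\ufe0f]")
BASE64_RUN = re.compile(r"[A-Za-z0-9+/]{24,}={0,2}")

def normalize_for_screening(text: str) -> str:
    """Canonicalize text before it reaches the input classifier."""
    text = unicodedata.normalize("NFKC", text)  # fold lookalike forms
    text = ZERO_WIDTH.sub("", text)             # drop invisible characters
    # Surface plausible base64 payloads so the classifier sees them too.
    for run in BASE64_RUN.findall(text):
        try:
            decoded = base64.b64decode(run, validate=True).decode("utf-8")
            text += f"\n[decoded base64: {decoded}]"
        except ValueError:
            pass  # not valid base64/UTF-8; leave the run untouched
    return text
```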
The right framing: guardrails raise the cost of attack and catch the bottom 90% of probes, but a determined attacker will eventually route around any single layer. Defense-in-depth (input + output + abuse detection + continuous testing) is the real posture.