What are AI Guardrails?
AI guardrails are runtime controls — typically classifiers, rule engines, or smaller policy models — that inspect the inputs and outputs of an LLM application to enforce safety, security, and compliance policies the underlying model cannot guarantee on its own. Where the foundation model's safety training is best-effort and probabilistic, guardrails are an enforcement layer the application owner explicitly defines, controls, and can update without touching the model.
What guardrails actually do
A typical guardrail layer sits between the user and the model on input, and between the model and the user on output:
user input → [input guardrail] → LLM → [output guardrail] → response

Input-side guardrails screen for:
- Prompt injection patterns and known jailbreak templates
- Personally identifiable information the application shouldn't accept
- Off-topic queries the deployment isn't supposed to serve
- Content categorization (toxicity, hate speech, illegal-activity requests)
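To make the shape concrete, here is a minimal input-side sketch in Python. Everything in it is illustrative: `screen_input`, the regexes, and the marker list are invented for this example, and a production layer would swap each check for a trained classifier call.

```python
import re
from dataclasses import dataclass

# Illustrative patterns only: real deployments use trained classifiers
# and maintained jailbreak corpora, not a handful of regexes.
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[A-Za-z]{2,}\b"),
}
JAILBREAK_MARKERS = (
    "ignore previous instructions",
    "ignore all prior instructions",
)

@dataclass
class Verdict:
    allowed: bool
    reason: str = ""

def screen_input(text: str) -> Verdict:
    """Run input checks in order; the first failing check blocks the request."""
    lowered = text.lower()
    for marker in JAILBREAK_MARKERS:
        if marker in lowered:
            return Verdict(False, f"jailbreak marker: {marker!r}")
    for label, pattern in PII_PATTERNS.items():
        if pattern.search(text):
            return Verdict(False, f"inbound PII: {label}")
    # Topic and toxicity screening would go here (classifier calls in production).
    return Verdict(True)

print(screen_input("My SSN is 123-45-6789"))
# Verdict(allowed=False, reason='inbound PII: ssn')
```

The design point is the ordering and the fail-closed verdict, not the patterns: checks run cheapest-first, and the first failure blocks the request with a reason the application can log.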
Output-side guardrails screen for:
- Hallucinated factual claims, when grounding sources are available
- Policy-violating responses (off-topic, off-brand, off-policy)
- Sensitive data leakage (system prompts, internal documentation, PII)
- Tool-call validity (arguments outside allowed schemas, calls violating per-tool rate limits)
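The tool-call check is the most mechanical item on this list, so here is a sketch of it using the `jsonschema` package; the `get_weather` tool and its schema are invented for illustration.

```python
from jsonschema import ValidationError, validate

# Hypothetical per-tool argument schemas, owned by the application rather
# than trusted from the model.
TOOL_SCHEMAS = {
    "get_weather": {
        "type": "object",
        "properties": {"city": {"type": "string", "maxLength": 80}},
        "required": ["city"],
        "additionalProperties": False,  # reject smuggled extra arguments
    },
}

def screen_tool_call(name: str, arguments: dict) -> bool:
    """Allow a model-proposed tool call only if the tool is known and
    its arguments match the declared schema exactly."""
    schema = TOOL_SCHEMAS.get(name)
    if schema is None:
        return False  # unknown tool: fail closed
    try:
        validate(instance=arguments, schema=schema)
    except ValidationError:
        return False
    return True
```

Failing closed on unknown tools matters: a manipulated model that invents a tool name should hit a wall, not a default-allow path.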
Common guardrail products
- NVIDIA NeMo Guardrails — open-source, programmable in Colang
- Guardrails AI — open-source, schema-validating output guardrails
- AWS Bedrock Guardrails — managed, integrates with Bedrock-hosted models
- OpenAI Moderation API — content classification only, content-policy focused
- Repello ARGUS — runtime AI security layer with prompt-injection, jailbreak, and tool-abuse detection
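As one concrete integration, a sketch of calling the OpenAI Moderation API as an input-side classifier. It assumes the v1 `openai` Python SDK with an `OPENAI_API_KEY` in the environment; relying only on the top-level `flagged` field is a simplification, since the response also carries per-category results for finer-grained policy.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def passes_moderation(text: str) -> bool:
    """Return True if the moderation endpoint considers the text safe."""
    resp = client.moderations.create(
        model="omni-moderation-latest",
        input=text,
    )
    result = resp.results[0]
    # result.categories holds per-category booleans (hate, self-harm, ...)
    # for applications whose policy is stricter in some categories than others.
    return not result.flagged
```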
What guardrails don't solve
Guardrails are necessary but not sufficient. Three documented limitations:
- Encoding bypasses — guardrails that classify pre-tokenization text miss payloads encoded in base64, Unicode variation selectors, zero-width characters, or foreign-language scripts that the underlying model still interprets correctly (a partial mitigation is sketched after this list).
- Multi-turn evasion — single-message guardrails can't detect attacks that build context across many turns.
- Coverage gaps — every guardrail's training distribution has blind spots; novel attack patterns appear faster than signature updates ship.
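To illustrate the first gap, here is the kind of normalization pass a guardrail can run before classification so it sees something closer to what the model will actually interpret. The function and patterns are hypothetical, and the technique narrows the encoding gap without closing it — which is exactly the limitation described above.

```python
import base64
import re
import unicodedata

# Zero-width characters and variation selectors commonly used to smuggle
# payloads past classifiers that read the raw, pre-tokenization text.
ZERO_WIDTH = re.compile(r"[\u200b-\u200f\u2060\ufe00-\ufe0f]")
BASE64_RUN = re.compile(r"[A-Za-z0-9+/]{24,}={0,2}")

def normalize_for_screening(text: str) -> str:
    """Canonicalize text before it reaches the input classifier."""
    text = unicodedata.normalize("NFKC", text)  # fold lookalike forms
    text = ZERO_WIDTH.sub("", text)             # drop invisible characters
    # Surface plausible base64 payloads so the classifier sees them too.
    for run in BASE64_RUN.findall(text):
        try:
            decoded = base64.b64decode(run, validate=True).decode("utf-8")
            text += f"\n[decoded base64: {decoded}]"
        except ValueError:
            pass  # not valid base64/UTF-8; leave the run untouched
    return text
```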
The right framing: guardrails raise the cost of attack and catch the bottom 90% of probes, but a determined attacker will eventually route around any single layer. Defense-in-depth (input + output + abuse detection + continuous testing) is the real posture.