Guardrails
Context
Your AI assistant for customer service agents has been live for three weeks. Then it happens: a customer asks about a sensitive medical topic, and the assistant provides detailed medical advice — without any referral to a doctor.
The next day, another case escalates: an agent wanted to answer a perfectly legitimate question about pregnancy-related insurance benefits, but the content filter blocks everything containing “pregnancy.” The agent types the answer manually — and never uses the AI assistant for this topic again.
Two failure modes. One system. Welcome to the guardrails dilemma.
Concept
What guardrails are (and aren’t)
Guardrails are technical and product mechanisms that constrain AI behavior within acceptable boundaries. They are not censorship — they are product requirements expressed as constraints. A calculator won’t let you divide by zero. A banking app won’t let you transfer negative amounts. AI guardrails are the same concept applied to probabilistic systems.
Three categories
Technical guardrails — input/output filters, content classifiers, safety models:
- Input rails: content classification, PII detection, jailbreak detection, topic control
- Output rails: fact-checking against sources, content safety filtering, format validation, confidence thresholds
Product guardrails — usage limits, feature restrictions, user-facing policies
Operational guardrails — monitoring, alerting, human-in-the-loop escalation
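The split between input and output rails can be sketched as a small pipeline. The following is a minimal illustration, not a production filter: the PII pattern, topic list, and function names are all invented for this example, and a real system would use trained classifiers rather than regexes and keyword sets.

```python
import re

# Illustrative input rails: each check is a lightweight layer that records
# a finding instead of hard-failing, so the caller decides how to react.
PII_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")  # e.g. US SSN-style numbers
BLOCKED_TOPICS = {"weapons manufacturing", "self-harm instructions"}

def input_rails(prompt: str) -> dict:
    """Run each input rail in turn and collect findings."""
    findings = []
    if PII_PATTERN.search(prompt):
        findings.append("pii_detected")
    lowered = prompt.lower()
    if any(topic in lowered for topic in BLOCKED_TOPICS):
        findings.append("blocked_topic")
    # A real system would add jailbreak detection, topic classifiers, etc.
    return {"allow": not findings, "findings": findings}

# A prompt containing PII is flagged; an ordinary question passes.
print(input_rails("My SSN is 123-45-6789"))
print(input_rails("What does my policy cover?"))
```

Output rails follow the same shape, but run on the model's response (fact-checking against sources, format validation, confidence thresholds) before it reaches the user.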
Tools and frameworks
| Tool | Approach | Key feature |
|---|---|---|
| NVIDIA NeMo Guardrails | Open-source toolkit | Colang DSL for defining rails; “Adopt” status on ThoughtWorks Radar |
| Guardrails AI | Open-source framework | Validator-based; 100+ pre-built validators |
| Llama Guard | Safety classifier | Meta’s content safety model; open weights |
| Azure AI Content Safety | Cloud service | Enterprise-grade; integrates with Azure OpenAI |
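As an illustration of the first row, a topic rail in NeMo Guardrails is written in the Colang DSL. The fragment below is a sketch in Colang 1.0 syntax; the example utterances and flow name are made up:

```
define user ask medical advice
  "what medication should I take for this?"
  "can you diagnose my symptoms?"

define bot refer to doctor
  "I can't give medical advice; please consult a doctor."

define flow medical advice
  user ask medical advice
  bot refer to doctor
```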
The over-blocking problem
The most common failure mode isn’t being too permissive — it’s being too restrictive. Over-blocking:
- Frustrates users who then switch to unmonitored alternatives (shadow AI)
- Blocks legitimate use cases (academic research, medical terminology, security research)
- Erodes trust (“this tool is useless for my real work”)
- Trains users to rephrase dishonestly to bypass filters
Framework
The Guardrails Calibration Guide — balancing safety and utility:
| Principle | Implementation | Measurement |
|---|---|---|
| Tunable thresholds | Different contexts need different strictness levels | Block rate per context |
| Layered approach | Multiple lightweight checks instead of one aggressive filter | False-positive rate per layer |
| Context-aware filtering | Same word, different meaning depending on context | Context-dependent accuracy |
| User feedback loops | Users report both over-blocking and under-blocking | Feedback volume and trends |
| Graceful degradation | When uncertain: output with warning instead of blocking | Warning-to-block ratio |
The golden rule: Measure both block rate AND user satisfaction. Neither metric alone tells the full story.
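Two rows of the table — tunable thresholds and graceful degradation — combine naturally: give each deployment context its own cutoffs, and insert a "deliver with warning" band between allow and block. A minimal sketch (the threshold values and context names are placeholders, not recommendations):

```python
# Per-context strictness: (warn_above, block_above) on a risk score in [0, 1].
THRESHOLDS = {
    "customer_facing": (0.5, 0.8),   # stricter: output goes straight to customers
    "internal_draft":  (0.7, 0.95),  # looser: a human reviews before sending
}

def route(risk_score: float, context: str) -> str:
    """Map a content-safety risk score to an action for the given context.
    Instead of one hard cutoff, uncertain cases pass with a warning."""
    warn_above, block_above = THRESHOLDS[context]
    if risk_score >= block_above:
        return "block"
    if risk_score >= warn_above:
        return "warn"   # graceful degradation: deliver the output, flag it
    return "allow"

# The same score routes differently depending on context.
print(route(0.85, "customer_facing"))  # block
print(route(0.85, "internal_draft"))   # warn
```

The warning-to-block ratio from the table then falls out of monitoring how often each branch fires per context.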
Scenario
You’re the PM of an AI writing assistant for insurance advisors. The assistant helps draft customer letters. After launch, the following data comes in:
Week 1-4 metrics:
- 12,000 generated letters per week
- Block rate: 23% of all requests are filtered
- Support tickets “AI blocked my request”: 340 per week
- After manual review: 89% were legitimate requests (false positives)
- User satisfaction score: 3.1/5 (target was 4.0+)
- Actual problematic outputs (found through QA sampling): 0.3% of non-blocked requests
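The scenario's numbers can be sanity-checked with a few lines of arithmetic (weekly figures taken from the metrics above):

```python
# Weekly figures from the scenario.
requests = 12_000
block_rate = 0.23                 # share of requests filtered
false_positive_share = 0.89       # of blocked requests, per manual review
problem_rate = 0.003              # problematic outputs among non-blocked requests

blocked = requests * block_rate                   # about 2,760 blocked per week
false_positives = blocked * false_positive_share  # about 2,456 legitimate requests blocked
true_positives = blocked - false_positives        # about 304 justified blocks

# Roughly eight legitimate requests blocked for every justified block,
# and a block rate about 77 times the observed problem rate.
print(round(false_positives / true_positives, 1))  # 8.1
print(round(block_rate / problem_rate))            # 77
```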
The Head of Compliance says: “The block rate needs to stay high — better to over-filter than under-filter.” The Head of Sales says: “Advisors aren’t using the tool anymore.”
Decide
How would you decide?
The best decision: Recalibrate the guardrails — not loosen them, but make them more precise. Specifically: introduce context-aware filtering that recognizes insurance terminology as legitimate context.
Why:
- 89% false positives means the filter is broken, not too strict. It’s hitting the wrong targets
- A 23% block rate against a 0.3% actual-problem rate means the filter blocks roughly 77 times more traffic than the risk it targets; with 89% false positives, that works out to about eight legitimate requests blocked for every justified one. That’s not a safety feature, it’s a broken filter
- Frustrated users switch to unmonitored tools, which INCREASES risk instead of reducing it
- The solution is a layered approach: broad filter for clearly problematic content + context-aware filter for domain terminology + user feedback loop for edge cases
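The context-aware layer in that last point can be sketched in a few lines. The term lists below are invented for illustration, and a real system would use classifiers rather than keyword sets; the point is only the shape: a sensitive term alone triggers a block, but the same term surrounded by domain vocabulary passes.

```python
# Illustrative two-layer filter: a broad sensitive-term check plus a
# context layer that recognizes legitimate insurance terminology.
SENSITIVE_TERMS = {"pregnancy", "cancer"}
DOMAIN_CONTEXT_TERMS = {"policy", "coverage", "benefit", "claim", "premium"}

def should_block(text: str) -> bool:
    words = set(text.lower().split())
    if not words & SENSITIVE_TERMS:
        return False              # layer 1: nothing sensitive at all
    if words & DOMAIN_CONTEXT_TERMS:
        return False              # layer 2: sensitive term in legitimate domain context
    return True                   # no mitigating context: block or escalate

# The agent's blocked question from the opening scenario now passes.
print(should_block("pregnancy coverage under the customer's policy"))  # False
print(should_block("pregnancy"))                                       # True
```

Edge cases that neither layer handles cleanly are exactly what the user feedback loop is for.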
What many get wrong: Deferring to the compliance team and keeping the high block rate — without understanding that over-blocking itself is a security risk (shadow AI).
Reflect
The safest AI product is one that people actually use within its guardrails — not one that drives them to unmonitored alternatives.
- Guardrails are not a one-time implementation — adversarial users constantly find new bypass techniques
- Provider guardrails are generic; your product has domain-specific risks that require product-specific guardrails
- More guardrails doesn’t automatically mean more safety — precision beats aggression
Sources: NVIDIA NeMo Guardrails Documentation, Guardrails AI, ThoughtWorks Technology Radar, Obsidian Security — AI Guardrails Analysis