
Forced Tool Calling — How to Kill the Almost-Right Sentence in Production Chatbots

The failure mode that takes down most production conversational agents isn't hallucination — it's the sentence that sounds confident and is almost right. Here is the architecture that fixes it.

Ikki
Founder & AI Engineer at Ikki

The real failure mode

Most production chatbots don't go down with hallucinations. They go down with sentences that sound confident and are almost right.

A made-up SLA. A signature that impersonates an ops manager. An "I'll get back to you by 5pm" promise the system can't keep. A fake escalation reference number. A subtly wrong policy paraphrase.

These are the failures that erode trust. They're hard to detect with evals (the sentence parses, scores reasonably on faithfulness, and even sounds correct to a casual reviewer) and they're impossible to bound with prompt rules. The model has been trained on millions of confident sentences. Asking it nicely not to write one is not a control.

Across the conversational agents we've shipped to production — voice and text, customer-facing and operator-facing — we converged on a single architectural answer for high-stakes contexts: don't let the LLM write the user-facing prose at all.

This article is the pattern, the trade-offs, and the production code path.

Why prompt rules don't fix it

The instinctive fix when an "almost-right" sentence ships in prod is to add a rule to the system prompt:

Never promise a callback time. Never sign messages. Never invent SLAs. If unsure, say "I'll escalate to a human."

This works for a week. Then it breaks. The model is great at producing prose that feels compliant — it'll sign with a generic role ("the team", "your support agent") that wasn't on the forbidden list. It'll promise a "fast response" instead of a specific time. It'll paraphrase the policy in a way that's 90% right.

You can stack rules forever. You're playing whack-a-mole with a model that has more ways to bend the rules than you have time to enumerate.

The structural fix is to remove the LLM's ability to write user-facing text at all.

The architecture in one sentence

The LLM's only job is to pick a tool from a closed palette and provide structured arguments. The reply that lands on the user's screen comes from a deterministic template filled with those arguments. Versioned, reviewable, A/B-able, structurally impossible to hallucinate.

This pattern has four layers in production:

  1. A deterministic gate for trivial cases (acknowledgments, salutations) so we don't burn tokens on "ok merci 👍"
  2. A forced tool-calling LLM with a closed palette (one tool per turn, no free text)
  3. A deterministic template layer that turns the chosen tool and its structured arguments into the user-facing prose
  4. A second-model safety re-check when the chosen tool is "stay silent", to catch suppressed signals

Each layer is observable, swappable, and adds a specific guarantee.

Layer 1 — deterministic regex gate

A meaningful share of messages in any chat-ops product are pure ACKs: "ok", "merci", "👍", "compris", "thanks". Sending these to an LLM is wasteful — both in tokens and in latency. A small regex-based classifier catches them with zero false positives:

const ACK_PATTERNS = [
  /^(ok|okay|d'accord|compris|merci|thanks?)\s*[!.👍🙏]*$/i,
  /^(super|parfait|nickel|cool|great|perfect)\s*[!.]*$/i,
  /^👍+$/,
  /^[\p{Extended_Pictographic}\s]+$/u, // emoji-only (\p{Emoji} would also match plain digits)
]

function maybeAck(text: string): boolean {
  const trimmed = text.trim()
  if (trimmed.length > 30) return false
  return ACK_PATTERNS.some(p => p.test(trimmed))
}

When this returns true, we route directly to a silent_acknowledge outcome — no token spent, no latency added. In our deployments this catches 15–25% of incoming messages depending on the surface.

The threshold matters. We cap at ~30 chars to avoid swallowing messages like "ok mais j'ai besoin de…" where the ACK is a discourse marker, not the whole message. Tune this conservatively — a false positive here means a real signal gets silently dropped. Better to send too much to the LLM than too little.

Layer 2 — forced tool calling

This is the core of the pattern. We use Anthropic's tool_choice: { type: 'any' } to force the model to call exactly one tool per turn. Free-text output is structurally disallowed by the API.

const response = await anthropic.messages.create({
  model: 'claude-sonnet-4-6',
  max_tokens: 1024,
  tool_choice: { type: 'any', disable_parallel_tool_use: true }, // exactly one tool call per turn
  tools: [
    { name: 'silent_acknowledge', description: '...', input_schema: { ... } },
    { name: 'respond_procedure', description: '...', input_schema: { ... } },
    { name: 'respond_redirect', description: '...', input_schema: { ... } },
    { name: 'escalate_question', description: '...', input_schema: { ... } },
    { name: 'escalate_perturbation', description: '...', input_schema: { ... } },
    // ... typically 8–12 tools per agent
  ],
  system: SYSTEM_PROMPT_WITH_RULES, // cached
  messages: conversationHistory,
})

The palette is closed. Adding a new behavior means shipping a new tool, with a new schema, a new template, and (ideally) new evals. Drift cannot happen by accident.
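
For reference, here's one palette entry with the elided input_schema written out. The description wording is illustrative; the enum values mirror the template keys used further down.

const respondProcedureTool = {
  name: 'respond_procedure',
  description:
    'Answer with the standard procedure when the user asks how to complete a step: ' +
    'a missing document, a schedule change. Never use this for pricing questions, ' +
    'emergencies, or anything that needs human judgment.',
  input_schema: {
    type: 'object',
    properties: {
      category: {
        type: 'string',
        enum: ['document_missing', 'schedule_change'],
      },
      sub_type: {
        type: 'string',
        enum: ['identity', 'contract'],
      },
    },
    required: ['category'],
  },
}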

Designing the palette

The palette design is where most of the engineering work goes. A few principles that survived production:

Distinguish "respond" from "escalate" from "stay silent". These are three categorically different outcomes from the user's perspective: the system answered me, the system told me a human will reach me, the system did nothing visible. They should never be one tool with a string argument that decides between them — the LLM will mis-route.

One tool per response template. If you have two ways of saying "the procedure is X", create two tools. The structured args (category, severity, next_step) live inside each tool, but the kind of response is encoded in the tool name. This makes routing decisions inspectable and templatable.

Tools should have non-overlapping descriptions. A description that says "use this tool when the user asks about X or Y or Z" almost always overlaps another tool. The model picks somewhat randomly between overlapping tools — exactly the case where you'll see drift week to week. Each tool's description should describe a domain that no other tool covers.

Keep the tool count below ~15. Beyond that, the model starts confusing itself and selection accuracy drops. If you need more, you almost certainly need a two-stage routing: a meta-tool ("which family of behaviors does this fall under?") then a specific tool within that family.
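
A minimal sketch of that two-stage routing, assuming a route_family meta-tool and a TOOLS_BY_FAMILY lookup (both names are made up for illustration; parseToolUse is the same extraction helper as in the safety re-check below):

// Stage 1: one meta-tool picks the family of behaviors.
const routing = await anthropic.messages.create({
  model: 'claude-sonnet-4-6',
  max_tokens: 256,
  tool_choice: { type: 'tool', name: 'route_family' },
  tools: [{
    name: 'route_family',
    description: 'Pick the family of behaviors this message falls under.',
    input_schema: {
      type: 'object',
      properties: {
        family: { type: 'string', enum: ['respond', 'escalate', 'silent'] },
      },
      required: ['family'],
    },
  }],
  system: ROUTING_RULES, // short, cached
  messages: conversationHistory,
})

// Stage 2: forced tool calling restricted to that family's tools.
const { family } = parseToolUse(routing)
const decision = await anthropic.messages.create({
  model: 'claude-sonnet-4-6',
  max_tokens: 1024,
  tool_choice: { type: 'any' },
  tools: TOOLS_BY_FAMILY[family], // 5–8 tools instead of 20+
  system: SYSTEM_PROMPT_WITH_RULES,
  messages: conversationHistory,
})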

What stays in the system prompt

The system prompt holds the rules that govern which tool to pick, never the prose that goes back to the user. Typical content:

  • The agent's role and the boundary of its authority
  • The tool-selection rules ("if the user mentions a missing document, prefer X over Y")
  • Safety floors ("if the user mentions self-harm, ignore the palette and escalate immediately to a human via escalate_emergency")
  • The hard rules: never claim to be human, never invent times, never sign as a person

Mark this whole block with cache_control: ephemeral so repeat calls within 5 minutes hit the cache. We see ~75–90% cache-hit rate on the static rules block in steady-state production traffic.
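
Concretely, with the Anthropic SDK the system prompt becomes an array of content blocks, and the static rules block carries the cache marker (STATIC_RULES_BLOCK and TOOLS are placeholder names):

const response = await anthropic.messages.create({
  model: 'claude-sonnet-4-6',
  max_tokens: 1024,
  tool_choice: { type: 'any', disable_parallel_tool_use: true },
  tools: TOOLS, // the closed palette from above
  system: [
    {
      type: 'text',
      text: STATIC_RULES_BLOCK, // role, tool-selection rules, safety floors, hard rules
      cache_control: { type: 'ephemeral' }, // everything up to here is cached (~5 min TTL)
    },
  ],
  messages: conversationHistory,
})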

Layer 3 — generating the user-facing prose

When the LLM picks respond_procedure({ category: 'document_missing', sub_type: 'identity' }), the code picks the template and fills it. Not the LLM.

const TEMPLATES = {
  respond_procedure: {
    document_missing: {
      identity: ({ ctx }) =>
        `To finalize, we still need a copy of your ID document. You can upload it directly via your account, in Profile > Documents. If you need help locating the right document, reply "how to find ID" and I'll send the steps.`,
      contract: ({ ctx }) =>
        `...`,
    },
    schedule_change: {
      // ...
    },
  },
  respond_redirect: {
    pricing_question: ({ ctx }) =>
      `For pricing details, ${ctx.accountManagerFirstName} from the team will reach out — they have the full context on your account.`,
  },
}

function render(toolName: string, args: any, ctx: Context): string {
  const path = [toolName, args.category, args.sub_type].filter(Boolean)
  const template = path.reduce((acc: any, key) => acc?.[key], TEMPLATES)
  if (typeof template !== 'function') {
    throw new TemplateNotFound({ path, args }) // logged + escalated
  }
  return template({ ctx }) // templates destructure { ctx }, so pass it wrapped
}

Five things this gets you for free:

  1. Versioning. Templates are code. Git history shows when copy changed and why.
  2. Reviewability. A non-engineer can read all possible replies in one file. Try doing that with a free-text LLM.
  3. A/B testing. Two templates per outcome, randomized at render time, measured downstream (see the sketch after this list).
  4. i18n. Localize the templates; the LLM stays the same.
  5. Structural impossibility of hallucination. The set of strings the user can receive is finite and committed.
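
On point 3, a minimal sketch of render-time randomization; the variant copy and the even split are illustrative:

// Two hypothetical variants for one outcome, picked at render time; log which one shipped.
const VARIANTS: Record<string, Array<(ctx: Context) => string>> = {
  'respond_redirect.pricing_question': [
    ctx => `For pricing details, ${ctx.accountManagerFirstName} from the team will reach out — they have the full context on your account.`,
    ctx => `${ctx.accountManagerFirstName} handles pricing for your account and will get back to you with the details.`,
  ],
}

function renderVariant(path: string, ctx: Context) {
  const variants = VARIANTS[path]
  const variant = Math.floor(Math.random() * variants.length)
  return { text: variants[variant](ctx), variant } // attach `variant` to the decision event
}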

The trade-off: tone variety is lower. A free-text chatbot has thousands of ways to phrase "we'll get back to you". A templated one has the few you wrote. In our experience this is a net positive — consistency reads as professionalism in ops contexts. For a marketing chatbot it might bite.

Layer 4 — the safety re-check on silent paths

The most dangerous failure of the forced-tool architecture is the false silence. The user writes:

"Hi, I'm sick today, can't make it. By the way thanks for the schedule, looks good 🙏"

The LLM sees the polite tail and routes the whole message to silent_acknowledge. The "I'm sick today" goes nowhere. No escalation. No reply. The operator finds out at 8am.

This is the failure mode you cannot eliminate at Layer 2 alone — even a very good Sonnet sometimes weighs the polite tail too heavily.

The fix: when the chosen tool is silent_acknowledge, run a second model (we use Haiku 4.5) on the same input with one specific question: is there an actionable signal in this message that the first model just suppressed?

if (chosenTool === 'silent_acknowledge') {
  const safetyCheck = await anthropic.messages.create({
    model: 'claude-haiku-4-5-20251001',
    max_tokens: 256,
    tool_choice: {
      type: 'tool',
      name: 'classify',
    },
    tools: [{
      name: 'classify',
      description: 'Classify whether the message contains an actionable signal.',
      input_schema: {
        type: 'object',
        properties: {
          actionable: { type: 'boolean' },
          signal: {
            type: 'string',
            enum: ['urgent', 'absent', 'reschedule', 'medical', 'none'],
          },
          confidence: { type: 'number' },
        },
        required: ['actionable', 'signal', 'confidence'],
      },
    }],
    system: SAFETY_RECHECK_SYSTEM, // small, cached
    messages: [{ role: 'user', content: userMessage }],
  })

  const result = parseToolUse(safetyCheck)
  if (result.actionable && result.confidence > 0.6) {
    return overrideToEscalation(result.signal)
  }
}
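
parseToolUse is just a thin helper that pulls the tool_use block out of the response, roughly:

function parseToolUse(response: Anthropic.Message): any {
  const block = response.content.find(b => b.type === 'tool_use')
  if (!block || block.type !== 'tool_use') {
    throw new Error('expected a tool_use block; forced tool_choice makes this near-unreachable')
  }
  return block.input // the structured args; the chosen tool's name is on block.name
}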

Haiku is the right model here for three reasons. It's cheap (the call costs roughly a tenth of the same call on Sonnet). It's fast (~300 ms). And it's a different model from Sonnet — its biases on what counts as "actionable" are not the same. This last point is what gives the layer its actual value: if you re-checked with the same model, you'd just be paying for a second rendering of the same mistake.

In our deployments this catches 0.5–2% of "silent" decisions, depending on the surface. That's not many. But every one of those is a critical message that would have been silently dropped — and the operational cost of one of them is much higher than the cost of all the Haiku calls combined.
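
Put end to end, the whole decision path fits in one handler. The helper names here (decideTool, safetyRecheck, logDecision) are illustrative wrappers around the calls shown above:

async function handleInbound(text: string, ctx: Context): Promise<string | null> {
  // Layer 1: deterministic gate, pure ACKs never reach a model.
  if (maybeAck(text)) {
    logDecision({ layer: 'regex_gate', tool_called: 'silent_acknowledge' })
    return null
  }

  // Layer 2: forced tool calling on the closed palette (the Sonnet call above).
  let { toolName, args } = await decideTool(text, ctx)

  // Layer 4: safety re-check before accepting a silent outcome (the Haiku call above).
  if (toolName === 'silent_acknowledge') {
    const recheck = await safetyRecheck(text)
    if (recheck.actionable && recheck.confidence > 0.6) {
      toolName = 'escalate_perturbation' // or whichever escalation matches recheck.signal
      args = { category: recheck.signal }
    }
  }

  if (toolName === 'silent_acknowledge') return null // nothing visible to the user

  // Layer 3: deterministic template rendering; the LLM never writes this string.
  return render(toolName, args, ctx)
}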

Observability — every decision emits an event

You cannot operate this pipeline blind. Every decision emits a structured event:

posthog.capture({
  distinctId: ctx.orgId, // posthog-node's capture() needs a distinct id; org-scoped here
  event: 'agent_decision',
  properties: {
    agent: 'support_chat',
    flow: 'inbound_message',
    layer: 'sonnet_forced_tool', // 'regex_gate' | 'sonnet_forced_tool' | 'haiku_safety'
    tool_called: 'respond_procedure',
    template_path: 'respond_procedure.document_missing.identity',
    silent: false,
    model: 'claude-sonnet-4-6',
    cache_read_tokens: 5832,
    cache_write_tokens: 0,
    input_tokens: 412,
    output_tokens: 87,
    latency_ms: 950,
    user_org_id: ctx.orgId,
    decision_id: nanoid(),
  },
})

What this gives you in the dashboard:

  • Tool selection drift: tool A was called 40% of the time last week, 25% this week — something changed.
  • Cache hit rate: if it drops, prompt order or template format probably changed.
  • Silent override rate: how often does Haiku catch a Sonnet miss? If it spikes, Sonnet's selection is degrading.
  • Latency budget: regex (1ms) → Sonnet (700–1200ms) → Haiku (300ms).
  • Cost per decision computed live, summed per org per day for billing or alerting.

This is the single most valuable change after shipping the architecture. You go from "did the agent do something weird this week?" to "tool X is selected 12% more often, here's why."
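
Drift detection doesn't need anything fancier than comparing tool share between two windows of exported decision events. A sketch, with an arbitrary 10-point alert threshold:

type DecisionEvent = { tool_called: string } // one exported row per agent_decision event

function toolShare(events: DecisionEvent[]): Map<string, number> {
  const counts = new Map<string, number>()
  for (const e of events) counts.set(e.tool_called, (counts.get(e.tool_called) ?? 0) + 1)
  return new Map([...counts].map(([tool, n]) => [tool, n / events.length]))
}

// Flag tools whose share of decisions moved by more than 10 points week over week.
function driftAlerts(lastWeek: DecisionEvent[], thisWeek: DecisionEvent[]): string[] {
  const before = toolShare(lastWeek)
  const after = toolShare(thisWeek)
  const tools = new Set([...before.keys(), ...after.keys()])
  return [...tools].filter(
    tool => Math.abs((after.get(tool) ?? 0) - (before.get(tool) ?? 0)) > 0.10,
  )
}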

When NOT to use this pattern

Forced tool calling adds design overhead. It's not free. We do not use it for:

  • Pure information chat with no consequences (a docs assistant, an internal Q&A bot). Free-text LLM with citations is faster to ship and the failure modes are visible — wrong answer is wrong, you can grep transcripts.
  • Creative writing assistants. The whole point is the prose.
  • Demos and prototypes. Don't optimize for safety on something that may never ship.
  • Single-turn classifiers that don't reply to the user. You're already in tool-call territory; you don't need the template layer.

The pattern shines specifically when a wrong sentence is expensive: ops chats, customer support with policy implications, regulated domains, voice agents that represent the company on the phone, anything where the user might quote you back to a court or a journalist.

What it costs to design

Honestly: more time than a free-text chatbot. The work breaks down roughly as:

  • Tool palette design and arg schemas: 2–4 days
  • Templates (writing, reviewing with stakeholders, i18n): 3–7 days
  • The two-layer pipeline + safety re-check: 1–2 days
  • Eval set covering tool selection accuracy: 1–2 days
  • Observability wiring: half a day

So 1.5–3 weeks of design and implementation work upfront. The payoff: nine to twelve months later, when the system has handled tens of thousands of conversations, you have not had a single "the agent promised X" incident. The bug-fix cost we'd have paid without this pattern would have eaten that upfront budget many times over.

Closing thoughts

The mental shift is small but important: stop trying to make a free-text LLM safe, and start treating the LLM as a structured-output classifier whose outputs are rendered by code. The LLM is excellent at picking from a closed set given context. It is bad at not bending rules in prose. Use it for the first thing; don't use it for the second.

If you're shipping a conversational agent in a context where wrong sentences are expensive, this is the architecture we'd reach for. If you'd like a sanity check on a design that's drifting toward "yet another rule in the system prompt", get in touch. We've designed this pipeline a few times now — we can do it for you faster.


