The problem
The client ran a B2B SaaS platform with ~80K active users. Support volume had grown 3x in 18 months while headcount had grown only 1.4x. The existing stack — Zendesk routing, Zapier-driven escalations, manual tier-1 triage — was burning the team out, and tier-1 deflection was stuck at 31% despite a comprehensive help centre.
The brief: rebuild support so most tier-1 issues resolve without a human, escalations land on the right agent without re-routing, and response quality matches or exceeds the human baseline. Three-week ship, no headcount cut on day one — they wanted to redeploy capacity to higher-tier work.
The diagnosis
We spent the first three days reading the last six months of tickets and studying the team's response patterns. Two things stood out:
- Most tier-1 tickets were product questions answerable from existing docs — but the docs lived in three places and the help-centre search was bad.
- The escalation routing was lossy — about 22% of escalations bounced between two teams before landing with the right one.
What we shipped
One router + three specialist agents, all on Claude Sonnet 4.6 except the triage-summary agent, which runs on Haiku 4.5.
- Router agent — classifies the ticket into one of 9 buckets and routes it to the right specialist or directly to a human queue; the call pattern is sketched after this list. 380ms average decision time, 96% routing accuracy on the eval set.
- How-to agent — handles "how do I do X" tickets by retrieving from the consolidated docs corpus and producing an answer. Includes inline links to the source docs.
- Billing agent — handles billing-related tickets by looking up the customer's account state via internal API, then producing a contextual answer. Has explicit guardrails for any action that would change account state (always escalates).
- Triage-summary agent — for anything escalated to a human, generates a 2-3-sentence summary of the issue + relevant context. Cuts the human's "read the thread" time meaningfully.
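Classification like the router's can be made deterministic by forcing a tool call whose schema enumerates the buckets. A minimal sketch of that call, assuming hypothetical bucket names and model ID (the real nine buckets and the production prompt aren't reproduced here):

```python
import anthropic

client = anthropic.Anthropic()

# Hypothetical bucket names -- the real nine aren't listed in this write-up.
BUCKETS = ["how-to", "billing", "bug-report", "account-access", "integrations",
           "data-export", "plan-change", "security", "other"]

def route(ticket_text: str) -> str:
    """Classify one ticket into a routing bucket via a forced tool call."""
    response = client.messages.create(
        model="claude-sonnet-4-6",  # assumed model ID for "Sonnet 4.6"
        max_tokens=128,
        tools=[{
            "name": "route_ticket",
            "description": "Assign a support ticket to exactly one routing bucket.",
            "input_schema": {
                "type": "object",
                "properties": {"bucket": {"type": "string", "enum": BUCKETS}},
                "required": ["bucket"],
            },
        }],
        # Forcing the tool means the model must answer with a valid bucket.
        tool_choice={"type": "tool", "name": "route_ticket"},
        messages=[{"role": "user", "content": ticket_text}],
    )
    return response.content[0].input["bucket"]
```

Constraining the output to an enum is what keeps a 380ms decision budget realistic: there is no free-text parsing step and no retry loop for malformed labels.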
All four use the same prompt-cached system prompt (~5K tokens of product context + behavioural guardrails). Cache hit rate stays above 92% in production. Deployed on Deno Deploy for the API surface, Postgres for the eval and conversation logs, Sentry for errors, Datadog for cost dashboards.
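A sketch of the shared cached system prompt, assuming a hypothetical PRODUCT_CONTEXT constant standing in for the ~5K tokens of product context and guardrails:

```python
import anthropic

client = anthropic.Anthropic()

PRODUCT_CONTEXT = "..."  # stand-in for the ~5K-token product context + guardrails

# Marking the system block with cache_control caches the whole prefix, so
# every call after the first pays only for the ticket-specific suffix.
SYSTEM = [{
    "type": "text",
    "text": PRODUCT_CONTEXT,
    "cache_control": {"type": "ephemeral"},
}]

def agent_call(model: str, messages: list[dict]):
    # Note: cache entries are keyed per model, so the Haiku triage-summary
    # agent warms its own cache entry separately from the Sonnet agents.
    return client.messages.create(
        model=model,
        max_tokens=1024,
        system=SYSTEM,
        messages=messages,
    )
```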
What we evaluated continuously
- Routing accuracy on a frozen 200-ticket eval set (re-run nightly).
- Human acceptance rate of agent-drafted responses (auto-flagged when the human agent edits more than 20% of the draft; flag logic sketched after this list).
- Customer thumbs-up/thumbs-down on each resolved ticket.
- Per-agent monthly cost, with anomaly alerts on spikes above 1.5x of the median (also covered in the sketch below).
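Neither flagging rule is spelled out above, so here is one plausible shape for both: a character-level Levenshtein edit ratio for the 20% threshold (the metric is an assumption; the team may diff tokens or sentences instead), and a median-based spend check:

```python
import statistics

def edit_ratio(draft: str, final: str) -> float:
    """Fraction of the draft changed by the human agent (Levenshtein / draft length)."""
    m, n = len(draft), len(final)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if draft[i - 1] == final[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # delete
                          curr[j - 1] + 1,     # insert
                          prev[j - 1] + cost)  # substitute
        prev = curr
    return prev[n] / max(m, 1)

def flag_heavy_edit(draft: str, final: str, threshold: float = 0.20) -> bool:
    """True when the human changed more than 20% of the drafted response."""
    return edit_ratio(draft, final) > threshold

def is_cost_spike(history: list[float], latest: float, factor: float = 1.5) -> bool:
    """True when the latest per-agent cost exceeds 1.5x the median of its history."""
    return latest > factor * statistics.median(history)
```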
What we'd do differently
- Start with the eval set. We built it in week 2; it should have existed from day 1 of week 1. Once in place, it caught two regressions during the build that we would otherwise have shipped.
- Right-size more aggressively. The router agent could probably run on Haiku 4.5 (it's a classification task at heart). Would save another ~25% on its cost line.
- Ship the cost dashboard before shipping the agents. Same lesson as our LLM bill case study: visibility before deployment, not after.
Stack
- Model: Claude Sonnet 4.6 for everything except the triage-summary agent (Haiku 4.5).
- Compute: Deno Deploy edge runtime.
- Storage: Postgres on Neon for evals + conversation logs.
- Observability: Datadog for cost + latency dashboards, Sentry for errors.
- Eval harness: A small in-house tool that replays the frozen eval set against any prompt or model change. ~80 lines of Python; its shape is sketched below.
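The harness itself isn't reproduced here; a minimal sketch of its shape, assuming a JSONL eval file with hypothetical `id`, `text`, and `expected_bucket` fields and a `route` function like the router sketch above:

```python
import json
import sys

def replay(eval_path: str, route) -> float:
    """Replay the frozen eval set through a routing function; print accuracy."""
    with open(eval_path) as f:
        tickets = [json.loads(line) for line in f]
    hits = 0
    for t in tickets:
        predicted = route(t["text"])
        if predicted == t["expected_bucket"]:
            hits += 1
        else:
            print(f"MISS {t['id']}: expected {t['expected_bucket']}, got {predicted}")
    accuracy = hits / len(tickets)
    print(f"routing accuracy: {accuracy:.1%} ({hits}/{len(tickets)})")
    return accuracy

if __name__ == "__main__":
    from router import route  # hypothetical import of the route() sketch above
    replay(sys.argv[1], route)
```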