The problem
The client ran a B2B SaaS platform with ~80K active users. Support volume had grown 3x in 18 months while headcount had grown only 1.4x. The existing stack — Zendesk routing, Zapier-driven escalations, manual tier-1 triage — was burning the team out, and tier-1 deflection was stuck at 31% despite a comprehensive help centre.
The brief: rebuild support so most tier-1 issues resolve without a human, escalations land on the right agent without re-routing, and response quality matches or exceeds the human baseline. Three-week ship, no headcount cut on day one — they wanted to redeploy capacity to higher-tier work.
The diagnosis
We spent the first three days reading the last six months of tickets and studying the team's response patterns. Two things stood out:
- Most tier-1 tickets were product questions answerable from existing docs — but the docs lived in three places and the help-centre search was bad.
- The escalation routing was lossy — about 22% of escalations bounced between two teams before landing with the right one.
What we shipped
One router + three specialist agents, all on Claude Sonnet 4.6 except the triage-summary agent, which runs on Haiku 4.5.
- Router agent — classifies the ticket into one of 9 buckets and routes it to the right specialist or directly to a human queue; the call pattern is sketched after this list. 380ms average decision time, 96% routing accuracy on the eval set.
- How-to agent — handles "how do I do X" tickets by retrieving from the consolidated docs corpus and producing an answer. Includes inline links to the source docs.
- Billing agent — handles billing-related tickets by looking up the customer's account state via internal API, then producing a contextual answer. Has explicit guardrails for any action that would change account state (always escalates).
- Triage-summary agent — for anything escalated to a human, generates a 2-3-sentence summary of the issue + relevant context. Cuts the human's "read the thread" time meaningfully.
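Classification like the router's can be made deterministic by forcing a tool call whose schema enumerates the buckets. A minimal sketch of that call, assuming hypothetical bucket names and model ID (the real nine buckets and the production prompt aren't reproduced here):

```python
import anthropic

client = anthropic.Anthropic()

# Hypothetical bucket names -- the real nine aren't listed in this write-up.
BUCKETS = ["how-to", "billing", "bug-report", "account-access", "integrations",
           "data-export", "plan-change", "security", "other"]

def route(ticket_text: str) -> str:
    """Classify one ticket into a routing bucket via a forced tool call."""
    response = client.messages.create(
        model="claude-sonnet-4-6",  # assumed model ID for "Sonnet 4.6"
        max_tokens=128,
        tools=[{
            "name": "route_ticket",
            "description": "Assign a support ticket to exactly one routing bucket.",
            "input_schema": {
                "type": "object",
                "properties": {"bucket": {"type": "string", "enum": BUCKETS}},
                "required": ["bucket"],
            },
        }],
        # Forcing the tool means the model must answer with a valid bucket.
        tool_choice={"type": "tool", "name": "route_ticket"},
        messages=[{"role": "user", "content": ticket_text}],
    )
    return response.content[0].input["bucket"]
```

Constraining the output to an enum is what keeps a 380ms decision budget realistic: there is no free-text parsing step and no retry loop for malformed labels.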
All four use the same prompt-cached system prompt (~5K tokens of product context + behavioural guardrails). Cache hit rate stays above 92% in production. Deployed on Deno Deploy for the API surface, Postgres for the eval and conversation logs, Sentry for errors, Datadog for cost dashboards.
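A sketch of the shared cached system prompt, assuming a hypothetical PRODUCT_CONTEXT constant standing in for the ~5K tokens of product context and guardrails:

```python
import anthropic

client = anthropic.Anthropic()

PRODUCT_CONTEXT = "..."  # stand-in for the ~5K-token product context + guardrails

# Marking the system block with cache_control caches the whole prefix, so
# every call after the first pays only for the ticket-specific suffix.
SYSTEM = [{
    "type": "text",
    "text": PRODUCT_CONTEXT,
    "cache_control": {"type": "ephemeral"},
}]

def agent_call(model: str, messages: list[dict]):
    # Note: cache entries are keyed per model, so the Haiku triage-summary
    # agent warms its own cache entry separately from the Sonnet agents.
    return client.messages.create(
        model=model,
        max_tokens=1024,
        system=SYSTEM,
        messages=messages,
    )
```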
What we evaluated continuously
- Routing accuracy on a frozen 200-ticket eval set (re-run nightly).
- Human acceptance rate of agent-drafted responses (auto-flagged when the human agent edits more than 20% of the draft; flag logic sketched after this list).
- Customer thumbs-up/thumbs-down on each resolved ticket.
- Per-agent monthly cost, with anomaly alerts on spikes above 1.5x of the median (also covered in the sketch below).
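Neither flagging rule is spelled out above, so here is one plausible shape for both: a character-level Levenshtein edit ratio for the 20% threshold (the metric is an assumption; the team may diff tokens or sentences instead), and a median-based spend check:

```python
import statistics

def edit_ratio(draft: str, final: str) -> float:
    """Fraction of the draft changed by the human agent (Levenshtein / draft length)."""
    m, n = len(draft), len(final)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if draft[i - 1] == final[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # delete
                          curr[j - 1] + 1,     # insert
                          prev[j - 1] + cost)  # substitute
        prev = curr
    return prev[n] / max(m, 1)

def flag_heavy_edit(draft: str, final: str, threshold: float = 0.20) -> bool:
    """True when the human changed more than 20% of the drafted response."""
    return edit_ratio(draft, final) > threshold

def is_cost_spike(history: list[float], latest: float, factor: float = 1.5) -> bool:
    """True when the latest per-agent cost exceeds 1.5x the median of its history."""
    return latest > factor * statistics.median(history)
```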
What we'd do differently
- Start with the eval set. We built it in week 2; it should have existed from day 1 of week 1. Once in place, it caught two regressions during the build that we would otherwise have shipped.
- Right-size more aggressively. The router agent could probably run on Haiku 4.5 (it's a classification task at heart). Would save another ~25% on its cost line.
- Ship the cost dashboard before shipping the agents. Same lesson as our LLM bill case study: visibility before deployment, not after.
Stack
- Model: Claude Sonnet 4.6 for everything except the triage-summary agent (Haiku 4.5).
- Compute: Deno Deploy edge runtime.
- Storage: Postgres on Neon for evals + conversation logs.
- Observability: Datadog for cost + latency dashboards, Sentry for errors.
- Eval harness: A small in-house tool that replays the frozen eval set against any prompt or model change. ~80 lines of Python; its shape is sketched below.
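The harness itself isn't reproduced here; a minimal sketch of its shape, assuming a JSONL eval file with hypothetical `id`, `text`, and `expected_bucket` fields and a `route` function like the router sketch above:

```python
import json
import sys

def replay(eval_path: str, route) -> float:
    """Replay the frozen eval set through a routing function; print accuracy."""
    with open(eval_path) as f:
        tickets = [json.loads(line) for line in f]
    hits = 0
    for t in tickets:
        predicted = route(t["text"])
        if predicted == t["expected_bucket"]:
            hits += 1
        else:
            print(f"MISS {t['id']}: expected {t['expected_bucket']}, got {predicted}")
    accuracy = hits / len(tickets)
    print(f"routing accuracy: {accuracy:.1%} ({hits}/{len(tickets)})")
    return accuracy

if __name__ == "__main__":
    from router import route  # hypothetical import of the route() sketch above
    replay(sys.argv[1], route)
```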