CASE

How we cut a client's LLM bill 71% with prompt caching

Real client, real numbers — $40K/month down to $11.5K/month with no regression in output quality. Here's the diagnosis, the rewrite, and the three things we'd do differently next time.

READ · 7 MIN UPDATED · 2026-04-15 STACK · CLAUDE SONNET 4.6

The setup

The client runs a B2B platform that generates structured summaries of uploaded customer documents. Volume averages 18,000 documents/day. The original implementation called Claude Sonnet 4.6 with a ~6,000-token system prompt explaining the output schema, plus the document content (median ~3,000 tokens), and asked for a JSON summary back.

Math at standard rates: 9K input × $3/Mtok = $0.027, plus ~$0.03 output = ~$0.057/call. At 18K calls/day × 30 days, that's 540K calls/month × $0.057 ≈ $30,800. Add weekend traffic and re-runs, and the actual monthly bill came in at $40,200.
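The per-call figure reproduces directly. The ~2K output tokens below is our inference from the ~$0.03 output cost at Sonnet's $15/Mtok output rate, not a number from the client:

```python
INPUT_RATE = 3 / 1_000_000    # $/token, Sonnet input at standard rates
OUTPUT_RATE = 15 / 1_000_000  # $/token, Sonnet output

# 9K input tokens (~6K system prompt + ~3K document), ~2K output (inferred)
per_call = 9_000 * INPUT_RATE + 2_000 * OUTPUT_RATE  # $0.027 + $0.030
monthly = 18_000 * 30 * per_call                     # 540K calls/month
print(f"${per_call:.3f}/call -> ${monthly:,.0f}/month")
# -> $0.057/call -> $30,780/month
```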

The diagnosis

The 6,000-token system prompt was the killer. It was identical on every single call, but the team was paying the full input rate to send it every time. Prompt caching exists for exactly this — Claude bills cache reads at 10% of the input rate, which would drop the system prompt cost by 90%.
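The savings on that prefix are easy to see. (Note the first request after a cache expiry pays a small write premium over the base input rate; the 10% read rate applies to every subsequent hit.)

```python
# Per-call cost of the identical 6,000-token system prompt:
# full input rate vs. a cache read at 10% of that rate.
full_rate = 6_000 * 3 / 1_000_000      # $0.0180 every call
cache_read = 6_000 * 0.30 / 1_000_000  # $0.0018 on a cache hit
print(f"{1 - cache_read / full_rate:.0%} saved on the prompt portion")
# -> 90% saved on the prompt portion
```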

The rewrite

Three changes:

  1. Move the system prompt into a cache block. One cache_control: ephemeral marker on the system message, 30 lines of code total.
  2. Move stable schema docs into the cache too. The schema spec was another ~1,500 tokens reused on every call.
  3. Switch async workloads (75% of volume) to the Batch API for the additional 50% discount. The existing async pipeline already had a queue, so this was a minor change.

The new math

After the rewrite:

540K calls at a blended ~$0.0213/call ≈ $11,520/month. Down from $40,200. 71% reduction, no quality regression on the eval set, two weeks of engineering work to ship.
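The headline figure checks out against the two monthly totals:

```python
before, after = 40_200, 11_520  # $/month, before and after the rewrite
print(f"{1 - after / before:.0%} reduction")
# -> 71% reduction
```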

What we'd do differently

  1. Cache from day one, not as an optimisation. Treat caching as a first-class architectural decision, not a tuning step. We've never seen a workload with a stable prefix where caching wasn't worth it.
  2. Build the cost dashboard before the feature. The client noticed the bill at $40K. With per-feature cost breakdowns visible from week one, they would have asked the question at $5K.
  3. Right-size more aggressively. A few of the simpler document types didn't need Sonnet — they would have been fine on Haiku 4.5 ($1/$5 vs $3/$15) for another 30-40% saving on that slice of traffic.
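The routing in point 3 is a one-function change once document types are classified. The type names and model id strings here are hypothetical, not from the client's codebase:

```python
# Assumption: these document types were the "simpler" ones; the real set
# would come from eval results, not a hardcoded guess.
SIMPLE_TYPES = {"invoice", "receipt"}

def pick_model(doc_type: str) -> str:
    """Route low-complexity documents to Haiku, everything else to Sonnet."""
    if doc_type in SIMPLE_TYPES:
        return "claude-haiku-4-5"   # $1/$5 per Mtok
    return "claude-sonnet-4-6"      # $3/$15 per Mtok
```

The important part is gating the downgrade on eval results per document type, so the cheaper model only handles slices where it matches Sonnet's quality.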

Cross-references

Suspect your AI bill is 2x what it should be? It probably is.

BOOK A COST AUDIT → RUN THE NUMBERS →