The setup
The client runs a B2B platform that generates structured summaries of uploaded customer documents. Volume averages 18,000 documents/day. The original implementation called Claude Sonnet 4.6 with a ~6,000-token system prompt explaining the output schema, plus the document content (median ~3,000 tokens), and asked for a JSON summary back.
Math at standard rates: 9K input tokens × $3/Mtok = $0.027 input + ~$0.03 output ≈ $0.057/call. At 18K calls/day × 30 days = 540K calls/month, that's 540K × $0.057 ≈ $30,800. With re-runs and traffic spikes on top, the actual monthly bill came in at $40,200.
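The baseline arithmetic is easy to sanity-check; the token counts and rates below are the ones quoted above:

```python
MTOK = 1_000_000

input_tokens = 9_000   # ~6K system prompt + ~3K median document
input_rate = 3.00      # $/Mtok, Sonnet standard input
output_cost = 0.03     # ~$ per call, observed average

per_call = input_tokens / MTOK * input_rate + output_cost
monthly = per_call * 18_000 * 30  # 540K calls/month

print(f"${per_call:.3f}/call, ~${monthly:,.0f}/month")  # $0.057/call, ~$30,780/month
```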
The diagnosis
The 6,000-token system prompt was the killer. It was identical on every single call, but the team was paying full input rate to send it every time. Prompt caching exists exactly for this — Claude charges 10% of input rate on cache reads, which would drop the system prompt cost by 90%.
The rewrite
Three changes:
- Move the system prompt into a cache block. One `cache_control: ephemeral` marker on the system message, 30 lines of code total.
- Move stable schema docs into the cache too. The schema spec was another ~1,500 tokens reused on every call.
- Switch async workloads (75% of volume) to the Batch API for the additional 50% discount. Existing async pipeline already had a queue, so this was minor.
The new math
After the rewrite:
- Cache write: 7,500 tokens × $3.75/Mtok (cache-write rate) = $0.028 — but this only happens on cache miss, which is roughly 1 in 400 calls (cache TTL is 5 minutes, traffic stays warm).
- Cache read: 7,500 tokens × $0.30/Mtok = $0.0023.
- Document content: 3,000 tokens × $3/Mtok = $0.009.
- Output: ~$0.03 unchanged.
- Per-call total: ~$0.041 → with the 50% batch discount on 75% of volume → blended ~$0.026/call.
540K calls × $0.026 ≈ $14,000/month. Down from $40,200. A ~65% reduction, no quality regression on the eval set, two weeks of engineering work to ship.
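The blended figure works out as follows (cache-write amortisation, at ~$0.028 per ~400 calls, is small enough to ignore):

```python
cache_read = 7_500 / 1e6 * 0.30  # $0.00225 per call
document = 3_000 / 1e6 * 3.00    # $0.009 per call
output = 0.03                    # unchanged

per_call = cache_read + document + output          # ~$0.041
blended = 0.25 * per_call + 0.75 * per_call * 0.5  # 50% batch discount on 75% of volume
monthly = blended * 540_000

print(f"blended ${blended:.4f}/call, ~${monthly:,.0f}/month")
```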
What we'd do differently
- Cache from day one, not as an optimisation. Treat caching as a first-class architectural decision, not a tuning step. We've never seen a workload with a stable prefix where caching wasn't worth it.
- Build the cost dashboard before the feature. The client noticed the bill at $40K. With per-feature cost breakdowns visible from week one, they would have asked the question at $5K.
- Right-size more aggressively. A few of the simpler document types didn't need Sonnet — they would have been fine on Haiku 4.5 ($1/$5 vs $3/$15) for another 30-40% saving on that slice of traffic.
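The right-sizing idea reduces to a routing function. A hypothetical sketch, assuming document types are already classified upstream; the type names and model ids here are illustrative, not the client's:

```python
# Simple, high-volume document types go to Haiku ($1/$5 per Mtok);
# everything else stays on Sonnet ($3/$15 per Mtok).
HAIKU_TYPES = {"invoice", "receipt"}  # illustrative, not the client's taxonomy

def pick_model(doc_type: str) -> str:
    if doc_type in HAIKU_TYPES:
        return "claude-haiku-4-5"   # placeholder model id
    return "claude-sonnet-4-5"      # placeholder model id
```

Running the eval set per document type first is what tells you which types are safe to move.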
Cross-references
- For Claude pricing details: our Claude review covers the full tier breakdown and the prompt-caching math in depth.
- Run your own numbers with the LLM cost calculator — adjust the cache hit rate slider to model your prefix stability.
- The Claude vs ChatGPT trade-off lands a different way at this scale: see Claude vs ChatGPT for our take.