
Prompt Caching with Claude API: 7× Cost Reduction in Production

Real benchmarks from 3,500+ production requests. Per-model thresholds, the 1-hour TTL on Max subscriptions, and the hardcoded-zero bug that made caching invisible for a day.

TL;DR. With prompt caching, Claude bills cached input reads at 10% of the normal input price. With a stable system prompt and bursty traffic, that works out to a 5–7× cost reduction. The catch: the minimum size to cache is not 1024 tokens for every model — Haiku and Opus need ~4096. The Anthropic API reference will tell you 1024; that's the Sonnet number. We hardcoded zeroes in our usage logger, and it took a full day of zeroed cache columns to notice. Below: how the pricing math actually works, the per-model thresholds, the 1-hour TTL we get on Max plans, the bug, and what we measure now.

What prompt caching does

Caching changes how Anthropic prices repeated input. You mark a chunk of your prompt with cache_control: {type: "ephemeral"}, and on the next call within the TTL window Claude reads that chunk from the cache instead of re-processing it. Cache reads are billed at 10% of the normal input price. Cache writes cost 25% more than normal input — you pay a premium once, then get the discount on every subsequent hit.
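For concreteness, here's roughly what marking a block looks like with the Python SDK. A minimal sketch, not our production code: LONG_SYSTEM_PROMPT is a placeholder for whatever stable prefix you're caching, and the model name follows the table further down.

# Hedged sketch: one cacheable system block via the Python SDK
import anthropic

LONG_SYSTEM_PROMPT = 'Your stable instructions, RAG context, tool docs. ' * 200  # placeholder prefix

client = anthropic.Anthropic()
msg = client.messages.create(
    model='claude-sonnet-4-6',
    max_tokens=1024,
    system=[
        {
            'type': 'text',
            'text': LONG_SYSTEM_PROMPT,              # stable, well above the threshold
            'cache_control': {'type': 'ephemeral'},  # mark the prefix as cacheable
        },
    ],
    messages=[{'role': 'user', 'content': 'First question of the session'}],
)

# First call: cache_creation_input_tokens > 0. Repeat calls within the TTL: cache_read_input_tokens > 0.
print(msg.usage.cache_creation_input_tokens, msg.usage.cache_read_input_tokens)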

The math is brutally simple:

Normal call:  N tokens × full input price
Cached call:  N tokens × 10% of the input price, after a one-time write at 125%
Break-even:   by the second call you're already ahead

If your system prompt is 8,000 tokens and you call Claude 50 times in an hour with the same system prompt, you go from 50 × 8000 × full_price to roughly 1 × 8000 × 1.25 + 49 × 8000 × 0.10, which is about 8× cheaper on that prefix. Across whole requests we see closer to 7×, since every call also carries some uncached per-request input; the ratio stays high because the cached portion includes not just the system prompt but also conversation context that's stable across turns (RAG retrievals, tool definitions, few-shot examples).
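To sanity-check the ratio for your own workload, the arithmetic fits in a few lines. A back-of-the-envelope sketch: 1.25 and 0.10 are the published write/read multipliers; the prefix size and call count are whatever your traffic does.

# Hedged back-of-the-envelope: how much cheaper a cached prefix gets over N calls
def cache_savings(prefix_tokens: int, calls: int) -> float:
    uncached = calls * prefix_tokens                  # full input price every call
    cached = 1.25 * prefix_tokens                     # one cache write
    cached += 0.10 * prefix_tokens * (calls - 1)      # reads on every later call
    return uncached / cached

print(round(cache_savings(8000, 50), 1))   # -> 8.1 for the example above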

The point is: caching isn't optional at our scale. A 7× discount on input is what makes pooled-subscription economics work for everyone.

Per-model thresholds (where the docs lie)

The Anthropic API reference says:

"Minimum cacheable prompt length: 1024 tokens."

That's true for Sonnet. It's wrong for the other models. From the Cookbook and from our own measurements:

Model               Min characters   Min tokens (~)   Notes
claude-sonnet-4-6    4,500            ~1,024          Anthropic's published number
claude-haiku-4-5    17,000            ~4,096          4× the Sonnet floor
claude-opus-4-7     17,000            ~4,096          Same as Haiku, despite being top-tier

Below the threshold, marking a block with cache_control is a no-op. The block is sent, ignored as a cache candidate, and you pay full price every time. Worse: you don't get an error. The response just lacks cache_creation_input_tokens and cache_read_input_tokens. Silent failure mode.

We learned this trying to cache a 1,200-token system prompt on Haiku. Hours of dashboards with cc=0, cr=0. Same prompt on Sonnet: cc=1208, cr=0 first call, cc=0, cr=1208 thereafter. The 4× Haiku threshold is documented in the Anthropic Cookbook, which we should have read first.
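Because the failure is silent, we now check the usage block whenever we expect a cache hit. A small sketch of the idea (the helper name is ours; only msg.usage and its two cache fields come from the SDK):

# Hedged sketch: flag cache_control blocks that silently fell below the threshold
def warn_if_cache_noop(msg, expected_cache: bool) -> None:
    cc = getattr(msg.usage, 'cache_creation_input_tokens', 0) or 0
    cr = getattr(msg.usage, 'cache_read_input_tokens', 0) or 0
    if expected_cache and cc == 0 and cr == 0:
        print('WARNING: cache_control sent but no cache activity; block likely below the model threshold')

A check like this in our tests would have caught the Haiku surprise in minutes rather than hours.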

# What we ship now (excerpt from session_pool.py)
CACHE_MIN_CHARS = {
    'sonnet': 4500,    # ~1024 tokens
    'haiku':  17000,   # ~4096 tokens
    'opus':   17000,   # ~4096 tokens
}

def maybe_cache(model: str, system_prompt: str) -> dict:
    family = next((k for k in CACHE_MIN_CHARS if k in model), None)
    if family and len(system_prompt) >= CACHE_MIN_CHARS[family]:
        return {
            'type': 'text',
            'text': system_prompt,
            'cache_control': {'type': 'ephemeral'},
        }
    return {'type': 'text', 'text': system_prompt}

If the system prompt is shorter than the model's threshold, we don't add cache_control — there's no point paying the 25% write premium for something that won't be cached.

TTL: 5 minutes by default, 1 hour on Max

Cache entries expire after 5 minutes by default — the clock resets every time the cache is read. So a high-traffic system prompt stays warm indefinitely. A sleepy one needs to be hit at least once every 5 minutes to survive.

For accounts on the Claude Max plan — which is what every account in our OAuth pool uses — the TTL is 1 hour, automatically. No header to set, no ENABLE_PROMPT_CACHING_1H, no opt-in. The platform just gives Max users a longer window. That changed the math for us: a system prompt called once an hour is now cacheable across calls, where on a default plan it would expire and re-write.

(Technically there's a beta header to bump the TTL on the Console API tier — anthropic-beta: extended-cache-ttl-2025-04-11, together with "ttl": "1h" in cache_control — but Max accounts get 1 hour without asking.)

The hardcoded-zero bug

Here's the embarrassing one. Our usage logger pulled cache stats from the response like this:

# OLD CODE — broken for ~24 hours, ~3500 requests
log_record = {
    'cache_creation_input_tokens': 0,  # placeholder, fix later
    'cache_read_input_tokens': 0,      # placeholder, fix later
    ...
}

Those zeroes were placeholders during the initial integration, before caching was enabled. When we turned caching on, we forgot to wire the real values from msg.usage. Result: every request on April 28th has cc=0, cr=0 in our database, even when caching was working perfectly at the API level. The bills came in lower; the dashboards stayed flat.

Fix:

# NEW CODE
log_record = {
    'cache_creation_input_tokens': getattr(msg.usage, 'cache_creation_input_tokens', 0) or 0,
    'cache_read_input_tokens': getattr(msg.usage, 'cache_read_input_tokens', 0) or 0,
    ...
}

We use getattr with a default plus or 0 because the SDK can return None or omit the field entirely on calls below the cache threshold. Lesson: never ship placeholder zeroes to a logger. Use None and a hard schema constraint, so the bug surfaces as a NULL violation instead of silently lying for a day.
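What the fail-loudly version looks like, sketched with sqlite3 so it stays self-contained (the table shape is illustrative): the columns are NOT NULL, the code passes None until the real value is wired up, and the very first insert blows up instead of logging zeroes for a day.

# Hedged sketch: NOT NULL columns turn forgotten placeholders into immediate errors
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute("""
    CREATE TABLE usage_log (
        request_id                  TEXT PRIMARY KEY,
        cache_creation_input_tokens INTEGER NOT NULL,
        cache_read_input_tokens     INTEGER NOT NULL
    )
""")

record = {
    'request_id': 'req-001',
    'cache_creation_input_tokens': None,   # placeholder never wired up
    'cache_read_input_tokens': None,
}

# Raises sqlite3.IntegrityError: NOT NULL constraint failed -- on request one, not on day two
conn.execute(
    'INSERT INTO usage_log VALUES (:request_id, :cache_creation_input_tokens, :cache_read_input_tokens)',
    record,
)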

How caching interacts with our OAuth pool

We don't talk to api.anthropic.com directly. Every request goes through one of 5 Max-plan accounts via OAuth, rotated by load. That's its own blog post — short version: instead of paying $3 per million Sonnet input tokens, we pay 5 × $20/month for unlimited-within-rate-limits subscription access.

Caching layers on cleanly. The cache key is per-account (each Max plan is its own customer to Anthropic), so when we round-robin requests across the pool, each account warms its own cache. With 5 accounts and a stable system prompt, each account independently maintains a cached prefix; cache hit rates per account stabilize at ~60–80% within an hour of warming up.
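Those hit rates only hold if the same conversation keeps landing on the same account. A hedged sketch of that affinity (account IDs are placeholders; our real router also weighs current load, as mentioned above):

# Hedged sketch: conversation-sticky account selection to keep each account's cache warm
import hashlib

ACCOUNTS = ['max-01', 'max-02', 'max-03', 'max-04', 'max-05']   # placeholder IDs

def pick_account(conversation_id: str) -> str:
    # Same conversation -> same account, so the cached prefix it warmed stays useful.
    digest = hashlib.sha256(conversation_id.encode()).hexdigest()
    return ACCOUNTS[int(digest, 16) % len(ACCOUNTS)]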

The wrinkle: Sonnet and Opus over OAuth require a specific identity prefix as the first system block:

'system': [
    {'type': 'text', 'text': "You are Claude Code, Anthropic's official CLI for Claude."},
    {'type': 'text', 'text': user_system_prompt, 'cache_control': {'type': 'ephemeral'}},
]

The identity goes first without cache_control. The user's actual system prompt goes second with cache_control. Anthropic caches the cumulative prefix, so the identity chunk is implicitly cached after the first call, but you don't get billing visibility on it (it's part of the OAuth tier price you've already paid).

This setup avoids issue #35269 where missing-identity OAuth requests get 401'd at random.
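Putting the pieces together, the system list we send over the pool is assembled roughly like this. A sketch reusing maybe_cache from earlier; the identity string is verbatim from the block above.

# Hedged sketch: identity block first (uncached), user's stable prompt second (cached when large enough)
def build_system(model: str, user_system_prompt: str) -> list:
    return [
        {'type': 'text', 'text': "You are Claude Code, Anthropic's official CLI for Claude."},
        maybe_cache(model, user_system_prompt),
    ]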

What "cost" means on a subscription

This trips up new users of the platform. On metered billing, "cost per request" is a real dollar amount — input tokens × price + output tokens × price. On a Max subscription, your real cost is fixed at $20/month and the per-request "cost" reported by the SDK is 0 (zero). That's not a bug. The Anthropic SDK's total_cost_usd field is documented as a client-side estimate; on subscription tiers there's nothing to estimate.

We bill our customers by computing what their requests would have cost on metered billing — using the official Anthropic price list — and applying our markup. That number is meaningful (it's what it would have cost them to do this themselves). The number from the SDK is meaningless. We keep both: cost_usd (our calculated billable cost) and raw_cost_usd (SDK's number, always 0). The first one drives invoices. The second one is reserved for the day someone wants to display "you saved X% by using us."
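The calculation behind cost_usd is nothing exotic. A hedged sketch: the Sonnet list prices and the markup factor here are illustrative, and the 1.25/0.10 multipliers are the cache write/read rates from the pricing section above.

# Hedged sketch: metered-equivalent cost from a usage object, then our billable number
PRICES_PER_MTOK = {'sonnet': {'input': 3.00, 'output': 15.00}}   # illustrative list prices, USD
MARKUP = 1.2                                                     # placeholder

def metered_equivalent_usd(family: str, usage) -> float:
    p = PRICES_PER_MTOK[family]
    cc = getattr(usage, 'cache_creation_input_tokens', 0) or 0
    cr = getattr(usage, 'cache_read_input_tokens', 0) or 0
    total = (
        usage.input_tokens * p['input']          # uncached input at full price
        + cc * p['input'] * 1.25                 # cache writes at a 25% premium
        + cr * p['input'] * 0.10                 # cache reads at 10%
        + usage.output_tokens * p['output']      # output is never cached
    )
    return total / 1_000_000

def billable_usd(family: str, usage) -> float:
    return metered_equivalent_usd(family, usage) * MARKUP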

What we measure now

The dashboard we ship to ourselves and customers tracks:

  1. Cache hit rate by model: cache_read_input_tokens / total_input_tokens (computed as in the sketch below this list). Stable target: >40% on Sonnet workloads, >60% on RAG-heavy Haiku workloads where the retrieval context dominates.
  2. Average cached prefix size — how big the cached chunk is on average. If this drops, someone's varying the system prompt unintentionally (timestamps, IDs, anything dynamic at the top of the prompt kills caching).
  3. Cache write spike alerts — if cache_creation_input_tokens jumps for an account that was previously warm, the cache got evicted (TTL elapsed, or new prompt) and we want to know.
  4. Cost-with-cache vs. cost-without — same workload, both numbers, ratio. Currently averaging 6.8× across all accounts.
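Metric 1 is one aggregation over the usage log. A sketch against rows shaped like the log_record above; the model and input_tokens fields are assumed to be logged alongside the two cache columns.

# Hedged sketch: cache hit rate by model from usage-log rows
from collections import defaultdict

def hit_rate_by_model(rows: list[dict]) -> dict[str, float]:
    read, total = defaultdict(int), defaultdict(int)
    for r in rows:
        cc = r.get('cache_creation_input_tokens') or 0
        cr = r.get('cache_read_input_tokens') or 0
        read[r['model']] += cr
        total[r['model']] += (r.get('input_tokens') or 0) + cc + cr   # total input, cached or not
    return {m: read[m] / total[m] for m in total if total[m]}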

We send this report to the team Telegram at 21:30 Kyiv (18:30 UTC) every day. Anomalies get flagged.

Things that broke caching (debugging cheat sheet)

When the cache hit rate suddenly drops, in order of likelihood:

  • Dynamic content above cache_control. Even a single timestamp in the system prompt invalidates the entire cache. Move dynamic data into the user message, keep system prompts stable (a cheap drift check is sketched after this list).
  • The block became too small. If you trim a system prompt below the per-model threshold, caching silently turns off.
  • Wrong model thresholds. Already covered. Don't trust the API reference page for anything but Sonnet.
  • OAuth account rotation thrash. If you round-robin too fast across cold accounts, you pay write premiums on every account before any of them warm up. We cluster requests by user/conversation to maximize per-account locality.
  • 5-minute TTL on default plans. If your traffic is bursty and your plan isn't Max, the default TTL doesn't survive natural quiet periods.
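The drift check mentioned in the first bullet is cheap: hash the block you expect to be stable and alert when more than a handful of distinct hashes show up. A hedged sketch:

# Hedged sketch: detect unintended variation in the "stable" cached prefix
import hashlib
from collections import Counter

prefix_hashes = Counter()

def record_prefix(system_prompt: str) -> None:
    prefix_hashes[hashlib.sha256(system_prompt.encode()).hexdigest()] += 1

def prefix_drift(max_variants: int = 2) -> bool:
    # One stable prompt means one dominant hash; a timestamp or request ID above the
    # cache_control marker shows up here as a new hash on every single request.
    return len(prefix_hashes) > max_variants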

What's next

We're working on auto-segmentation: automatically detecting which parts of incoming prompts are stable across calls and inserting cache_control markers without user intervention. The naive version (split on the last user message) works for chat workloads but breaks for tool-heavy agent prompts where tool definitions appear after dynamic context. Probably a separate post once we ship it.

If you're building on Claude and not caching aggressively, you're paying 5–7× more than you should be. The setup is two lines of code; the ROI is immediate; the only real risk is the silent-failure mode when you're below the cache threshold.

Post about how we run the OAuth pool itself coming next.


If you want metered access to Claude that's already cache-optimized and pool-managed, that's exactly what Claude API Platform is. Signing up gets you $5 of free credits.
