Platforms·May 9, 2026·10 min read

Prompt Caching with Claude — What `cache_control: ephemeral` Actually Saves

Anthropic prompt caching can cut your bill 80–95% on the right shapes. It can also do nothing at all if you mis-order your blocks. The patterns, the pitfalls, and the numbers from production.

Ikki

Last verified · May 9, 2026

TL;DR

Prompt caching with Claude is the highest-leverage cost optimization available on Anthropic's API in 2026. Done right, it cuts the input-token bill on long, stable contexts by 80–95% and shaves 20–40% off median latency. Done wrong, it does nothing, or quietly adds the 1.25× write premium to every request.

Across the agents we run in production, the difference between "caching works" and "caching does nothing" is almost always one of three mistakes: blocks in the wrong order, the wrong content marked as cached, or expecting a hit across a TTL boundary that has already passed.

This article is the production-tested mental model.

The pricing shape that makes caching worth it

Anthropic's caching has two prices and one TTL:

Cache write: 1.25× the standard input price (you pay a small premium to write the cache the first time)
Cache read: 0.1× the standard input price (a 90% discount on the cached portion)
Default TTL: 5 minutes from last read (it slides on every hit). An extended 1-hour TTL is available for a higher write multiplier.

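The break-even is a single read. A write plus one read within 5 minutes costs 1.35× one uncached pass over the prefix, against 2× for two uncached calls, so you've already saved money. Read it ten times, and you've paid about 20% of the bill you'd have paid without caching (2.25× against 11×); the ratio approaches 10% as reads accumulate.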
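The arithmetic is short enough to check directly. A minimal sketch that uses only the 1.25× write and 0.1× read multipliers listed above; the function name `relativeCost` is ours, and costs are expressed relative to sending the same prefix uncached on every call:

```typescript
// Multipliers from the pricing list above (default 5-minute TTL).
const WRITE_MULTIPLIER = 1.25;
const READ_MULTIPLIER = 0.1;

// Cost of one cache write followed by `reads` cache hits, divided by the
// cost of sending the same prefix uncached on all (reads + 1) calls.
function relativeCost(reads: number): number {
  const cached = WRITE_MULTIPLIER + READ_MULTIPLIER * reads;
  const uncached = 1 + reads;
  return cached / uncached;
}

for (const reads of [0, 1, 2, 10]) {
  console.log(`${reads} reads -> ratio ${relativeCost(reads)}`);
}
```

A ratio below 1 means caching is cheaper. With zero reads the ratio is 1.25 (a pure loss), one read brings it to roughly 0.68, and ten reads to roughly 0.20.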

For agents that hit the API many times a minute on the same system prompt — chat workers, classifiers, batch pipelines — this is transformative. For one-shot calls that never repeat, it's net-negative.

The 4-block rule

Anthropic exposes a maximum of 4 cache breakpoints per request. Each breakpoint marks the end of a cacheable prefix: everything from the start of the request up to that point is hashed and cached, so each later breakpoint extends the one before it.

You allocate these four breakpoints across the input. The API assembles the cacheable prefix in a fixed order (tools, then system, then messages), so the canonical layout for a chat agent is:

┌─────────────────────────────────────────────┐ <- start
│ tool definitions (stable across the agent)  │
├─ cache_control: ephemeral ──────────────────┤ <- breakpoint 1
│ system prompt (long, stable)                │
│ - role, tool rules, safety floors           │
├─ cache_control: ephemeral ──────────────────┤ <- breakpoint 2
│ static context (org config, policy docs)    │
├─ cache_control: ephemeral ──────────────────┤ <- breakpoint 3
│ recent conversation history                 │
│ - last N messages                           │
├─ cache_control: ephemeral ──────────────────┤ <- breakpoint 4
│ current user message (volatile)             │
└─────────────────────────────────────────────┘ <- end (always uncached)

What this gets you in steady-state: the first three blocks are stable for the lifetime of the agent (often hours or days). Reads on those are 90% cheaper. Block 4 is "current conversation up to but not including this turn". It is read back on all retries and re-runs of the same turn; when a new message comes in, the previously cached prefix is still readable and only the newly added turns are written. Everything after block 4 is uncached, typically the last user message.

You'll rarely use all 4 in chat. Two or three is more common. The point is to know which boundaries to draw.
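As a rough sketch of how that layout could map onto a request body: the types below are local stand-ins rather than the official SDK types, and the tool names, placeholder texts, and `MODEL_ID` are hypothetical. Per Anthropic's documentation the prefix is assembled as tools, then system, then messages, whatever order the keys appear in.

```typescript
// Local types for illustration only; the official SDK ships its own.
type CacheControl = { type: "ephemeral"; ttl?: "5m" | "1h" };
type TextBlock = { type: "text"; text: string; cache_control?: CacheControl };
type Tool = {
  name: string;
  description: string;
  input_schema: Record<string, unknown>;
  cache_control?: CacheControl;
};
type Message = { role: "user" | "assistant"; content: TextBlock[] };

const EPHEMERAL: CacheControl = { type: "ephemeral" };

// Copy `history`, marking the final block of its last message as a breakpoint.
function markEndOfHistory(history: Message[]): Message[] {
  return history.map((message, i) => {
    if (i !== history.length - 1) return message;
    const content = message.content.map((block, j) =>
      j === message.content.length - 1 ? { ...block, cache_control: EPHEMERAL } : block,
    );
    return { ...message, content };
  });
}

function buildRequest(history: Message[], userText: string) {
  const tools: Tool[] = [
    { name: "lookup_order", description: "Hypothetical tool.", input_schema: { type: "object" } },
    {
      name: "refund_order",
      description: "Hypothetical tool.",
      input_schema: { type: "object" },
      cache_control: EPHEMERAL, // breakpoint: end of tool definitions
    },
  ];
  const system: TextBlock[] = [
    // breakpoint: end of the system prompt
    { type: "text", text: "[role, tool rules, safety floors]", cache_control: EPHEMERAL },
    // breakpoint: end of static per-org context
    { type: "text", text: "[org config, policy docs]", cache_control: EPHEMERAL },
  ];
  const messages: Message[] = [
    ...markEndOfHistory(history), // breakpoint: end of recent history
    { role: "user", content: [{ type: "text", text: userText }] }, // volatile tail, unmarked
  ];
  return { model: "MODEL_ID", max_tokens: 1024, tools, system, messages };
}

// Sanity check: the API accepts at most 4 breakpoints per request.
function countBreakpoints(request: ReturnType<typeof buildRequest>): number {
  const blocks: Array<{ cache_control?: CacheControl }> = [...request.tools, ...request.system];
  for (const message of request.messages) blocks.push(...message.content);
  return blocks.filter((block) => block.cache_control !== undefined).length;
}
```

Running `countBreakpoints` in a unit test is a cheap guard against a refactor that silently adds a fifth marker or drops one.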

The order matters more than people think

The cache key is a prefix hash of the input. If anything before a cache_control breakpoint changes, the breakpoint's hash changes, and the read becomes a write — at the higher price, with no discount.

This is the most common failure mode we see in code review:

// ❌ BROKEN — system prompt rebuilt with timestamp every call
const system = `You are an agent. Today is ${new Date().toISOString()}.

[long static rules block]
`

The Date() interpolation invalidates the cache on every single call. Cache write rate: 100%. Cache hit rate: 0%. The bill goes up (you're paying the 1.25× write multiplier on every request).

The fix is to push volatile content after all static content, so it's downstream of the breakpoint:

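// ✅ WORKS: static rules form the cached prefix, the timestamp lives in the volatile tail
const system = [
  {
    type: 'text',
    text: `You are an agent.

[long static rules block]
`,
    cache_control: { type: 'ephemeral' }, // breakpoint: everything up to here is cached
  },
]

const messages = [
  { role: 'user', content: `Today is ${new Date().toISOString()}.` }, // volatile, sits after the breakpoint
  // ... rest of the conversation
]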

A useful mental check: before every breakpoint, ask "is everything above this 100% identical to the last call?" If not, the breakpoint is wasted.
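That mental check can be automated in development. The sketch below fingerprints whatever you intend to cache and reports when it differs from the previous call for the same agent. The helper names are ours, and the FNV-1a-style hash is a local debugging aid, not the hash Anthropic uses for its cache key.

```typescript
// FNV-1a-style hash over the serialized prefix (UTF-16 code units).
// Local fingerprint for debugging only; NOT Anthropic's cache key.
function fingerprint(prefix: unknown): string {
  const text = JSON.stringify(prefix);
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16).padStart(8, "0");
}

// Last fingerprint seen per agent (in-memory, per process).
const lastSeen = new Map<string, string>();

// Returns true when the prefix for `agent` matches the previous call
// (or when this is the first call for that agent).
function prefixIsStable(agent: string, prefix: unknown): boolean {
  const current = fingerprint(prefix);
  const previous = lastSeen.get(agent);
  lastSeen.set(agent, current);
  return previous === undefined || previous === current;
}
```

Calling `prefixIsStable("chat-worker", { tools, system })` before each request and logging a warning on `false` would have flagged the timestamp bug above on the second call.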

The illusion of caching

Anthropic returns cache_creation_input_tokens and cache_read_input_tokens on every response. These two numbers are the only ground truth.

const response = await anthropic.messages.create({ ... })
console.log({
  cache_read: response.usage.cache_read_input_tokens,
  cache_write: response.usage.cache_creation_input_tokens,
  input: response.usage.input_tokens,
  output: response.usage.output_tokens,
})

In a healthy production agent on stable traffic, you'll see something like:

{ cache_read: 8421, cache_write: 0, input: 142, output: 87 }

Read tokens are large, write is zero, and standard input tokens reflect only the volatile tail. This is what success looks like.

What we've seen on first-look code that thinks it's caching:

{ cache_read: 0, cache_write: 8563, input: 142, output: 87 }

Write every call, never a read. Either the prefix is unstable, the request shape changed, or 5 minutes have elapsed between calls. Watch this metric like a hawk.

We log both fields on every API call as a PostHog property and build a dashboard with cache_hit_ratio = cache_read / (cache_read + cache_write) segmented by agent and model. When it drops below ~0.6, something has shifted upstream.
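The ratio itself is a few lines. A minimal sketch, assuming you have collected the `usage` objects for a time window; the `Usage` type is a local stand-in containing only the fields named above, and `cacheHitRatio` is our own name:

```typescript
// Local stand-in for the usage object returned with each response.
type Usage = {
  input_tokens: number;
  output_tokens: number;
  cache_read_input_tokens?: number | null;
  cache_creation_input_tokens?: number | null;
};

// cache_read / (cache_read + cache_write) over a batch of calls.
// Returns null when nothing was read or written, so "caching not in use"
// is not confused with "0% hit rate".
function cacheHitRatio(usages: Usage[]): number | null {
  let read = 0;
  let write = 0;
  for (const usage of usages) {
    read += usage.cache_read_input_tokens ?? 0;
    write += usage.cache_creation_input_tokens ?? 0;
  }
  const total = read + write;
  return total === 0 ? null : read / total;
}
```

Summing tokens across the window before dividing weights large prefixes more heavily than small ones, which matches how the bill is computed.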

Block ordering rules of thumb

After a few production deployments, the ordering rules below survived:

Tools come early. Tool definitions are stable for the agent's lifetime and they're often large (a 10-tool palette can be 3–5k tokens). Mark the end of tools with a breakpoint.
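System prompt before tools? After? You don't get to choose: the API assembles the cache prefix in a fixed order, tools first, then system, then messages, regardless of how the request is written. The practical consequence is that editing a tool definition invalidates the cached system prompt and everything after it, while editing the system prompt leaves the tools block readable. If you add tools more often than you change the role description, expect each tool release to trigger a full re-write of the prefix.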
Per-org context (config, docs, plan tier) goes after tools. Stable for the org session, changes per-tenant.
Recent conversation history goes after per-org context. Everything up to "the last message" can sometimes be cached if the conversation is being retried; the very last user message is always uncached.
The current user message goes last and is never marked. It's the volatile tail.

When the 5-minute TTL bites you

The default TTL is 5 minutes from last read. If your traffic isn't dense enough, you'll write the cache, get one or two reads, then watch it expire — and the next call writes again at the 1.25× price.

Three patterns to handle this:

1. Burst-friendly agents (most chat workers): when active traffic is on, multiple reads happen well within 5 minutes; the cache pays for itself many times over. When idle, the cache expires and the next user pays for one write. Net positive almost always.

2. Sparse-traffic agents (a back-office assistant queried 5 times an hour): the 5-minute TTL almost always expires between reads. Writes outnumber reads. Caching is a slight loss. Solution: switch to the 1-hour extended TTL (cache_control: { type: 'ephemeral', ttl: '1h' }), which carries a higher write multiplier (2× the standard input price) but stays alive long enough that you actually get reads. This wins on sparse but recurring traffic.

3. One-shot agents (a one-time PDF analysis pipeline): the request never repeats. Don't cache. You'll pay the write penalty for nothing.

The break-even between standard and 1-hour TTL depends on your shape, but the rule of thumb: if your average gap between requests on the same prefix is over 4 minutes and under 50 minutes, switch to 1h. Below 4 minutes, default 5-minute TTL is fine. Over 50 minutes, don't cache at all.
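That rule of thumb fits in a small helper. A sketch under the thresholds stated above; the names `chooseTtl` and `cacheControlFor` are ours, and how the exact 4-minute and 50-minute boundaries are treated is an assumption, since the rule does not say:

```typescript
type TtlChoice = "5m" | "1h" | "none";

// Map the average gap between requests on the same prefix to a TTL choice.
// Boundary handling (exactly 4 or 50 minutes) is an arbitrary assumption.
function chooseTtl(averageGapMinutes: number): TtlChoice {
  if (averageGapMinutes < 4) return "5m";
  if (averageGapMinutes <= 50) return "1h";
  return "none";
}

// Translate the choice into a cache_control value, or undefined for "don't cache".
function cacheControlFor(choice: TtlChoice): { type: "ephemeral"; ttl?: "1h" } | undefined {
  if (choice === "none") return undefined;
  return choice === "1h" ? { type: "ephemeral", ttl: "1h" } : { type: "ephemeral" };
}
```

The back-office assistant from pattern 2 (5 queries an hour, so a gap of about 12 minutes) lands on `"1h"`.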

Multi-turn caching — what's actually cached

A subtle point that trips teams up: when the conversation grows, the prefix grows with it. Each new turn extends the messages array, so the prefix that was cached at turn 5 is no longer the whole prefix at turn 6: turn 6's prefix is "everything up to turn 5 inclusive" plus whatever was appended since, a longer string that starts with the old one.

The way Anthropic handles this is that the whole prefix up to the last cache breakpoint is checked against the cache. If you put a breakpoint at the end of turn-5's history, then at turn 6, the API can still hit the cache for everything up to turn 5, and only turn 6's user message is uncached. That's exactly what you want.

Practically: put a breakpoint after the conversation history, not on individual messages. The API will roll the cache forward as the conversation grows, as long as the messages don't mutate.

What breaks this: editing past messages, re-ordering them, adding tool results in arbitrary positions. Each of those mutations invalidates the prefix from that point forward.
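One way to see how much of a conversation survives a change is to compare consecutive message arrays. This is a debugging sketch with a simplified `Message` shape (plain string content) and a helper name of our own; it models the append-only rule described above and makes no claim about how the API matches prefixes internally.

```typescript
// Simplified message shape for illustration (string content only).
type Message = { role: "user" | "assistant"; content: string };

// Number of leading messages that are identical in both arrays.
// Everything from this index onward can no longer be served from the old cache.
function sharedPrefixLength(previous: Message[], next: Message[]): number {
  const limit = Math.min(previous.length, next.length);
  let index = 0;
  while (
    index < limit &&
    previous[index].role === next[index].role &&
    previous[index].content === next[index].content
  ) {
    index++;
  }
  return index;
}
```

For an append-only conversation the result equals `previous.length`, meaning the whole earlier history is still a valid prefix. If it comes back smaller, something edited or re-ordered a past message at that index.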

Cost worked example

Real numbers from a chat-ops agent we run, for a single call mid-conversation:

Component	Tokens	Without caching	With caching
System prompt	1,200	$0.0036	$0.00036 (cache read)
Tool definitions	2,800	$0.0084	$0.00084 (cache read)
Org context	600	$0.0018	$0.00018 (cache read)
Recent history	3,400	$0.0102	$0.00102 (cache read)
Current message	180	$0.00054	$0.00054
Output	320	$0.0048	$0.0048
Total	8,500	$0.029	$0.0077

Both calls produce the same output. The cached version costs ~26% of the uncached version on this single call ($0.00774 against $0.02934). Across a busy day with thousands of calls per agent, this is the difference between an Anthropic bill that scales linearly with seats and one that scales sub-linearly.

The output tokens dominate when the input is small. Caching only helps the input side. So this technique is most powerful when you have large input, small output — exactly the shape of forced-tool-calling agents (large rules + tools, small structured output).
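The totals can be re-derived from the table rows. In this sketch the per-token prices are inferred from the rows themselves (1,200 input tokens for $0.0036, 320 output tokens for $0.0048), so treat them as assumptions. The type and function names are ours.

```typescript
// Prices inferred from the table rows; assumptions, not a price list.
const INPUT_PRICE_PER_TOKEN = 3 / 1e6;
const OUTPUT_PRICE_PER_TOKEN = 15 / 1e6;
const CACHE_READ_MULTIPLIER = 0.1;

type CallShape = {
  stablePrefixTokens: number; // system + tools + org context + history
  volatileTokens: number; // current message, never cached
  outputTokens: number; // always billed at the standard output rate
};

// Dollar cost of one call, with the stable prefix either read from cache or not.
function callCost(shape: CallShape, cacheHit: boolean): number {
  const prefixRate = cacheHit
    ? INPUT_PRICE_PER_TOKEN * CACHE_READ_MULTIPLIER
    : INPUT_PRICE_PER_TOKEN;
  return (
    shape.stablePrefixTokens * prefixRate +
    shape.volatileTokens * INPUT_PRICE_PER_TOKEN +
    shape.outputTokens * OUTPUT_PRICE_PER_TOKEN
  );
}

const chatOpsCall: CallShape = {
  stablePrefixTokens: 1200 + 2800 + 600 + 3400,
  volatileTokens: 180,
  outputTokens: 320,
};
const uncachedCost = callCost(chatOpsCall, false);
const cachedCost = callCost(chatOpsCall, true);
console.log(uncachedCost.toFixed(4), cachedCost.toFixed(4), (cachedCost / uncachedCost).toFixed(2));
```

For this call shape the sketch gives about $0.0293 uncached and $0.0077 cached, a ratio of about 0.26. Output accounts for $0.0048 of the cached total, which is why the ratio cannot fall much further on this shape.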

What you can't cache

A few things often confuse teams:

You can't cache output. Output tokens are always priced at standard rate. The optimization is on the input you reuse.
You can't cache across different model IDs. Switching between Sonnet and Haiku invalidates the cache. If you have a multi-model pipeline (Sonnet for routing, Haiku for safety re-check), each model has its own cache budget.
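You can't keep a cache hit across changes to tool definitions, tool_choice, or extended-thinking settings. Sampling parameters such as temperature and top_p are not documented as part of the cache key at the time of writing, so varying them should not cost you the cache.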
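Images and documents in user messages can be cached like text blocks at the time of writing, but adding or removing an image anywhere in the prompt invalidates the cached message blocks (check Anthropic's release notes for changes). Plan around this if you ship multimodal pipelines.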

When prompt caching is the wrong optimization

A few production scenarios where caching adds complexity without much benefit:

Short prompts (under ~1k tokens). The minimum cacheable length is model-dependent and starts at 1,024 tokens; a shorter prefix is processed without caching even when marked, and just above the minimum the savings are negligible.
High-variance prompts where every call has different content (one-shot summarization, custom report generation). The prefix is rarely identical → no cache hits.
Agents where output dominates input (long-form generation from a short prompt). Output isn't cached, so the optimization moves the needle by ~5%.
Edge cases of regulated workloads where reusing prompts could trigger compliance flags (specific contractual or data-residency requirements). Always check before turning on.

Putting it in production

Three concrete habits we recommend:

Log cache_read_input_tokens and cache_creation_input_tokens on every call, segmented by agent. This is your scoreboard. If you're not measuring it, caching is dark magic — and dark magic always disappoints.
Treat your prompt structure as a stable contract. Touching the system prompt is a release event, not a casual edit. Bumping a version of the cached block invalidates the cache for as long as it takes traffic to refill it. Be deliberate.
Build a "no-cache" canary. Run a small percentage of traffic with caching disabled and compare cost / latency / output quality. Caching shouldn't change quality, but bugs happen — the canary is your alarm.
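The canary from the last habit can be sketched in two small pieces: a deterministic bucket so a given request always lands on the same side, and a deep copy that drops every `cache_control` marker. Both helper names are ours. The bucket hash is a simple illustrative choice, not a statistically rigorous sampler, and the copy would also strip a tool-schema property that happens to be named `cache_control`.

```typescript
// Deterministically place a request in a bucket 0..99 and compare to `percent`.
// Simple polynomial hash; illustrative, not a rigorous sampler.
function isCanary(requestId: string, percent: number): boolean {
  let bucket = 0;
  for (let i = 0; i < requestId.length; i++) {
    bucket = (bucket * 31 + requestId.charCodeAt(i)) % 100;
  }
  return bucket < percent;
}

// Deep-copy a JSON-like request body, omitting every `cache_control` key.
function withoutCacheControl(value: unknown): unknown {
  if (Array.isArray(value)) return value.map((item) => withoutCacheControl(item));
  if (value !== null && typeof value === "object") {
    const copy: Record<string, unknown> = {};
    for (const [key, inner] of Object.entries(value as Record<string, unknown>)) {
      if (key !== "cache_control") copy[key] = withoutCacheControl(inner);
    }
    return copy;
  }
  return value;
}
```

A caller would then send `isCanary(requestId, 2) ? withoutCacheControl(body) : body` and tag the logged usage with which arm the request was in, so the two populations can be compared on cost, latency, and output quality.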

Closing thoughts

Prompt caching is one of the few API-level levers that genuinely pays back the engineering work. The pattern is simple — long stable prefix, short volatile tail, breakpoint between them, log the metrics — but the failure modes are silent. Most of the wins we've seen come from teams that started measuring cache hit rate, then went and fixed whatever was breaking it.

If you've turned caching on and you're not sure it's actually working, a 30-minute look at your cache_read / cache_write ratio per agent will tell you the answer immediately. If you'd like a second pair of eyes on the prompt structure, get in touch.
