Prompt Caching Savings Calculator

See how much prompt caching saves on your API bill. When many requests share a long, identical prefix (a system prompt, a document, a few-shot block), providers can cache it and charge a fraction of the normal input price to reuse it. Enter your cached and fresh token counts, your monthly volume, and your cache hit rate to compare input cost with and without caching and to find the break-even hit rate. All math runs locally.

Covers input-token cost only — output tokens are billed the same either way. Multipliers are applied to the base input price. Prices change; verify against your provider's current rate card.

How to use the Prompt Caching Savings Calculator

Pick a provider scheme to load its caching multipliers, or choose Custom and enter them yourself. The base input price is your model's normal per-million-token input rate; the write and read multipliers scale it for caching a prefix and for reusing one. Then describe your workload: how many tokens are in the shared prefix (the part that repeats), how many are unique per request, your monthly request count, and the share of requests that hit a warm cache.

The result compares input cost with and without caching and shows the monthly saving in both dollars and percent. The bigger your shared prefix relative to the fresh part, and the higher your hit rate, the more caching wins. If your prefix is small or your hit rate is low, caching can even cost slightly more because of the write surcharge — the calculator makes that trade-off visible.
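
The comparison described above can be sketched in a few lines of Python. This is a minimal model and not necessarily the calculator's exact formula: it assumes every cache miss pays the write multiplier on the whole prefix, every hit pays the read multiplier, and fresh tokens always pay the base rate. The names Workload and input_costs, and the $3 per million token base price, are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    prefix_tokens: int       # shared, cacheable tokens per request
    fresh_tokens: int        # unique tokens per request
    requests_per_month: int
    hit_rate: float          # fraction of requests that hit a warm cache (0..1)


def input_costs(w, base_price_per_mtok, write_mult, read_mult):
    """Return (without, with_cache, savings, savings_pct) in dollars per month."""
    per_token = base_price_per_mtok / 1_000_000
    without = w.requests_per_month * (w.prefix_tokens + w.fresh_tokens) * per_token

    # Assumption: hits pay read_mult on the prefix, misses pay write_mult.
    prefix_mult = w.hit_rate * read_mult + (1 - w.hit_rate) * write_mult
    with_cache = (
        w.requests_per_month
        * (w.prefix_tokens * prefix_mult + w.fresh_tokens)
        * per_token
    )

    savings = without - with_cache
    savings_pct = 100 * savings / without if without else 0.0
    return without, with_cache, savings, savings_pct


# Hypothetical workload: 10k-token prefix, 500 fresh tokens, 100k requests, 90% hits.
example = Workload(10_000, 500, 100_000, 0.9)
without, with_cache, savings, pct = input_costs(example, 3.0, 1.25, 0.10)
print(f"without=${without:.2f} with=${with_cache:.2f} saved=${savings:.2f} ({pct:.1f}%)")
```

With these inputs the model gives about $3,150 without caching and about $795 with it, a saving near 75%. Dropping hit_rate to 0.1 in the same example turns the saving negative (about -$405), which is the write-surcharge effect described above.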

How prompt caching pricing works

Large prompts often repeat. A coding agent resends the same system prompt and tool definitions on every turn; a document-QA app resends the whole document for each question; a few-shot classifier resends the same examples. Without caching you pay full input price for those tokens every single time. Prompt caching lets the provider store the processed prefix so subsequent requests that begin with the same tokens are billed at a steep discount, because the provider reuses its stored computation instead of reprocessing those tokens.

The pricing has two parts. Writing to the cache (the first request, a miss) can carry a surcharge because the provider does extra work to store the prefix: Anthropic charges about 1.25x the base input rate for a 5-minute cache write and about 2x for a 1-hour write. Reading from the cache (subsequent hits) is heavily discounted: roughly 0.1x base on Anthropic, 0.5x on OpenAI's automatic caching, and around 0.25x for Gemini's context caching. OpenAI applies caching automatically with no write surcharge; Anthropic requires you to mark cache breakpoints explicitly and lets you choose between the 5-minute and 1-hour cache lifetimes.

The economics hinge on three numbers: how large the shared prefix is, how often you reuse it before it expires (the hit rate), and the write-versus-read spread. For a long prefix reused thousands of times, the effective price of the prefix approaches the read multiplier; on Anthropic that is roughly a 90% saving on those tokens. A short prefix, or one that expires before reuse, barely helps, and the write surcharge can make it marginally more expensive. This calculator models exactly that so you can decide whether to restructure a prompt to maximize the cacheable prefix.
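
The trade-off between the write surcharge and the read discount can be written as a short derivation. The symbols are assumptions introduced here for illustration: h is the hit rate, w the write multiplier, r the read multiplier, and m(h) the average multiplier paid on prefix tokens, under the simplifying assumption that every miss pays w and every hit pays r.

```latex
\begin{aligned}
m(h) &= h\,r + (1-h)\,w \\
m(h^{*}) = 1 \;&\Longrightarrow\; h^{*} = \frac{w-1}{w-r} \\
\text{Anthropic, 5-minute cache: } w = 1.25,\; r = 0.1 \;&\Longrightarrow\; h^{*} = \frac{0.25}{1.15} \approx 0.217 \\
\lim_{h \to 1} m(h) &= r = 0.1 \quad \text{(about a 90\% saving on prefix tokens)}
\end{aligned}
```

In this simplified per-token model the prefix size cancels out of the break-even hit rate and only scales the dollar amount at stake. With no write surcharge (w = 1), the break-even is h* = 0, so any hit at all is a gain.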

Common use cases

  • Justifying a caching rollout. Put a monthly dollar figure on enabling caching before you change any code.
  • Structuring prompts. See how much you gain by moving stable content to the front so more of the prompt is cacheable.
  • Comparing providers. Weigh Anthropic's deep read discount with a write surcharge against OpenAI's automatic 50% cached rate.
  • Setting hit-rate targets. Find the hit rate at which caching breaks even for your prefix size.
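
The provider comparison and hit-rate questions above can be explored with a small sketch. It uses only the write and read multipliers quoted in this page for Anthropic's short-lived cache and OpenAI's automatic caching; the scheme names and the prefix_multiplier helper are hypothetical. It compares multipliers relative to each provider's own base price, so it assumes nothing about the base prices themselves, which you would need to factor in for a dollar comparison.

```python
# (write multiplier, read multiplier) as quoted in the text; verify before use.
SCHEMES = {
    "anthropic_5min": (1.25, 0.10),
    "openai_auto": (1.00, 0.50),
}


def prefix_multiplier(hit_rate, write_mult, read_mult):
    """Average multiplier paid on prefix tokens; 1.0 means no gain from caching."""
    return hit_rate * read_mult + (1 - hit_rate) * write_mult


def compare(hit_rate):
    """Map each scheme name to its average prefix multiplier at this hit rate."""
    return {
        name: prefix_multiplier(hit_rate, w, r)
        for name, (w, r) in SCHEMES.items()
    }


for h in (0.2, 0.5, 0.9):
    row = {name: round(m, 3) for name, m in compare(h).items()}
    print(f"hit rate {h:.0%}: {row}")
```

Under these assumptions a 20% hit rate leaves the Anthropic-style scheme slightly above 1.0 (a net cost) while the surcharge-free scheme is already below it; by 50% the deeper read discount pulls ahead, and at 90% the gap is wide.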

Frequently asked questions

Does caching reduce output token cost?

No. Caching only discounts input (prompt) tokens — the part that repeats. Output tokens are generated fresh every time and billed at the normal output rate, so this calculator covers input cost only. Add your output cost separately for a full bill estimate.

Why might caching cost more?

If your shared prefix is small or your hit rate is low, the write surcharge on cache misses can outweigh the read savings. Caching pays off when a substantial prefix is reused many times before it expires; for one-off prompts it is not worth it. The calculator shows when you are on the wrong side of that line.

What is a realistic cache hit rate?

It depends on traffic patterns and cache lifetime. A busy agent reusing a system prompt within the cache window can exceed 90%. Bursty or low-volume workloads where the cache expires between requests see much lower rates. Anthropic offers a longer 1-hour cache (at a higher write price) to raise hit rates for slower traffic.
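
For a rough sense of how traffic and cache lifetime interact, here is a back-of-the-envelope sketch. It rests on assumptions that may not match your workload: requests arrive as a Poisson process, all of them share one prefix, and each request resets the cache lifetime, so a request is a hit when the gap since the previous one is shorter than the lifetime. The function name estimated_hit_rate is hypothetical.

```python
import math


def estimated_hit_rate(requests_per_hour: float, ttl_minutes: float) -> float:
    """Probability that the gap between consecutive requests is under the TTL.

    Assumes Poisson arrivals, so gaps are exponentially distributed.
    """
    rate_per_minute = requests_per_hour / 60
    return 1 - math.exp(-rate_per_minute * ttl_minutes)


# Busy traffic vs. slow traffic, with 5-minute and 60-minute lifetimes.
for rph in (60, 6):
    for ttl in (5, 60):
        rate = estimated_hit_rate(rph, ttl)
        print(f"{rph:>3} req/h, {ttl:>2}-min cache: {rate:.1%}")
```

Under this model, one request per minute keeps a 5-minute cache warm more than 99% of the time, while six requests per hour hit it only about 39% of the time and need the longer lifetime to get back above 99%. Real traffic is usually burstier than Poisson, so treat the numbers as a starting estimate.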

How do I maximize the cacheable prefix?

Put everything stable at the very start of the prompt — system instructions, tool schemas, few-shot examples, reference documents — and place the user's variable input last. Caching matches on a common leading prefix, so any change near the front invalidates the cache for everything after it.
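
The effect of ordering can be illustrated with a small sketch that measures how many leading tokens two requests share. Whitespace-split words stand in for real tokens here, which is an assumption for illustration only; the function common_prefix_len and the sample prompts are hypothetical.

```python
def common_prefix_len(a, b):
    """Count how many leading items two token sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# Stand-in tokenization: split on whitespace (real tokenizers differ).
system = "You are a support agent. Answer from the manual below.".split()
question_a = "How do I reset my password?".split()
question_b = "What is the refund policy?".split()

# Stable content first: the whole system block is a shared prefix.
good_a = system + question_a
good_b = system + question_b
print(common_prefix_len(good_a, good_b))

# A per-request value at the very front breaks the match immediately.
bad_a = ["2024-01-01T10:00"] + system + question_a
bad_b = ["2024-01-01T10:05"] + system + question_b
print(common_prefix_len(bad_a, bad_b))
```

In the first layout all 10 system tokens are shared; in the second, the differing timestamp at position zero leaves no shared prefix at all, even though the rest of the prompt is identical.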

Are these prices current?

The multipliers reflect each provider's published caching model, but base prices and exact discounts change over time. Enter your model's current base input price and verify the multipliers against the provider's rate card for a precise figure.