Batch API & Caching Savings Calculator

See how much the Batch API and prompt caching can cut your LLM bill. The major providers discount asynchronous batch jobs by about 50%, and caching repeated prompt prefixes can save up to 90% on input tokens. Enter your monthly token volume and prices, choose a provider preset, and compare the standard, batched, and cached monthly costs side by side. All calculations run in your browser.

Batch and caching are usually separate optimizations: batch suits large asynchronous jobs, caching suits repeated prompt prefixes in interactive use. The cached figure shows steady-state read savings and excludes cache-write costs, which some providers charge whenever a prefix is written to the cache (including again after it expires). Prices are representative and editable.

How to use the Batch API & Caching Savings Calculator

Pick a provider preset to load representative prices, the batch discount, and the cache-read multiplier, or choose "Custom" and enter your own. Set your monthly input and output token volumes in millions. The result compares three scenarios: paying standard rates, running the same work through the Batch API at the discounted rate, and serving a fraction of your input tokens from the prompt cache at the reduced multiplier. Each scenario shows its monthly cost and the savings against the standard price.
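The three-way comparison described above can be sketched in a few lines of Python. This is an illustrative model rather than the calculator's actual source: the function name, field names, and example prices are assumptions, and cache-write costs are left out, as they are in the tool.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    name: str
    monthly_cost: float  # USD per month
    savings: float       # USD saved versus standard rates


def compare_scenarios(
    input_mtok: float,       # monthly input volume, millions of tokens
    output_mtok: float,      # monthly output volume, millions of tokens
    input_price: float,      # USD per million input tokens (assumed value)
    output_price: float,     # USD per million output tokens (assumed value)
    batch_discount: float,   # 0.5 means a 50% batch discount
    cache_read_mult: float,  # 0.1 means cached input costs 10% of normal
    cached_fraction: float,  # share of input tokens served from cache, 0..1
) -> list[Scenario]:
    standard = input_mtok * input_price + output_mtok * output_price
    # Batch discounts both input and output tokens.
    batched = standard * (1 - batch_discount)
    # Caching discounts only the cached share of input; output is unchanged.
    cached_input = input_mtok * input_price * (
        cached_fraction * cache_read_mult + (1 - cached_fraction)
    )
    cached = cached_input + output_mtok * output_price
    return [
        Scenario("standard", standard, 0.0),
        Scenario("batched", batched, standard - batched),
        Scenario("cached", cached, standard - cached),
    ]


# Hypothetical inputs: 100M input and 20M output tokens at $3 / $15 per million.
for s in compare_scenarios(100, 20, 3.0, 15.0, 0.5, 0.1, 0.8):
    print(f"{s.name:<9} ${s.monthly_cost:>8.2f}  saves ${s.savings:.2f}")
```

With these made-up inputs the standard bill is about $600, the batched bill about $300, and the cached bill about $384, because caching leaves the $300 of output cost untouched.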

Use the cached-input-fraction field to model how much of your prompt is a stable, reusable prefix — a long system prompt or shared document that repeats across calls is a good caching candidate, while unique user input is not. The higher the reused fraction, the larger the caching saving.
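To see how the reused fraction drives the result, it can help to compute the blended multiplier applied to your input bill. The sketch below assumes a cache-read multiplier of 0.1; the function name is illustrative and not part of the tool.

```python
def effective_input_multiplier(cached_fraction: float, cache_read_mult: float) -> float:
    """Blended price multiplier on input tokens for a given cached fraction."""
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be between 0 and 1")
    # Cached tokens pay the reduced rate; the rest pay the full rate.
    return cached_fraction * cache_read_mult + (1.0 - cached_fraction)


for fraction in (0.0, 0.25, 0.5, 0.75, 0.9):
    mult = effective_input_multiplier(fraction, cache_read_mult=0.1)
    print(f"cached fraction {fraction:.0%}: input billed at {mult:.1%} of standard")
```

The relationship is linear: at a 0.1 multiplier, every additional 10 points of cached fraction removes 9 points from the input bill.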

How batch discounts and prompt caching save money

A Batch API trades latency for price. Instead of answering in real time, you submit a file of requests and the provider processes them within a window (typically up to 24 hours), filling spare capacity. In return, OpenAI, Anthropic, and Google all discount batch jobs by about 50% on both input and output tokens. For any workload that does not need an immediate response — bulk classification, embeddings, evaluation runs, content generation pipelines — that is a straight halving of cost for the same model and quality.

Prompt caching attacks a different cost. When many requests share a long common prefix (a detailed system prompt, a style guide, a retrieved document), the provider can cache the processed prefix and skip recomputing it. Cached input tokens are billed at a steep discount. As representative figures, OpenAI charges about 50% of the input rate for cache hits, Gemini about 25%, and Anthropic as little as 10% (a 90% saving) for reads; Anthropic in turn charges a small premium on the initial cache write. Exact rates vary by model, which is why the multiplier is editable. Caching only helps the input side and only when the prefix actually repeats, but for prompt-heavy applications it can dominate the savings.
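As a rough illustration of the read multipliers quoted above, the sketch below prices one repeated prefix under each of them. The workload and the single shared input price are hypothetical, chosen only to isolate the effect of the multiplier, and every request is treated as a cache read with no write premium.

```python
# Cache-read multipliers as quoted in the text above; actual rates vary by model.
CACHE_READ_MULT = {"OpenAI": 0.5, "Gemini": 0.25, "Anthropic": 0.1}


def prefix_cost(prefix_tokens: int, requests: int,
                input_price_per_mtok: float, read_mult: float = 1.0) -> float:
    """USD cost of sending the same prefix with every request.

    read_mult=1.0 means no caching; cache-write premiums are ignored.
    """
    total_mtok = prefix_tokens * requests / 1_000_000
    return total_mtok * input_price_per_mtok * read_mult


# Hypothetical workload: a 10,000-token system prompt reused on 1,000 requests,
# priced at an assumed $3 per million input tokens for every provider.
uncached = prefix_cost(10_000, 1_000, 3.0)
for provider, mult in CACHE_READ_MULT.items():
    cached = prefix_cost(10_000, 1_000, 3.0, read_mult=mult)
    print(f"{provider}: ${cached:.2f} cached vs ${uncached:.2f} uncached")
```

Real input prices differ between providers and models, so a true comparison would use each provider's own rate rather than one shared figure.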

The two techniques usually apply to different workloads rather than stacking on the same request: batch is for asynchronous bulk jobs, caching for repeated interactive prompts. This calculator therefore shows them as separate alternatives against the standard price, so you can see which optimization fits your traffic and how much each is worth at your volume.
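One way to judge which lever is bigger is to ask what cached fraction caching would need in order to match the batch saving. The helper below is a sketch under the same simplified cost model (no cache-write costs); its name and the example figures are assumptions for illustration.

```python
from typing import Optional


def breakeven_cached_fraction(
    input_cost: float,       # standard monthly input cost, USD
    output_cost: float,      # standard monthly output cost, USD
    batch_discount: float,   # 0.5 means a 50% batch discount
    cache_read_mult: float,  # 0.1 means cached input costs 10% of normal
) -> Optional[float]:
    """Cached fraction at which the caching saving equals the batch saving.

    Returns None when even a fully cached input cannot match batch.
    """
    batch_saving = batch_discount * (input_cost + output_cost)
    max_cache_saving = (1 - cache_read_mult) * input_cost
    if max_cache_saving <= 0:
        return None
    fraction = batch_saving / max_cache_saving
    return fraction if fraction <= 1 else None


# Output-heavy bill: caching cannot catch up with a 50% batch discount.
print(breakeven_cached_fraction(300, 300, 0.5, 0.1))
# Input-heavy bill: caching wins once roughly 61% of input is cached.
print(breakeven_cached_fraction(300, 30, 0.5, 0.1))
```

Because batch discounts output tokens and caching does not, the more output-heavy the workload, the harder it is for caching alone to beat batch.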

Common use cases

  • Bulk job budgeting. See the 50% saving from moving an offline pipeline to the Batch API.
  • Caching ROI. Estimate how much a long shared system prompt saves once cached.
  • Provider comparison. Compare how OpenAI, Anthropic, and Gemini price batch and cached tokens.
  • Architecture decisions. Decide whether async batching or caching is the bigger lever for your traffic.

Frequently asked questions

How big is the Batch API discount?

OpenAI, Anthropic, and Google all discount Batch API jobs by about 50% on both input and output tokens, in exchange for asynchronous processing within a window of up to 24 hours. The exact figure is editable here in case a provider or tier differs.

Can I combine batching with prompt caching?

Usually not on the same workload. Batch processing is asynchronous and best for bulk jobs, while caching benefits sequential requests that reuse a prefix. Some providers, Anthropic for example, do let the two discounts stack when a batch request hits the cache, but cache hits inside a batch are best-effort. This tool shows them as separate alternatives rather than stacking them, which reflects how they are typically used.

What is the cache read multiplier?

It is the fraction of the normal input price you pay for a cached (repeated) prompt prefix. OpenAI charges about 0.5×, Gemini about 0.25×, and Anthropic about 0.1× for cache reads. Only the cached fraction of your input tokens gets this rate; unique input is billed in full.

Does caching reduce output cost?

No. Caching only discounts input (prompt) tokens that repeat across requests. Output tokens are always generated fresh and billed at the full output rate, which is why the cached scenario here only changes the input side.

Why exclude cache-write costs?

Some providers charge a premium to write a prefix into the cache (Anthropic, for example, bills cache writes at 1.25× the input rate). The net saving depends on how many times you reuse the prefix before it expires, so this tool shows the steady-state read saving assuming good reuse. Keep in mind that the first call after each cache expiry pays the write cost again.
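If you want to check whether the write premium matters for your traffic, a back-of-the-envelope break-even is easy to compute. The sketch below uses the 1.25× write and 0.1× read multipliers mentioned above and assumes one write followed by n reads before the cache expires; the function names are illustrative.

```python
import math


def reads_to_break_even(write_mult: float, read_mult: float) -> int:
    """Smallest number of cache reads after one write that makes caching
    strictly cheaper than sending the prefix uncached every time."""
    if not 0.0 <= read_mult < 1.0:
        raise ValueError("read_mult must be in [0, 1) for caching to pay off")
    # Cached: write_mult + n * read_mult.  Uncached: 1 + n.
    # Caching wins when n > (write_mult - 1) / (1 - read_mult).
    threshold = max(0.0, (write_mult - 1.0) / (1.0 - read_mult))
    return math.floor(threshold) + 1


def cost_ratio(write_mult: float, read_mult: float, reads: int) -> float:
    """Prefix cost with caching divided by prefix cost without, over
    one write plus `reads` cache hits."""
    return (write_mult + reads * read_mult) / (1 + reads)


print(reads_to_break_even(1.25, 0.1))
print(f"{cost_ratio(1.25, 0.1, 10):.1%}")
```

With these multipliers a single cache read already recovers the write premium, and after ten reads the prefix costs roughly a fifth of its uncached price. The steady-state figure is therefore a reasonable approximation whenever a prefix is reused many times per cache lifetime.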