LLM API Pricing Table

A single sortable table of what the major large language models actually cost to call. Every row shows the input and output price per million tokens, the context window, and any notable extras like cached-input discounts. Type your own per-call token counts and the table adds a live cost-per-call column so you can rank models by what your workload would pay, not just the sticker price. Filtering and sorting run entirely in your browser.

Prices are list / standard-tier USD per 1,000,000 tokens, compiled 2026-05. Providers change prices often, and open-weight models (Llama, Qwen, Mixtral, Gemma) are billed differently by each host, so those rows show a representative figure marked "hosted, varies". Always confirm on the provider's own pricing page before budgeting. Batch processing, cached input, and committed-volume tiers can cut these numbers substantially.

How to use the LLM API Pricing Table

The table loads sorted by input price, cheapest first. Click any column heading to re-sort — click Output $/1M to rank by completion cost, Context to find the longest windows, or click the same heading again to reverse the order. Use the Provider dropdown to narrow to one vendor, or type into Search model to filter by name (for example "haiku" or "flash").

The most useful control is the pair of token boxes. Enter how many input and output tokens a typical request uses and the Cost / call column recomputes for every model at once. A summarisation job might be 4,000 input and 300 output tokens; a chat reply might be 800 in and 400 out; a reasoning task can be 1,000 in and 8,000 out. Because input and output are priced separately — and output is usually three to five times dearer — the cheapest model for a read-heavy task is often not the cheapest for a generation-heavy one. Set the ratio that matches your use and sort by Cost / call to see the real ranking.
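The re-ranking described above can be sketched in a few lines of Python. This is an illustration of the arithmetic, not the page's actual code: the model names and prices are made-up placeholders, and `cost_per_call` and `rank` are hypothetical helper names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    input_per_m: float   # USD per 1,000,000 input tokens (placeholder value)
    output_per_m: float  # USD per 1,000,000 output tokens (placeholder value)


def cost_per_call(m: Model, input_tokens: int, output_tokens: int) -> float:
    """Price one request: each side is billed at its own per-million rate."""
    return (input_tokens * m.input_per_m + output_tokens * m.output_per_m) / 1_000_000


def rank(models, input_tokens: int, output_tokens: int):
    """Return the models sorted cheapest-first for the given token mix."""
    return sorted(models, key=lambda m: cost_per_call(m, input_tokens, output_tokens))


# Two invented models: one with cheap input, one with cheap output.
MODELS = [
    Model("model-a", input_per_m=0.50, output_per_m=4.00),
    Model("model-b", input_per_m=1.00, output_per_m=2.00),
]

summarise = rank(MODELS, input_tokens=4_000, output_tokens=300)    # read-heavy
reasoning = rank(MODELS, input_tokens=1_000, output_tokens=8_000)  # generation-heavy

print([m.name for m in summarise])  # model-a comes first
print([m.name for m in reasoning])  # model-b comes first
```

With these placeholder prices the same two models swap places as soon as the token mix changes, which is the effect the Cost / call column is meant to expose.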

Everything recalculates locally as you type. Nothing about your usage is sent anywhere.

How LLM API pricing works

Hosted language models are billed per token, not per word or per request. A token is a chunk of text averaging about four characters in English, so roughly 750 words come to about 1,000 tokens. Providers quote a price per million tokens and split it into two rates: the input (or prompt) price you pay for everything you send, and the output (or completion) price you pay for everything the model generates. Output almost always costs more — frequently three to five times more — because generating tokens one at a time is the expensive part of inference.
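If you do not have real token counts yet, the rule of thumb above is enough for a first guess. The sketch below only encodes that heuristic (about four characters, or about 0.75 words, per token in English); it is not a tokenizer, the function names are hypothetical, and actual counts depend on the model's tokenizer and the language of the text.

```python
import math

CHARS_PER_TOKEN = 4.0    # rough English average from the text above
WORDS_PER_TOKEN = 0.75   # roughly 750 words per 1,000 tokens


def estimate_tokens_from_chars(text: str) -> int:
    """Approximate token count from character length (heuristic only)."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def estimate_tokens_from_words(word_count: int) -> int:
    """Approximate token count from a word count (heuristic only)."""
    return math.ceil(word_count / WORDS_PER_TOKEN)


print(estimate_tokens_from_words(750))          # 1000
print(estimate_tokens_from_chars("x" * 4_000))  # 1000
```

For anything you plan to budget against, measure with the provider's own tokenizer or the usage figures returned by the API rather than relying on this estimate.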

Several factors move the effective price away from the headline number. Prompt caching lets you re-send a large fixed prefix (a long system prompt, a document, a tool schema) at a steep discount on subsequent calls; these are the cached-input figures in the Notes column. Batch APIs trade latency for roughly half-price asynchronous processing. Working in the other direction, reasoning models such as o1 and DeepSeek-R1 can look cheap per token but emit large volumes of hidden "thinking" tokens that you are billed for at the output rate, so their real cost per finished answer is much higher than the sticker suggests. And open-weight models (Llama, Qwen, Mixtral, Gemma) have no single official price; each host (Together, Fireworks, Groq, your own GPUs) sets its own, which is why those rows are flagged as varying.
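To see how caching and batch discounts change a per-call figure, here is a minimal sketch. All prices are placeholders, `effective_cost` is a hypothetical helper, and the code assumes the batch discount applies to the whole call and can be combined with cached-input pricing, which is not true of every provider.

```python
from typing import Optional


def effective_cost(
    input_tokens: int,
    output_tokens: int,
    input_per_m: float,
    output_per_m: float,
    cached_tokens: int = 0,
    cached_input_per_m: Optional[float] = None,
    batch_discount: float = 0.0,
) -> float:
    """Cost of one call in USD after cached-input and batch discounts.

    cached_tokens is the part of the input billed at the cached rate;
    batch_discount is a fraction (0.5 means half price).
    """
    if not 0 <= cached_tokens <= input_tokens:
        raise ValueError("cached_tokens must be between 0 and input_tokens")
    if not 0.0 <= batch_discount <= 1.0:
        raise ValueError("batch_discount must be between 0 and 1")
    if cached_input_per_m is None:
        cached_input_per_m = input_per_m  # no cached rate offered
    fresh_tokens = input_tokens - cached_tokens
    usd = (
        fresh_tokens * input_per_m
        + cached_tokens * cached_input_per_m
        + output_tokens * output_per_m
    ) / 1_000_000
    return usd * (1.0 - batch_discount)


# Placeholder prices: $2.00 input, $0.20 cached input, $8.00 output per 1M tokens.
list_price = effective_cost(10_000, 500, 2.00, 8.00)
with_cache = effective_cost(10_000, 500, 2.00, 8.00,
                            cached_tokens=8_000, cached_input_per_m=0.20)
with_batch = effective_cost(10_000, 500, 2.00, 8.00, batch_discount=0.5)

print(f"list:  ${list_price:.4f}")   # $0.0240
print(f"cache: ${with_cache:.4f}")   # $0.0096
print(f"batch: ${with_batch:.4f}")   # $0.0120
```

Note that caching only discounts the repeated prefix: the 2,000 fresh input tokens and all output tokens are still billed at the normal rates.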

For budgeting, the number that matters is cost per task, not cost per token. Multiply your expected input tokens by the input price and your expected output tokens by the output price, add them, and multiply by your call volume. The token boxes above do the per-call part of that arithmetic across every model at once, which is why a model that wins on raw input price can still lose once a long, generation-heavy output is priced in.
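Written out, the budgeting arithmetic above is a single expression scaled by call volume. The sketch below uses placeholder prices and an assumed volume of 50,000 calls a day over a 30-day month; `cost_per_task` and `budget` are hypothetical names.

```python
def cost_per_task(input_tokens: int, output_tokens: int,
                  input_per_m: float, output_per_m: float) -> float:
    """USD for one task: input and output are priced separately, per 1M tokens."""
    return (input_tokens * input_per_m + output_tokens * output_per_m) / 1_000_000


def budget(per_task_usd: float, calls: int) -> float:
    """Total USD for a given number of calls."""
    return per_task_usd * calls


# A chat reply of 800 input / 400 output tokens at placeholder prices
# of $1.00 (input) and $4.00 (output) per 1M tokens.
per_task = cost_per_task(800, 400, input_per_m=1.00, output_per_m=4.00)
monthly = budget(per_task, calls=50_000 * 30)

print(f"per task: ${per_task:.4f}")  # $0.0024
print(f"monthly:  ${monthly:,.2f}")  # $3,600.00
```

A fraction of a cent per call is easy to dismiss until it is multiplied by volume, which is why the per-task figure is the one to compare across models.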

Common use cases

  • Picking a model for a new feature. Set the input/output token mix your feature uses and sort by Cost / call to see which models fit the budget before you write any code.
  • Cutting an existing bill. Find a cheaper model with a similar capability tier — for example moving a classification task from a flagship to a mini or Flash model can drop cost by 10× or more.
  • Comparing reasoning vs standard models. Reasoning models bill their hidden thinking at the output rate; raise the output-tokens box to see the true cost difference against a standard model.
  • Choosing a long-context model. Sort by Context to find windows large enough for whole-codebase or whole-document prompts, then check what that context costs to fill.
  • Estimating monthly spend. Get the per-call figure here, then multiply by your expected daily request volume and by the number of days in the month for a quick run-rate.
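For the long-context case in the list above, "what that context costs to fill" is simply the window size priced at the input rate. A rough sketch, with placeholder window sizes and prices and a hypothetical `fill_cost` helper; it assumes a flat input rate, and any long-context surcharge a provider applies is not modelled.

```python
def fill_cost(context_tokens: int, input_per_m: float,
              fill_fraction: float = 1.0) -> float:
    """USD of input cost for one call that fills part or all of the window."""
    if not 0.0 <= fill_fraction <= 1.0:
        raise ValueError("fill_fraction must be between 0 and 1")
    return context_tokens * fill_fraction * input_per_m / 1_000_000


# Placeholder figures: a 1,000,000-token window at $1.25 per 1M input tokens,
# and a 200,000-token window at $3.00 per 1M, half full.
print(f"${fill_cost(1_000_000, 1.25):.2f}")     # $1.25
print(f"${fill_cost(200_000, 3.00, 0.5):.2f}")  # $0.30
```

This is the input side of a single call only; a large prompt that is re-sent on every request is paid for every time unless the provider's prompt caching applies.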

Frequently asked questions

Why is output more expensive than input?

Generating tokens is autoregressive: the model runs a full forward pass for every output token, one after another, whereas the input prompt is processed in parallel in a single pass. That sequential generation is the costly part of inference, so providers usually price output three to five times higher than input.

How current are these prices?

They were compiled in May 2026 from public provider pricing pages. LLM prices change frequently and often drop, so treat this table as a planning snapshot and confirm the live figure on the provider you actually use before committing a budget.

Why do open-weight models like Llama show "hosted, varies"?

Open-weight models have no single official API price. Each inference host — Together, Fireworks, Groq, Replicate, or your own GPUs — sets its own rate, and self-hosting cost depends entirely on your hardware and utilisation. The figure shown is a representative hosted price to make the row comparable, not an authoritative number.

Do reasoning models really cost what the table shows?

The per-token price is accurate, but reasoning models such as o1 and DeepSeek-R1 generate large amounts of intermediate "thinking" output that you are billed for at the output rate. A single answer can consume thousands of these tokens, so raise the output-tokens box to model their true cost per finished response rather than per token.
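A small sketch makes the gap concrete. The prices and token counts are placeholders chosen for illustration, `reasoning_call_cost` is a hypothetical helper, and the only assumption taken from the text is that thinking tokens are billed at the output rate.

```python
def reasoning_call_cost(input_tokens: int, visible_output_tokens: int,
                        thinking_tokens: int,
                        input_per_m: float, output_per_m: float) -> float:
    """USD for one call when hidden thinking tokens are billed as output."""
    billed_output = visible_output_tokens + thinking_tokens
    return (input_tokens * input_per_m + billed_output * output_per_m) / 1_000_000


# Placeholder prices: $1.00 input and $4.00 output per 1M tokens.
# The visible answer is 500 tokens; the model also "thinks" for 7,500 tokens.
sticker = reasoning_call_cost(1_000, 500, 0, 1.00, 4.00)
actual = reasoning_call_cost(1_000, 500, 7_500, 1.00, 4.00)

print(f"visible tokens only: ${sticker:.4f}")  # $0.0030
print(f"with thinking:       ${actual:.4f}")   # $0.0330
print(f"ratio: {actual / sticker:.1f}x")       # 11.0x
```

In the table, the equivalent step is to enter the visible answer plus the expected thinking tokens in the output-tokens box.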

How do I estimate my total monthly cost?

Enter your typical input and output tokens per call to get the Cost / call figure, then multiply by your expected number of calls per month. For workloads with a large fixed prompt, check whether the provider offers prompt caching: it can cut the cost of the cached part of the input by 75–90% on repeated calls, while new input and all output are still billed at the normal rates.