Self-Hosting vs API Cost Calculator

Work out whether it is cheaper to self-host an open model on rented GPUs or to pay a per-token API. Enter your monthly token volume and the API price, pick a model size, quantization, and GPU, and the calculator estimates GPU throughput, the number of GPUs you would need, the self-hosted monthly cost, and the volume at which self-hosting becomes cheaper than the API. All calculations run in your browser.

Throughput is a memory-bandwidth estimate for single-stream decode (tok/s ≈ bandwidth ÷ (active params × bytes/param) × 0.8); real serving with batching can be much higher. GPU rates are representative community-cloud on-demand prices. Edit them to match your host.

How to use the Self-Hosting vs API Cost Calculator

Enter your monthly input and output token volumes in millions and the per-million API prices you would otherwise pay. Then describe the self-hosted setup: the model's active parameter count, the quantization you would run, the GPU, its hourly rental rate, and how busy you expect to keep it. The calculator estimates decode throughput, the number of GPUs required to meet your output volume, the self-hosted monthly bill, and the break-even token volume where the two approaches cost the same.
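The last two steps of that chain, from throughput to GPU count to monthly bill, can be sketched as follows. This is a minimal illustration, not the calculator's actual code: the 730-hour month, the rule that utilization scales each GPU's usable capacity, the rounding up to whole GPUs, and all function names are assumptions.

```python
import math

HOURS_PER_MONTH = 730  # assumption: average month length in hours
SECONDS_PER_MONTH = HOURS_PER_MONTH * 3600


def gpus_needed(output_tokens_millions: float,
                tokens_per_second: float,
                utilization: float) -> int:
    """Whole GPUs required to produce the monthly output volume."""
    # Tokens one GPU can produce in a month at the expected utilization.
    per_gpu = tokens_per_second * SECONDS_PER_MONTH * utilization
    # Assumption: you always rent at least one GPU and cannot rent a fraction.
    return max(1, math.ceil(output_tokens_millions * 1e6 / per_gpu))


def self_hosted_monthly_cost(gpu_count: int, gpu_hourly_rate: float) -> float:
    """GPUs are billed around the clock, whether busy or idle."""
    return gpu_count * gpu_hourly_rate * HOURS_PER_MONTH


# Hypothetical inputs: 500M output tokens per month, 100 tok/s per GPU,
# 60% utilization, and a rental rate of $1.50 per GPU-hour.
count = gpus_needed(500, 100.0, 0.6)
print(count, self_hosted_monthly_cost(count, 1.50))
```

With these made-up inputs the volume works out to a little over three GPUs' worth of capacity, so the sketch rounds up to four and bills all four for the full month.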

Utilization is the key lever for self-hosting: a GPU you rent around the clock but use only 60% of the time costs the same as one used fully, so each token it produces costs about 1.67 times as much (1 ÷ 0.6). Higher utilization and larger batches are what make self-hosting competitive at scale.
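The effect of utilization on per-token cost is simple arithmetic. The sketch below assumes the GPU is billed for every hour and that utilization is the fraction of that time spent generating; the function name and example numbers are hypothetical.

```python
def cost_per_million_tokens(gpu_hourly_rate: float,
                            tokens_per_second: float,
                            utilization: float) -> float:
    """Effective self-hosted cost per million generated tokens."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    # Tokens actually produced during one paid hour.
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return gpu_hourly_rate / tokens_per_hour * 1e6


# Hypothetical GPU: $1.50 per hour, 100 tok/s when busy.
for u in (1.0, 0.6, 0.3):
    print(f"{u:.0%} busy: ${cost_per_million_tokens(1.50, 100.0, u):.2f} per 1M tokens")
```

Halving utilization doubles the cost per token, because the hourly bill stays fixed while the number of tokens it buys shrinks.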

When self-hosting beats the API

Per-token APIs charge nothing when idle and scale to zero, which makes them the cheapest option at low and bursty volumes. Self-hosting flips the economics: you pay for GPU time whether or not you are generating, so the cost per token depends entirely on how busy you keep the hardware. There is therefore a break-even volume below which the API wins and above which a well-utilized GPU is cheaper.
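One simple way to locate that break-even point is to treat the GPU bill as a fixed monthly cost and the API bill as linear in volume. The sketch below makes those assumptions, along with a 730-hour month and a single blended API price per million tokens; the calculator itself may define break-even differently.

```python
HOURS_PER_MONTH = 730  # assumption: average month length in hours


def break_even_millions(gpu_hourly_rate: float,
                        gpu_count: int,
                        api_price_per_million: float) -> float:
    """Monthly volume, in millions of tokens, at which API spend equals
    the fixed cost of the rented GPUs."""
    fixed_monthly_cost = gpu_hourly_rate * gpu_count * HOURS_PER_MONTH
    return fixed_monthly_cost / api_price_per_million


# Hypothetical inputs: one GPU at $1.50 per hour versus an API that
# charges a blended $10 per million tokens.
print(break_even_millions(1.50, 1, 10.0))
```

The result only means something if the rented GPUs can actually serve that volume. If they cannot, the GPU count rises and the break-even point moves with it.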

Throughput on a single stream is largely memory-bandwidth bound during decode: each generated token requires reading the model's weights from GPU memory, so tokens per second is roughly the GPU's memory bandwidth divided by the model's size in bytes (active parameters times bytes per parameter), discounted for overhead. Quantizing the model shrinks the bytes per parameter, which both fits a bigger model in memory and raises throughput proportionally: a 4-bit (Q4) model decodes roughly three to four times faster than the same model in FP16, depending on how much per-block metadata the quantized format stores.
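The bandwidth formula above translates directly into code. In this sketch the bytes-per-parameter table uses nominal sizes (an assumption, since real quantized formats carry some extra metadata), and the GPU bandwidth and model size are hypothetical.

```python
# Nominal bytes per parameter for each quantization level (assumed values).
BYTES_PER_PARAM = {"FP16": 2.0, "Q8": 1.0, "Q4": 0.5}

OVERHEAD_FACTOR = 0.8  # the discount used in the formula above


def decode_tokens_per_second(bandwidth_gb_s: float,
                             active_params_billions: float,
                             quantization: str) -> float:
    """Estimate single-stream decode speed from memory bandwidth."""
    # Billions of parameters times bytes per parameter gives the gigabytes
    # of weights that must be read for every generated token.
    model_gb = active_params_billions * BYTES_PER_PARAM[quantization]
    return bandwidth_gb_s / model_gb * OVERHEAD_FACTOR


# Hypothetical GPU with 1000 GB/s of memory bandwidth, 8B dense model.
for quant in BYTES_PER_PARAM:
    rate = decode_tokens_per_second(1000.0, 8.0, quant)
    print(f"{quant}: {rate:.1f} tok/s")
```

Because the estimate is inversely proportional to the bytes read per token, the ratio between two quantization levels is just the ratio of their bytes per parameter.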

This estimate assumes single-stream decode, which is conservative: production servers batch many requests together and can achieve far higher aggregate throughput per GPU, lowering the real cost per token. Treat the result as a planning floor for throughput and a ceiling for GPU cost. It ignores engineering time, reliability, and the fixed overhead of running infrastructure, which are real reasons teams often stay on an API even past the raw break-even point.

Common use cases

  • Build-vs-buy decisions. See whether your volume justifies running your own GPUs.
  • Quantization trade-offs. Watch how Q4 vs FP16 changes throughput and GPU count.
  • GPU selection. Compare an RTX 4090 against an A100 or H100 for your workload.
  • Break-even planning. Find the monthly token volume where self-hosting starts to pay off.

Frequently asked questions

How is throughput estimated?

Single-stream decode is mostly limited by memory bandwidth: tokens per second ≈ GPU memory bandwidth ÷ (active parameters × bytes per parameter) × 0.8 for overhead. This is conservative — batched serving achieves much higher aggregate throughput, so real per-token cost when self-hosting is usually lower than this estimate suggests.

Why does utilization matter so much?

You pay for rented GPU time whether or not it is generating tokens. At 60% utilization you waste 40% of what you pay for, so the effective cost per token is about 1.67 times what it would be at full utilization (1 ÷ 0.6). Keeping GPUs busy with batching is what makes self-hosting cheaper than an API at scale.

What counts as active parameters?

For a dense model it is the full parameter count. For a mixture-of-experts model only a fraction of parameters are active per token, so enter that active count — it is what determines decode speed and memory traffic, not the total parameter count.
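To make the distinction concrete, the sketch below compares the weights a mixture-of-experts model keeps in memory with the weights it reads per token. The model shape (47B total, 13B active) and the 0.5 bytes per parameter are illustrative assumptions, and the sketch assumes all experts are kept resident on the GPU, which is the usual setup.

```python
def weight_gb(params_billions: float, bytes_per_param: float) -> float:
    """Gigabytes of weights for a given parameter count."""
    return params_billions * bytes_per_param


# Hypothetical MoE model: 47B parameters in total, 13B active per token,
# stored at an assumed 0.5 bytes per parameter (nominal 4-bit).
TOTAL_B, ACTIVE_B, BYTES = 47.0, 13.0, 0.5

resident_gb = weight_gb(TOTAL_B, BYTES)         # held in GPU memory
read_per_token_gb = weight_gb(ACTIVE_B, BYTES)  # read for each generated token

print(f"resident weights: {resident_gb:.1f} GB")
print(f"read per token:   {read_per_token_gb:.1f} GB")
```

Only the second figure feeds the throughput estimate, which is why the calculator asks for the active count. The first figure still decides whether the model fits on the GPU at all.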

Does this include input (prompt) processing?

The GPU-time estimate is based on output tokens, since decode usually dominates wall-clock time and is bandwidth-bound. Prompt processing is compute-bound and much faster per token, so it is folded into the overhead factor rather than modelled separately. Both input and output tokens are still counted for the API cost comparison.
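On the API side of the comparison, both token types are billed. A minimal sketch, with a hypothetical function name and made-up volumes and prices:

```python
def api_monthly_cost(input_millions: float,
                     output_millions: float,
                     input_price_per_million: float,
                     output_price_per_million: float) -> float:
    """Monthly API bill: input and output tokens are priced separately."""
    return (input_millions * input_price_per_million
            + output_millions * output_price_per_million)


# Hypothetical workload: 2000M input tokens at $0.50 per million and
# 500M output tokens at $1.50 per million.
print(api_monthly_cost(2000, 500, 0.50, 1.50))
```

A prompt-heavy workload can therefore carry a large API bill even when its output volume, which is the only volume that drives the GPU estimate, is small.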

Are the GPU prices accurate?

They are representative community-cloud on-demand rates and vary widely by provider, region, and commitment. Hyperscaler prices are often several times higher, while spot and reserved pricing is usually lower. Edit the GPU $/hr field to match your actual cost.