Self-Hosting vs API Cost Calculator
Work out whether it is cheaper to self-host an open model on rented GPUs or to pay a per-token API. Enter your monthly token volume and the API price, pick a model size, quantization, and GPU, and the calculator estimates GPU throughput, the number of GPUs you would need, the self-hosted monthly cost, and the volume at which self-hosting overtakes the API. All calculation runs in your browser.
Throughput is a memory-bandwidth estimate for single-stream decode (tok/s ≈ bandwidth ÷ (active params × bytes/param) × 0.8); real serving with batching can be much higher. GPU rates are representative community-cloud on-demand prices — edit to match your host.
How to use the Self-Hosting vs API Cost Calculator
Enter your monthly input and output token volumes in millions and the per-million API prices you would otherwise pay. Then describe the self-hosted setup: the model's active parameter count, the quantization you would run, the GPU, its hourly rental rate, and how busy you expect to keep it. The calculator estimates decode throughput, the number of GPUs required to meet your output volume, the self-hosted monthly bill, and the break-even token volume where the two approaches cost the same.
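The steps described above can be sketched in a few lines. This is an illustration, not the calculator's actual code: the function names, the 730-hour month, and the rule that GPU count is sized on output tokens only are all assumptions.

```python
import math

# Assumption: an average month of 730 hours, with GPUs rented around the clock.
HOURS_PER_MONTH = 730
SECONDS_PER_MONTH = HOURS_PER_MONTH * 3600


def api_monthly_cost(input_mtok, output_mtok, price_in_per_mtok, price_out_per_mtok):
    """Monthly API bill; volumes in millions of tokens, prices per million tokens."""
    return input_mtok * price_in_per_mtok + output_mtok * price_out_per_mtok


def gpus_needed(output_mtok, tok_per_s_per_gpu, utilization):
    """GPUs required to generate the monthly output volume.

    utilization is the fraction of rented time (0-1] spent generating.
    """
    tokens = output_mtok * 1_000_000
    capacity_per_gpu = tok_per_s_per_gpu * utilization * SECONDS_PER_MONTH
    return max(1, math.ceil(tokens / capacity_per_gpu))


def self_hosted_monthly_cost(gpu_count, hourly_rate):
    """Rental bill: every GPU is paid for all month, busy or not."""
    return gpu_count * hourly_rate * HOURS_PER_MONTH


# Illustrative inputs only, not real prices or benchmarks.
api = api_monthly_cost(500, 100, 0.5, 1.5)
gpus = gpus_needed(100, tok_per_s_per_gpu=100, utilization=0.5)
hosted = self_hosted_monthly_cost(gpus, hourly_rate=2.0)
print(f"API ${api:.2f}/month vs self-hosted ${hosted:.2f}/month on {gpus} GPU(s)")
```

Comparing the two monthly figures for your own inputs is the whole decision; the sections below explain where the throughput number comes from.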
Utilization is the key lever for self-hosting: a GPU you rent around the clock but keep busy only 60% of the time costs the same as one used fully, so low utilization inflates the effective per-token cost. Larger batches and higher utilization are what make self-hosting competitive at scale.
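To make the utilization effect concrete, here is a minimal sketch of the effective cost per million tokens. The function name and the example numbers are hypothetical and are not taken from the calculator.

```python
def cost_per_million_tokens(hourly_rate, tok_per_s, utilization):
    """Effective self-hosted cost per million generated tokens.

    The hourly rate is paid regardless of load, so only the fraction of
    each hour spent generating (utilization, 0-1] produces tokens.
    """
    tokens_per_hour = tok_per_s * 3600 * utilization
    return hourly_rate / tokens_per_hour * 1_000_000


# Illustrative inputs: a $2.00/hour GPU decoding 100 tok/s.
for utilization in (1.0, 0.6, 0.3):
    cost = cost_per_million_tokens(2.0, 100, utilization)
    print(f"utilization {utilization:.0%}: ${cost:.2f} per million tokens")
```

The cost scales with 1 / utilization, so running at 60% makes each token about 1.67 times as expensive as running at 100%.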
When self-hosting beats the API
Per-token APIs charge nothing when idle and scale to zero, which makes them the cheapest option at low and bursty volumes. Self-hosting flips the economics: you pay for GPU time whether or not you are generating, so for a given GPU and rental rate the cost per token is set by how busy you keep the hardware. There is therefore a break-even volume below which the API wins and above which a well-utilized GPU is cheaper.
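The break-even point can be sketched as the volume at which a fixed GPU bill equals a per-token API bill. This simplified version assumes a single blended API price per million tokens and a fixed GPU count; both are assumptions for illustration, as are the function names and the 730-hour month.

```python
# Assumption: an average month of 730 hours.
HOURS_PER_MONTH = 730


def break_even_mtok(hourly_rate, gpu_count, blended_api_price_per_mtok):
    """Monthly volume (millions of tokens) where the API bill equals the GPU bill."""
    fixed_gpu_cost = hourly_rate * gpu_count * HOURS_PER_MONTH
    return fixed_gpu_cost / blended_api_price_per_mtok


def cheaper_option(volume_mtok, hourly_rate, gpu_count, blended_api_price_per_mtok):
    """Return 'api' or 'self-host' for a given monthly volume."""
    threshold = break_even_mtok(hourly_rate, gpu_count, blended_api_price_per_mtok)
    return "self-host" if volume_mtok > threshold else "api"


# Illustrative inputs: one $2.00/hour GPU against a $2.00 per million token API.
print(break_even_mtok(2.0, 1, 2.0))       # break-even volume in millions of tokens
print(cheaper_option(100, 2.0, 1, 2.0))   # low volume
print(cheaper_option(1000, 2.0, 1, 2.0))  # high volume
```

The result only holds if the chosen GPU count can actually serve the break-even volume; if it cannot, more GPUs are needed and the threshold moves up with them.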
Throughput on a single stream is largely memory-bandwidth bound during decode: each generated token requires reading the model's weights from GPU memory, so tokens per second is roughly the GPU's memory bandwidth divided by the model's size in bytes (active parameters times bytes per parameter), discounted for overhead. Quantizing the model shrinks the bytes per parameter, which both lets a bigger model fit in memory and raises estimated throughput in proportion: a Q4 model decodes roughly three to four times faster than the same model in FP16, depending on how many bits per weight the format actually stores.
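The bandwidth estimate can be written out as a short sketch using the formula quoted at the top of the page (bandwidth ÷ (active params × bytes/param) × 0.8). The bytes-per-parameter table uses nominal sizes and is an assumption; real 4-bit formats store slightly more than 4 bits per weight, which lowers the speedup. The 1000 GB/s bandwidth is a round illustrative figure, not a specific GPU.

```python
# Assumption: nominal storage sizes; actual quantized formats add scale metadata.
BYTES_PER_PARAM = {"FP16": 2.0, "Q8": 1.0, "Q4": 0.5}


def decode_tok_per_s(bandwidth_gb_s, active_params_b, quant, efficiency=0.8):
    """Single-stream decode estimate.

    bandwidth_gb_s  : GPU memory bandwidth in GB/s
    active_params_b : active parameters in billions
    quant           : key into BYTES_PER_PARAM
    efficiency      : overhead discount (0.8 in the formula above)
    """
    model_gb = active_params_b * BYTES_PER_PARAM[quant]  # weights read per token
    return bandwidth_gb_s / model_gb * efficiency


# Illustrative inputs: an 8B-parameter model on a 1000 GB/s GPU.
for quant in ("FP16", "Q8", "Q4"):
    print(f"{quant}: {decode_tok_per_s(1000, 8, quant):.0f} tok/s")
```

Because the model size sits in the denominator, halving the bytes per parameter doubles the estimate, and a faster-memory GPU raises it in the same proportion.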
This estimate assumes single-stream decode, which is conservative: production servers batch many requests together and can achieve far higher aggregate throughput per GPU, lowering the real cost per token. Treat the result as a planning floor for throughput and a ceiling for GPU cost per token. It ignores engineering time, reliability, and the fixed overhead of running infrastructure, which are real reasons teams often stay on an API even past the raw break-even point.
Common use cases
- Build-vs-buy decisions. See whether your volume justifies running your own GPUs.
- Quantization trade-offs. Watch how Q4 vs FP16 changes throughput and GPU count.
- GPU selection. Compare an RTX 4090 against an A100 or H100 for your workload.
- Break-even planning. Find the monthly token volume where self-hosting starts to pay off.