Self-Hosting a Local LLM vs Paying for an API: Where's the Break-Even?

"Just self-host it, the API will bankrupt us." I have heard that sentence in three planning meetings, and it was wrong in two of them. The build-vs-buy call on LLM inference is not a vibe. It is a crossover between a roughly fixed monthly cost (a GPU running whether or not you use it) and a perfectly variable cost (you pay per token, only when you call the API). Whichever line is lower at your volume wins.

The trap is that almost everyone compares the wrong numbers. They put the sticker price of a 4090 next to a per-million-token API rate and call it a day. Those units do not even match. One is a one-time capital number, the other is a per-token operating rate, and bridging them is the whole problem.

This guide walks the math the way the Self-Hosting vs API Calculator computes it, so the numbers you reach for are ones you can reproduce in the tool instead of figures I made up. Every dollar amount below comes straight out of that calculator's default example. Change the inputs and the crossover moves.

The two cost curves you are actually comparing

An API bill is linear. Cost equals tokens times price. No tokens, no cost. The calculator computes it the obvious way:

API cost / month = (input_tokens × input_price + output_tokens × output_price) / 1,000,000
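As a minimal sketch, the formula above translates directly into a few lines of Python. The function and variable names are my own, not the calculator's; prices are assumed to be in dollars per million tokens, as in the formula.

```python
def api_cost_per_month(input_tokens, output_tokens, input_price, output_price):
    """Monthly API bill in dollars. Prices are dollars per million tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# The article's default example: 200M input and 50M output tokens,
# both priced at $0.30 per million.
default_bill = api_cost_per_month(200e6, 50e6, 0.30, 0.30)
print(f"API cost / month: ${default_bill:,.2f}")
```

Keeping input and output prices as separate arguments matters once they differ, which they usually do on real price sheets.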

A self-hosted bill is close to flat. A GPU costs the same per hour whether it serves one request or ten thousand, so your monthly number is mostly fixed:

self-host cost / month = GPUs × 730 hours × GPU_hourly_rate

That 730 is hours in an average month. The important word is GPUs, plural and rounded up. You cannot rent two-thirds of a card. The tool rents whole GPUs, so the self-host curve is not a smooth ramp, it is a staircase. One card carries you up to its throughput ceiling, then you jump to the cost of a second card the moment you exceed it.
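The staircase is easier to see in code. This is a sketch under my own assumptions: the function name is hypothetical, and the floor of one card (you cannot self-host on zero GPUs) is my reading rather than something the tool documents.

```python
import math

HOURS_PER_MONTH = 730  # average month, as in the formula above


def self_host_cost_per_month(gpu_demand, gpu_hourly_rate):
    """Monthly rental bill in dollars.

    gpu_demand is the fractional number of cards the workload needs;
    it is rounded up because only whole GPUs can be rented.
    Assumption: at least one card is always rented.
    """
    whole_gpus = max(1, math.ceil(gpu_demand))
    return whole_gpus * HOURS_PER_MONTH * gpu_hourly_rate


# Cost is flat within a step and jumps at each card boundary.
for demand in (0.1, 1.0, 1.3, 2.0, 2.01):
    print(f"demand={demand:<5} cost=${self_host_cost_per_month(demand, 1.50):,.0f}")
```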

So the real comparison is a straight line (API) against a step function (self-host). The break-even is where they cross. Below it the API is cheaper because you are not paying for idle silicon. Above it self-hosting wins because you have spread that fixed cost over enough tokens.

Bridging capex to the number the calculator wants

Here is the unit mismatch I mentioned, solved. The calculator asks for a GPU hourly rate, not a purchase price. If you rent from a cloud, you already have that number. If you own the card, you have to convert your one-time purchase into an effective hourly cost yourself:

effective $/hr = (card price + power + hosting over its life) / expected lifetime hours

Run a consumer card for two years at 24/7 and you have roughly 17,500 hours to amortize over; run it part time and the figure drops, which raises the effective hourly cost. Divide the purchase price plus lifetime power and hosting costs by the hours you will actually run it, and you get the number to type into the tool. That is the real version of "amortize the hardware": spread a fixed purchase across the hours you expect to use it, then compare that hourly cost head to head with a rental rate.
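Here is one way to sketch that conversion. Everything numeric in the example call is a placeholder I picked for illustration (card price, power draw, electricity rate, zero hosting cost), not a quote or a calculator default, and the parameter names are hypothetical.

```python
HOURS_PER_YEAR = 24 * 365


def effective_hourly_rate(card_price, avg_watts, price_per_kwh,
                          hosting_per_month, years, duty_cycle):
    """Amortized dollars per running hour for a card you own.

    duty_cycle is the fraction of wall-clock time the card actually
    runs (1.0 means 24/7). Assumption: power is billed only while running,
    hosting is billed every month regardless.
    """
    lifetime_hours = years * HOURS_PER_YEAR * duty_cycle
    power_cost = avg_watts / 1000 * lifetime_hours * price_per_kwh
    hosting_cost = hosting_per_month * 12 * years
    return (card_price + power_cost + hosting_cost) / lifetime_hours


# Placeholder inputs: a $2,000 card drawing 350 W at $0.15/kWh, no hosting fee.
full_time = effective_hourly_rate(2000, 350, 0.15, 0, years=2, duty_cycle=1.0)
part_time = effective_hourly_rate(2000, 350, 0.15, 0, years=2, duty_cycle=0.25)
print(f"24/7: ${full_time:.3f}/hr   quarter time: ${part_time:.3f}/hr")
```

The part-time figure comes out higher because the same purchase price is spread over fewer hours, which is the effect described above.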

One caveat on the presets. The calculator's GPU menu mixes consumer cards (3090, 4090, 5090) with data-center parts (A100, H100, L40S). The amortization story is clean for the consumer cards, which people genuinely buy and keep under a desk. The data-center cards in the list are almost always rented, so for those, just take the hourly rate as given. The built-in rates are representative community-cloud defaults, not fixed market prices. Real quotes vary a lot by provider and commitment, so override them with a quote you can actually get. The GPU Cloud Price Comparison is a sane starting point for current rental rates.

Throughput is the hidden variable that sets the staircase

Whether one GPU is enough depends on how fast it generates tokens, and that is bounded by memory bandwidth, not raw compute, for single-stream generation. The calculator estimates it as:

tokens/sec ≈ memory_bandwidth / (active_params × bytes_per_param) × 0.8

The 0.8 is a fudge for real-world inefficiency. The intuition: to emit one token, the GPU has to read every active weight out of memory once. A smaller or more aggressively quantized model means fewer bytes to move per token, so more tokens per second. This is why quantization shows up here at all, not just in the VRAM math. Q4_K_M weights move under a third the bytes of FP16, so by this estimate they generate more than three times as fast on the same card. If fitting the model in memory is your worry rather than speed, that is a separate calculation, and the LLM VRAM Calculator handles it.
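A rough sketch of that estimate follows. The two constants are my assumptions, not values confirmed from the calculator: 2,039 GB/s is the spec-sheet bandwidth of the A100 80GB SXM part, and 0.58 bytes per parameter is an approximation for Q4_K_M. Together they happen to land on the ≈ 352 tok/s figure this article quotes for an 8B model.

```python
def tokens_per_second(bandwidth_gb_s, active_params_billions,
                      bytes_per_param, efficiency=0.8):
    """Single-stream, bandwidth-bound throughput estimate."""
    # Billions of params times bytes per param gives GB read per token.
    gb_per_token = active_params_billions * bytes_per_param
    return bandwidth_gb_s / gb_per_token * efficiency


A100_80GB_BANDWIDTH_GB_S = 2039   # assumption: SXM spec-sheet figure
Q4_K_M_BYTES_PER_PARAM = 0.58     # assumption: approximate average for Q4_K_M
FP16_BYTES_PER_PARAM = 2.0

q4_speed = tokens_per_second(A100_80GB_BANDWIDTH_GB_S, 8, Q4_K_M_BYTES_PER_PARAM)
fp16_speed = tokens_per_second(A100_80GB_BANDWIDTH_GB_S, 8, FP16_BYTES_PER_PARAM)
print(f"8B Q4_K_M: {q4_speed:.0f} tok/s   8B FP16: {fp16_speed:.0f} tok/s")
```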

Two limits on that throughput number. It is a single-stream estimate. Real serving stacks like vLLM batch many requests together and push aggregate throughput much higher, which moves break-even in self-hosting's favor. And it is an estimate, not a benchmark, so treat it as a planning figure. To pressure-test it against measured numbers, use the Tokens per Second Calculator.

Throughput times seconds in a month times your utilization gives the token capacity of one GPU. When your monthly output exceeds that, the tool adds a second card, and your cost steps up.
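That capacity rule can be sketched as follows. The function names are hypothetical, and the 352 tok/s input is the rounded figure from this article's default example, so the result differs slightly from a calculation on the unrounded value.

```python
import math

SECONDS_PER_MONTH = 730 * 3600  # same 730-hour month as the cost formula


def monthly_capacity(tokens_per_sec, utilization):
    """Output tokens one GPU can produce in a month at a given utilization."""
    return tokens_per_sec * SECONDS_PER_MONTH * utilization


def gpus_needed(output_tokens, tokens_per_sec, utilization):
    """Whole cards required. Assumption: never fewer than one."""
    capacity = monthly_capacity(tokens_per_sec, utilization)
    return max(1, math.ceil(output_tokens / capacity))


capacity = monthly_capacity(352, 0.60)  # about 555M with the rounded 352 tok/s
print(f"capacity per GPU: {capacity / 1e6:.0f}M tokens/month")
print("GPUs for 50M:", gpus_needed(50e6, 352, 0.60))
print("GPUs for 730M:", gpus_needed(730e6, 352, 0.60))
```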

A worked example with the tool's defaults

Load the calculator and these are the values already filled in: 200M input and 50M output tokens a month, both priced at $0.30 per million, an 8B model at Q4_K_M, served on an A100 80GB at the default $1.50/hr and 60% utilization. Here is what falls out, and you can reproduce every figure:

Metric | Value
API cost / month | $75
Throughput / GPU | ≈ 352 tok/s
GPUs needed | 1
Self-host cost / month | $1,095
Break-even output volume | ≈ 730M tok/mo

The API wins by a factor of about 14.6. That is not a quirk of these inputs, it is the typical result for small and medium volume. At 50M output tokens a month you are using under 10% of a single A100's capacity, so you would be paying $1,095 to rent a card whose capacity goes more than 90% unused. The API charges you only for the 50M tokens you actually generate.

The break-even tells you the rest of the story: API and self-host cost the same at roughly 730M output tokens a month. You would need to do more than 14 times your current output before the rented card pays for itself. For a hobby project or an internal tool, that volume never arrives. For a product serving constant traffic, it can arrive fast.

The break-even number has a sharp edge

Read the break-even precisely or you will state it backwards. It is the monthly output volume at which the API bill would equal your current rented-GPU bill, holding the input-to-output ratio fixed. Below that volume, the API is cheaper. Above it, self-hosting is cheaper.
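One way to read that definition in code: with the input-to-output ratio held fixed, the API bill scales in proportion to output volume, so break-even is the current output scaled by the ratio of the two bills. This is my interpretation of the definition above, not the calculator's source, and the function name is hypothetical.

```python
def break_even_output_tokens(self_host_cost, api_cost, output_tokens):
    """Monthly output volume at which the API bill equals the GPU bill.

    Assumes the input-to-output ratio stays fixed, so the API bill is
    proportional to output volume, and the GPU count does not change.
    """
    return output_tokens * self_host_cost / api_cost


# Default example: $1,095 self-host, $75 API, 50M output tokens.
volume = break_even_output_tokens(1095.0, 75.0, 50e6)
print(f"break-even: {volume / 1e6:.0f}M output tokens/month")
```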

But the staircase complicates the headline number. In the default example, break-even lands at 730M output tokens a month, while a single A100 at 60% utilization only has the capacity for about 554M output tokens a month. So the moment you cross break-even, you have already blown past what one card can serve. You would need a second A100, and your self-host cost steps from $1,095 to $2,190. The calculator's headline break-even holds the GPU count fixed, so it is a local crossover, not a smooth promise that everything past 730M is cheaper.
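To see the staircase against the API line, here is a small sketch that recomputes both bills at several volumes using only the default-example figures quoted in this article. The constant names are mine, and the per-card capacity is the rounded 554M from the text.

```python
import math

GPU_MONTHLY_COST = 1095.0                 # one A100 at $1.50/hr for 730 hours
GPU_CAPACITY = 554e6                      # output tokens/month per card, from the text
API_COST_PER_OUTPUT_TOKEN = 75.0 / 50e6   # default bill per output token, ratio fixed


def compare(output_tokens):
    """Return (gpus, self_host_cost, api_cost) at a monthly output volume."""
    gpus = max(1, math.ceil(output_tokens / GPU_CAPACITY))
    self_host = gpus * GPU_MONTHLY_COST
    api = output_tokens * API_COST_PER_OUTPUT_TOKEN
    return gpus, self_host, api


for volume in (50e6, 554e6, 730e6, 800e6, 1108e6):
    gpus, self_host, api = compare(volume)
    print(f"{volume / 1e6:>6.0f}M  gpus={gpus}  "
          f"self-host=${self_host:,.0f}  api=${api:,.0f}")
```

With these particular defaults, a fully loaded card works out to roughly $1.98 per million output tokens against about $1.50 on the API side, so under the single-stream estimate the staircase stays above the API line at every volume in the loop. Batched serving, higher utilization, or a cheaper GPU rate would change that picture.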

The practical lesson: never reason about self-hosting as a clean line. Reason about it one card at a time. Each GPU is cheap right up to its throughput ceiling, then the next one resets the math. The economics reward running each card hot. Push utilization up and the same fixed cost spreads over more tokens, which is the single biggest lever you control after picking the model.

How to actually use the calculator to make the call

Plug in your own numbers in this order, because each one moves the result more than the last:

  • Real monthly token volume. Pull input and output tokens from your last full month of API usage. Guessing here makes everything downstream meaningless. If you do not have API history yet, the API side is your cheap way to find out.
  • Your actual API prices. Input and output are priced separately and the gap is often large, so do not average them. To sanity-check a provider's blended bill, the LLM Cost Calculator breaks it down per model.
  • The model you would self-host, with quantization. An 8B at Q4 and a 70B at FP16 are completely different throughput and break-even stories.
  • A GPU rate you can really get, either a cloud quote or your amortized hourly cost from the capex formula above.
  • Realistic utilization. This is where people lie to themselves. If your traffic is bursty, your real utilization might be 20%, not 60%, which roughly triples your effective cost per token. Set it to what you will actually sustain, not your peak.
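The utilization point in the last item can be checked with a short sketch. The function name is hypothetical; the $1.50/hr rate and 352 tok/s throughput are the default-example figures from this article.

```python
def cost_per_million_output_tokens(gpu_hourly_rate, tokens_per_sec, utilization):
    """Effective self-host cost per million output tokens on one card."""
    tokens_per_hour = tokens_per_sec * 3600 * utilization
    return gpu_hourly_rate / tokens_per_hour * 1_000_000


steady = cost_per_million_output_tokens(1.50, 352, 0.60)
bursty = cost_per_million_output_tokens(1.50, 352, 0.20)
print(f"60% utilization: ${steady:.2f}/M   20% utilization: ${bursty:.2f}/M")
print(f"ratio: {bursty / steady:.1f}x")
```

The hourly rate is fixed, so cost per token is inversely proportional to utilization: a third of the utilization means three times the cost per token.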

Then read three outputs together: the monthly cost on each side, the break-even volume, and the GPUs-needed count. If you are far below break-even, stay on the API and stop agonizing. If you are above it and your traffic is steady enough to keep a card busy, self-hosting is worth the operational cost. To compare per-token economics across several API models before you even get to the self-host question, the LLM Cost Comparator is the faster first stop.

What the math leaves out, and why it still matters

A break-even calculator answers a money question, and money is not the whole decision. Self-hosting carries costs that do not show up as dollars per hour: someone has to keep the serving stack alive, patch it, monitor it, and get paged when it falls over at 2am. That engineering time is real and it is not free. For a small team, the salaried hours spent babysitting an inference box can dwarf whatever you saved on tokens.

The API has its own non-money costs: rate limits, data leaving your network, and a vendor who can change prices or deprecate your model. If those are dealbreakers for compliance or latency reasons, you might self-host below break-even on purpose and accept the premium. The point of the calculator is to tell you exactly how big that premium is, so you are choosing it with eyes open instead of discovering it on the invoice.

Run your own numbers in the Self-Hosting vs API Calculator. It is client-side, nothing you type leaves your browser, and it will give you a defensible crossover point in about a minute. That number, plus a clear-eyed accounting of the operational overhead, is the whole decision.

Frequently asked questions

Is self-hosting an LLM cheaper than using an API?

Only above a break-even volume. A GPU costs the same per hour whether busy or idle, so at low or medium volume the per-token API almost always wins. In the calculator's default example (50M output tokens a month, an 8B model on an A100), the API costs $75 versus $1,095 to self-host, and break-even sits near 730M output tokens a month. Self-hosting pays off only when you sustain high volume on a card you keep busy.

How do I turn a GPU's purchase price into the hourly rate the calculator wants?

Divide the total cost of ownership by the hours you expect to run it: (card price + power + hosting over its life) / lifetime hours. A consumer card run 24/7 for two years gives roughly 17,500 hours to amortize over; run it part time and that drops, raising the hourly figure. That effective hourly cost is what you type into the GPU $/hr field so it compares directly against a cloud rental rate.

Why does the calculator rent whole GPUs instead of fractions?

Because that is how real infrastructure works. You cannot rent two-thirds of a card, so cost is lumpy, not smooth. One GPU carries you up to its throughput ceiling, then the next request forces a second card and the monthly cost steps up. That staircase is why break-even is a local crossover at the current GPU count, not a clean straight line.

What sets how many tokens per second a GPU can produce?

For single-stream generation it is mostly memory bandwidth, not raw compute. The model streams through memory once per token, so fewer bytes per token means more tokens per second. That is why quantization speeds things up: Q4 weights move under a third the bytes of FP16. The calculator's estimate is single-stream, though. Batched serving with vLLM pushes real aggregate throughput much higher.

My break-even volume is higher than one GPU can serve. What does that mean?

It means you would need a second GPU before the first one pays for itself, so your self-host cost steps up before you ever reach the headline break-even rather than staying flat. The headline break-even holds GPU count fixed, so always cross-check it against the GPUs-needed output. Reason about self-hosting one card at a time, not as a single line.

Should cost be the only factor in the self-host vs API decision?

No. Self-hosting adds operational overhead (patching, monitoring, on-call) that does not appear as dollars per hour, and the API adds rate limits, data egress, and vendor risk. Use the calculator to find the exact dollar premium of each choice, then weigh it against those non-money factors. Sometimes paying more to self-host for compliance or latency is the right call.
