Assistants / Thread Cost Estimator

Estimate the true cost of a multi-turn chat session, including the often-surprising effect of history accumulation. At each turn the full conversation history is re-sent as input, so costs grow superlinearly with turns. Set your model, system prompt size, message lengths, turn count, and history window to see per-turn and total costs.

How to use the Assistants / Thread Cost Estimator

Set the model and the four token size inputs, then choose a history window:

  • System prompt tokens — the fixed instruction block prepended to every API call. Estimate via the LLM cost calculator.
  • Avg user msg tokens — average tokens in each user turn.
  • Avg assistant reply tokens — average tokens generated per turn. This is the output cost driver and also accumulates in history.
  • Number of turns — total turns in the conversation (one turn = one user message + one assistant reply).
  • History window (0 = all) — how many prior turns of context are sent with each new API call. 0 means the full conversation history is resent every turn. Setting N means only the last N turns are included. Reducing the window cuts input cost dramatically at the expense of the model forgetting earlier turns.

The result grid shows: total input tokens, total output tokens, total cost, per-turn average cost, and the input size of the final turn (the most expensive one).

Why multi-turn chat costs grow superlinearly

Stateless API design means that every call to a chat completion endpoint must include the full conversation history — the LLM has no memory between calls. At turn k, the input tokens are: system prompt + all prior user messages + all prior assistant replies (possibly truncated to a window) + the new user message. If you retain full history (window = 0), the input at turn k is O(k) tokens, so the cumulative cost across all turns is O(k²) — roughly quadratic in the number of turns. A 20-turn conversation with 300-token average messages sends roughly 6x more input tokens in total than if each call were independent. See the LLM cost calculator for single-call estimates.

The history window parameter mirrors what most production assistants implement via "sliding window" or "message trimming" strategies. By keeping only the last N turns, you cap the per-call input size at approximately sys + N*(umsg+amsg) + umsg tokens per call, which makes costs O(k) rather than O(k²) — linear instead of quadratic. The trade-off is that the model loses access to earlier turns, which can cause it to forget user preferences, agreed facts, or prior tool results. Finding the right window size is a key product engineering decision for chat-based products.

Common use cases

  • Budget planning for chat products — estimate monthly API spend for a given average conversation length and expected user volume before launch.
  • Window size optimisation — compare full-history vs windowed-history cost at 20, 30, 50 turns to find the break-even point for your quality requirements.
  • Model selection — compare gpt-5-mini vs claude-haiku vs claude-sonnet across realistic conversation parameters to choose the right cost-quality tradeoff.
  • Pricing model design — if you charge users per conversation, use this to calculate your cost per conversation and set a margin-positive price.
  • System prompt optimisation — see the dollar impact of reducing system prompt size from 2000 to 500 tokens across 10,000 daily conversations.

Frequently asked questions

Why does cost grow faster than linearly with turns?

Because every API call includes the full conversation history so far. Turn 1 sends ~sys+umsg tokens; turn 10 sends sys + 9*(umsg+amsg) + umsg tokens. The cumulative input across 10 turns is much larger than 10x the first turn.

Does the history window setting simulate a real API feature?

No — history windowing is application-level logic you implement yourself. You keep a list of messages and trim it to the last N turns before sending to the API. The API itself receives only what you send; it has no memory state.

Are caching or prompt caching discounts included?

No. This tool uses list prices for uncached tokens. Claude and GPT-4o both offer prompt caching (repeated prefixes are served at a lower rate). In practice, the system prompt portion is often cached, which can reduce real costs by 20-50% on long conversations. Enable prompt caching in your API client and benchmark the savings.

What are the units for model rates?

Per-million tokens. A rate of $1.25/M input means 1,000,000 input tokens cost $1.25. If your final turn sends 5,000 input tokens, the input cost for that turn is 5,000 / 1,000,000 × $1.25 = $0.00625.

How do I count tokens for my actual messages?

Use the OpenAI token counter for GPT models. For Claude, the Anthropic API provides a token-counting endpoint. As a rough rule, 1 token ≈ 0.75 English words.