Vision Token Calculator

Images are billed as tokens, and every provider counts them differently. Enter an image's width and height to see how many input tokens it costs on GPT-4o, Claude, and Gemini, each computed from that provider's documented counting rules, and convert the result into a dollar figure. Useful when a vision-heavy app's bill is dominated by images rather than text. All calculations run in your browser.

Token counts follow each provider's documented rules: OpenAI uses 512px tiles after resizing, Claude uses approximately (w×h)/750 after capping at 1568px on the long edge and about 1.15MP in area, and Gemini uses 768px crops at 258 tokens each (a flat 258 for images up to 384×384px). GPT-4o-mini multiplies the OpenAI count by a model-specific factor.

How to use the Vision Token Calculator

Type the image's pixel dimensions and how many images you send per request (or over the billing period you are estimating). For OpenAI, choose the detail level: low is a flat 85 tokens regardless of size, while high tiles the image and costs far more for large pictures. The table updates with the token count for each provider.

To get a dollar figure, pick the provider you are billing on and enter its input price per million tokens — image tokens are charged at the same input rate as text. The cost line multiplies the provider's per-image token count by your image count and price. Because the three providers tile images so differently, the cheapest one for a given image size is not always obvious; the table lets you compare at a glance.
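The cost line described above is a single multiplication. A minimal sketch follows, assuming a hypothetical helper name `image_cost_usd` and a placeholder price of $2.50 per million input tokens, which is not any provider's actual rate:

```python
def image_cost_usd(tokens_per_image: int, num_images: int,
                   price_per_million: float) -> float:
    """Dollar cost of sending `num_images` images at the input-token rate."""
    total_tokens = tokens_per_image * num_images
    return total_tokens * price_per_million / 1_000_000


# 10,000 images at 765 tokens each, with a placeholder price of $2.50/M tokens.
cost = image_cost_usd(765, 10_000, 2.50)
print(f"${cost:.3f}")
```

Swap in the per-image token count for the provider you are billing on and that provider's current input price.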

How each provider counts image tokens

OpenAI (GPT-4o) uses a tile model. In low detail an image is a flat 85 tokens. In high detail the image is first scaled to fit within 2048×2048, then scaled so its shortest side is 768px, then divided into 512×512 tiles. The cost is 85 base tokens plus 170 per tile. A 1024×1024 image, for example, is scaled to 768×768, covers a 2×2 grid of tiles, and works out to 85 + 4 × 170 = 765 tokens. GPT-4o-mini uses the same tiling but multiplies the result by a large model-specific factor, so the same image consumes far more tokens on the mini model.
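The GPT-4o steps can be sketched in a few lines of Python. This is an illustration rather than OpenAI's implementation: the function name is hypothetical, and it assumes both resize steps only ever shrink an image (small images are not upscaled). The GPT-4o-mini multiplier is left out because its value is not given here.

```python
import math


def openai_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate GPT-4o image tokens from the tiling rules described above."""
    if detail == "low":
        return 85  # flat cost regardless of size

    w, h = float(width), float(height)

    # Step 1: scale down to fit within 2048x2048 (assumption: never upscale).
    scale = min(1.0, 2048 / max(w, h))
    w, h = w * scale, h * scale

    # Step 2: scale down so the shortest side is 768px (assumption: never upscale).
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale

    # Step 3: count the 512x512 tiles needed to cover the result.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles


print(openai_image_tokens(1024, 1024))         # 2x2 tiles
print(openai_image_tokens(1024, 1024, "low"))  # flat low-detail cost
```

Here `openai_image_tokens(1024, 1024)` returns 765, matching the worked figure in the text.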

Claude uses a simple area formula: tokens are approximately width times height divided by 750. Before counting, Claude resizes any image whose long edge exceeds 1568px or whose area exceeds about 1.15 megapixels down to those limits, so very large images are capped rather than billed in full. A 1092×1092 image is near Claude's effective maximum at roughly 1590 tokens; smaller images cost proportionally less.
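The area formula and the two caps can be sketched as follows. The function and parameter names are hypothetical, and the area limit is taken literally as 1,150,000 pixels even though the text only gives it as "about 1.15 megapixels", so treat the output near the cap as approximate.

```python
import math


def claude_image_tokens(width: int, height: int,
                        max_edge: int = 1568,
                        max_pixels: int = 1_150_000) -> int:
    """Estimate Claude image tokens as (w * h) / 750 after size capping."""
    # Shrink (never enlarge) so that the long edge and the total area
    # both fall within their limits; the stricter limit wins.
    scale = min(
        1.0,
        max_edge / max(width, height),
        math.sqrt(max_pixels / (width * height)),
    )
    w, h = width * scale, height * scale
    return round(w * h / 750)


print(claude_image_tokens(1000, 1000))  # below both caps
print(claude_image_tokens(4000, 3000))  # capped by area
```

With these assumed limits the ceiling is about 1,533 tokens. Because the area limit is only approximate, counts near the cap can differ by a few percent from what the provider reports.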

Gemini tiles too, but with 768px crops worth 258 tokens each. An image where both sides are 384px or smaller is a flat 258 tokens; anything larger is divided into 768px tiles, each adding 258 tokens. The practical upshot is that the three schemes diverge sharply: small thumbnails are cheapest on OpenAI's low detail, mid-size images are often cheapest on Claude, and the choice of provider for a high-volume vision workload can change the bill several times over. Always confirm against each provider's current documentation, since these formulas are periodically revised.
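The Gemini rule can be sketched the same way. The function name is hypothetical, and the sketch assumes a plain grid of 768px tiles with no rescaling beforehand; Gemini's actual crop selection may differ, so the result is an estimate.

```python
import math


def gemini_image_tokens(width: int, height: int) -> int:
    """Estimate Gemini image tokens from the 768px tile rule described above."""
    # Small images (both sides <= 384px) are a flat 258 tokens.
    if width <= 384 and height <= 384:
        return 258

    # Assumption: larger images are covered by a simple grid of 768x768 tiles.
    tiles = math.ceil(width / 768) * math.ceil(height / 768)
    return 258 * tiles


print(gemini_image_tokens(300, 300))    # flat small-image cost
print(gemini_image_tokens(1024, 1024))  # 2x2 tiles
```

Under this assumption a 1024×1024 image needs a 2×2 grid, so it costs 4 × 258 = 1032 tokens, compared with 765 on GPT-4o high detail.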

Common use cases

Budgeting a vision app. Estimate the per-image token cost before you process thousands of screenshots, receipts, or photos.
Choosing a provider. Compare GPT-4o, Claude, and Gemini for your typical image size to find the cheapest per image.
Deciding on detail level. See how much OpenAI's low-detail mode saves when full resolution is not needed.
Sizing image preprocessing. Find the resolution at which downscaling stops reducing tokens because the provider already caps it.

Frequently asked questions

Why does the same image cost different tokens on each provider?

Each provider counts images with its own formula. OpenAI uses 512px tiles plus a base cost, Claude uses an area-based estimate (w × h / 750) after capping the size, and Gemini uses 768px crops worth 258 tokens each. So identical pixels map to very different token counts.

Are image tokens billed at the input rate?

Yes. Vision tokens are added to your input (prompt) tokens and billed at the model's input price. That is why this tool uses an input price per million tokens to compute the dollar cost.

Why is GPT-4o-mini so expensive for images?

GPT-4o-mini uses the same tiling as GPT-4o but multiplies the resulting image tokens by a large factor so that image cost is comparable in dollars to GPT-4o despite mini's lower text price. For image-heavy workloads, mini is often not the savings it appears to be on text alone.

Does downscaling an image always reduce cost?

Only up to a point. Each provider caps or resizes large images before counting, so beyond their internal maximum, sending a bigger image does not increase tokens. Below that, smaller images do cost fewer tokens — especially on OpenAI high detail and Gemini, where fewer tiles means real savings.

Are these formulas guaranteed accurate?

They reflect each provider's documented behavior, but the exact tiling and multipliers are revised from time to time and can vary by model version. Treat the counts as close estimates and confirm against the current API documentation for billing-critical work.

Embed this tool on your site

Free to embed, no attribution required (but appreciated). Paste this where you want the tool to appear:

<iframe src="https://codeswap.net/llm/vision-token-calculator/?embed=1" width="100%" height="520" loading="lazy" style="border:1px solid #e5e7eb;border-radius:8px" title="Vision Token Calculator"></iframe>
<p style="font-size:13px">Tool by <a href="https://codeswap.net/llm/vision-token-calculator/">Vision Token Calculator — Codeswap</a></p>