Stable Diffusion VRAM Calculator

Estimate the GPU memory needed to generate images locally with Stable Diffusion, SDXL, SD 3.5, or Flux. Image models split their VRAM between the model weights and the activations that grow with resolution and batch size. Pick a model, a precision, and a target resolution to see the total and which cards can run it. Calculated in your browser — nothing uploaded.

Model weights
Activations
Total VRAM
Verdict

Planning estimate. Real usage depends on the sampler, attention optimisation (xFormers / SDPA), VAE tiling, and whether the text encoder is offloaded to CPU.

How to use the Stable Diffusion VRAM Calculator

Pick the model you want to run — the calculator knows each one's weight size and native resolution. Choose a precision: FP16 is the default, FP8 roughly halves the weight memory, and NF4/INT4 quantization (popular for fitting Flux on smaller cards) cuts it further. Set your target width, height, and batch size.
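The precision choice can be sketched as simple arithmetic: parameters times bytes per parameter. This is a minimal sketch, not the calculator's actual code. The function name and the bytes-per-parameter table are assumptions, it covers the main network only (no text encoders or VAE), and it ignores the small overhead that 4-bit formats carry for quantization scales.

```python
# Approximate bytes stored per parameter at each precision (assumed values).
# NF4/INT4 is treated as exactly half a byte, ignoring quantization-scale overhead.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "nf4": 0.5}


def weight_gb(params_billion: float, precision: str) -> float:
    """Rough weight memory in GB (1 GB = 1e9 bytes) for the main network only."""
    return params_billion * BYTES_PER_PARAM[precision]


# Example: a 12-billion-parameter model at each precision.
for precision in BYTES_PER_PARAM:
    print(precision, weight_gb(12, precision), "GB")
```

For a 12-billion-parameter model this gives 24 GB at FP16, 12 GB at FP8, and 6 GB at NF4, before adding the text encoders, the VAE, and activations.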

The weights row is the model itself and barely changes with resolution. The activations row is the working memory the U-Net or transformer needs during sampling, and it grows with the pixel count and the batch size: doubling both width and height roughly quadruples it, because area scales with width times height. The verdict line tells you the smallest common GPU tier that fits. If you are tight, drop to a lower precision, lower the resolution and upscale afterward, or generate one image at a time.
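The verdict step could look something like the sketch below. The tier list, the function name, and the 1 GB headroom default are all hypothetical; the calculator's real tiers and margins are not stated on this page.

```python
from typing import Optional

# Hypothetical list of common GPU memory sizes in GB (an assumption, not the tool's list).
GPU_TIERS_GB = [8, 12, 16, 24, 32, 48, 80]


def smallest_fitting_tier(total_vram_gb: float, headroom_gb: float = 1.0) -> Optional[int]:
    """Return the smallest tier that holds the estimate plus some headroom, or None."""
    needed = total_vram_gb + headroom_gb
    for tier in GPU_TIERS_GB:  # tiers are sorted ascending, so the first hit is the smallest
        if tier >= needed:
            return tier
    return None


print(smallest_fitting_tier(10.5))  # 12
print(smallest_fitting_tier(11.5))  # 16, because the headroom pushes it past 12 GB
```

The headroom term is there because the desktop, the browser, and the framework itself usually hold some VRAM before generation starts.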

Where image-model VRAM goes

A diffusion image model spends VRAM in two main places. The first is the weights: the U-Net (in SD and SDXL) or the diffusion transformer (in SD 3.5 and Flux), plus one or more text encoders and a VAE. SD 1.5's U-Net is under a billion parameters, so it fits comfortably in a few gigabytes; SDXL's U-Net is larger, at roughly 2.6 billion parameters; Flux.1 is a 12-billion-parameter transformer whose weights alone need roughly 24 GB at FP16 (12 billion parameters at 2 bytes each), which is why so much effort goes into FP8 and NF4 quantization to bring it onto consumer cards.

The second is activations: the intermediate tensors computed at each denoising step. Diffusion works in a compressed latent space (the VAE downsamples by 8x on each side), but the attention layers still scale with the number of latent positions, so memory grows with the image area. Generating at 1536x1536 instead of 1024x1024 multiplies the area, and with it the activation memory, by about 2.25, and a batch of four multiplies that by four again. This is why a model that happily generates a single 1024px image can run out of memory at high resolution or with large batches.
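The scaling can be checked with a few lines of arithmetic. This sketch assumes activation memory is roughly linear in the number of latent positions, which is a fair approximation with memory-efficient attention; naive attention that materialises the full attention matrix grows faster. The function name is illustrative.

```python
VAE_DOWNSAMPLE = 8  # the VAE shrinks each side by 8x


def latent_positions(width: int, height: int, batch: int = 1) -> int:
    """Number of latent positions the denoiser processes per step."""
    return batch * (width // VAE_DOWNSAMPLE) * (height // VAE_DOWNSAMPLE)


base = latent_positions(1024, 1024)           # 128 * 128 = 16384
large = latent_positions(1536, 1536)          # 192 * 192 = 36864
batched = latent_positions(1536, 1536, batch=4)

print(large / base)    # 2.25
print(batched / base)  # 9.0
```

So a batch of four at 1536x1536 needs roughly nine times the working memory of a single 1024x1024 image under this linear assumption.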

Several techniques shift this trade-off. Attention optimisations such as xFormers or PyTorch's scaled-dot-product attention cut activation memory substantially. VAE tiling decodes the final image in pieces to avoid a large spike at the end. CPU offload moves components such as the text encoder and VAE off the GPU between stages, trading speed for a much smaller footprint (diffusers calls this whole-component variant model CPU offload, and its finer-grained, slower variant sequential CPU offload); that is how people run Flux on 8-12 GB cards. The numbers here assume standard optimised attention without aggressive offloading, so treat them as a sensible mid-to-high estimate.
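As an illustration of how these switches are typically turned on in code, here is a small helper for a Hugging Face diffusers-style pipeline. It is a sketch under assumptions: the helper name is made up, and it relies on the pipeline exposing `vae.enable_tiling()` and `enable_model_cpu_offload()`, which recent diffusers versions do as far as I know. The `hasattr` checks are there so it degrades quietly if your version or framework differs.

```python
def apply_memory_savers(pipe, offload: bool = True) -> list:
    """Enable memory-saving options on a diffusers-style pipeline.

    Skips any option the object does not provide and returns the names applied.
    """
    applied = []

    # VAE tiling: decode the final image in tiles to avoid a memory spike.
    vae = getattr(pipe, "vae", None)
    if vae is not None and hasattr(vae, "enable_tiling"):
        vae.enable_tiling()
        applied.append("vae_tiling")

    # Whole-component CPU offload: keep each sub-model on the GPU only while it runs.
    if offload and hasattr(pipe, "enable_model_cpu_offload"):
        pipe.enable_model_cpu_offload()
        applied.append("model_cpu_offload")

    return applied
```

Calling `apply_memory_savers(pipe)` after loading a pipeline returns the list of options that were actually enabled, which is handy when comparing measured VRAM against the estimate.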

Common use cases

  • Checking if Flux fits your card. See FP16 versus FP8 versus NF4 side by side to find a precision that loads on your GPU.
  • Choosing a generation resolution. Find the largest resolution your VRAM supports before you hit out-of-memory errors.
  • Planning batch generation. Estimate how many images you can generate in parallel within your memory budget.
  • Comparing models. Weigh SDXL against SD 3.5 or Flux for the hardware you actually have.
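For the batch-planning case, the underlying arithmetic is a floor division of the memory left after the weights by the per-image activation cost. The sketch below uses a hypothetical function name and made-up example figures, and it assumes activations grow linearly with batch size.

```python
def max_batch(budget_gb: float, weights_gb: float, activations_per_image_gb: float) -> int:
    """Largest batch whose activations fit in the VRAM left after loading the weights."""
    if activations_per_image_gb <= 0:
        raise ValueError("activations_per_image_gb must be positive")
    free_gb = budget_gb - weights_gb
    if free_gb < activations_per_image_gb:
        return 0  # not even one image fits alongside the weights
    return int(free_gb // activations_per_image_gb)


# Illustrative numbers only: a 24 GB card, 12 GB of weights, 2.5 GB per image.
print(max_batch(24, 12, 2.5))  # 4
```

A result of 0 means the weights leave no room for even a single image, which is the cue to try a lower precision or a smaller resolution.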

Frequently asked questions

Can I run Flux on a 12 GB card?

Often yes, with NF4 or FP8 quantization plus CPU offload of the text encoder and VAE. At full FP16 Flux needs roughly 24 GB. Set the precision to NF4 here to see the quantized weight footprint; real-world offloading can bring it lower still at the cost of speed.

Why does higher resolution cost so much more memory?

Activation memory scales with image area, which is width times height. Going from 1024 to 1536 on each side is a 1.5x increase per side but a 2.25x increase in pixel count, so the working memory the attention layers need grows faster than the resolution number itself suggests.

Does this include LoRAs or ControlNet?

No. LoRAs add a small amount of weight memory; ControlNet adds a second network and noticeably more. If you stack adapters, budget extra VRAM beyond this estimate.

How can I reduce VRAM if I am just over the limit?

Lower the precision, enable attention optimisation (xFormers or SDPA), turn on VAE tiling, generate one image at a time, or use CPU offload. Most of these trade some speed or throughput for a smaller footprint (optimised attention is the usual exception), and together they can roughly halve the requirement.

Are these numbers exact?

They are planning estimates. The real figure depends on your sampler, attention backend, offload settings, and framework (ComfyUI, Automatic1111, diffusers). Treat the total as a mid-to-high planning figure rather than a precise measurement.