vLLM serve Command Generator

Generate production-ready vllm serve commands without memorising every flag. Set the model path, GPU count, dtype, quantisation method, memory utilisation, and optional flags — the command rebuilds live and is ready to copy into your terminal or deployment script.

How to use the vLLM serve Command Generator

Fill in the fields and the command updates live:

  • Model — a HuggingFace Hub ID (meta-llama/Llama-3.1-8B-Instruct) or an absolute local path (/models/llama3). vLLM downloads Hub models automatically if HUGGING_FACE_HUB_TOKEN is set.
  • Tensor parallel — number of GPUs to shard the model across. Must divide the model's number of attention heads evenly. Use 1 for single-GPU, 2/4/8 for multi-GPU NVLink setups.
  • dtypeauto follows the model's config; bfloat16 is preferred on Ampere+ GPUs; float16 for older hardware. Mismatching dtype and quantisation (e.g. float16 + fp8) may error.
  • Quantisation — AWQ and GPTQ require a pre-quantised model; fp8 applies at runtime on Ada/Hopper hardware.
  • Max model len — sets the KV-cache capacity in tokens. Reduce this if you hit OOM at startup. Must not exceed the model's trained context length.
  • GPU mem utilisation — fraction of GPU VRAM to allocate for the KV-cache (the rest is used by model weights). 0.9 is a safe default; lower if you share the GPU.
  • --enforce-eager — disables CUDA graph capture; slower throughput but useful for debugging or when graph capture crashes on your GPU.
  • --trust-remote-code — required for models that ship custom model code (e.g., Qwen, Phi-3). Only enable for models you trust.
  • --enable-chunked-prefill — allows the prefill of long prompts to be chunked, improving GPU utilisation when mixing long and short requests.

What vLLM is and why the flags matter

vLLM is a high-throughput, memory-efficient inference engine for large language models. It implements PagedAttention — a KV-cache management scheme that treats GPU memory like virtual memory pages, eliminating fragmentation and enabling continuous batching across requests with different sequence lengths. This makes vLLM the go-to serving stack when you need to serve many concurrent users from a single model rather than running separate per-user processes.

The vllm serve subcommand launches an OpenAI-compatible HTTP server. Because it implements the /v1/chat/completions and /v1/completions endpoints, any client that works with the OpenAI Python SDK can point at a vLLM server by changing only the base_url. The flags control how the engine allocates GPU resources: tensor parallelism splits the model across multiple GPUs using Megatron-style sharding; --gpu-memory-utilization controls how much VRAM the KV-cache can consume versus model weights; and --max-model-len caps the longest sequence the server will accept.

Getting these flags wrong is the primary source of OOM errors and silent performance regressions. Too-high GPU memory utilisation leaves no room for the KV-cache of long prompts; too-low wastes expensive VRAM. Tensor parallel degree that does not divide the attention head count cleanly causes an error at load time. This generator surfaces the dependencies so you can tune them interactively before running the server.

Common use cases

  • Deploying open-weight models — generate the correct serve command for Llama 3, Qwen 2.5, Mistral, or any HuggingFace model without consulting the docs each time.
  • Multi-GPU scaling — quickly set tensor-parallel size to match your node topology (2-, 4-, or 8-GPU setups) and verify the command before launching.
  • Quantised model serving — combine AWQ/GPTQ quantised checkpoints with the right dtype flag and ensure the quantisation argument is present.
  • CI/CD deployment scripts — copy the generated command directly into a Dockerfile CMD, Kubernetes args, or a Bash launch script.
  • Memory debugging — toggle --enforce-eager to disable CUDA graph capture when diagnosing OOM crashes on new GPU hardware.

Frequently asked questions

What tensor-parallel sizes are valid?

The TP size must divide the model's number of key-value heads evenly. For Llama 3.1-8B (32 heads), valid TP sizes are 1, 2, 4, 8, 16, 32. For models with 8 GQA heads (like Llama 3.1-70B), valid values are 1, 2, 4, 8. vLLM will error at model-load time if the split is uneven.

What is the difference between AWQ and GPTQ quantisation?

Both are 4-bit weight quantisation formats but use different calibration algorithms. AWQ (Activation-aware Weight Quantisation) is generally faster at inference; GPTQ (Generative Pre-trained Transformer Quantisation) has a larger ecosystem of pre-quantised models. Both require the quantised checkpoint; neither is applied on-the-fly to a full-precision model.

How does --max-model-len interact with GPU memory?

The KV-cache size grows linearly with max-model-len. If you set it too high, vLLM may fail to allocate enough blocks at startup. Reduce max-model-len or increase gpu-memory-utilization if you see "No available memory for the cache blocks" errors.

Is --trust-remote-code safe?

Only for models you fully trust. The flag allows the model repository to execute arbitrary Python code during loading. For well-known open models (Qwen, Phi-3, InternLM), it is generally safe. Never enable it for an untrusted model checkpoint.

Does the generated command work on Windows?

vLLM requires Linux. It does not support Windows natively. You can use WSL2 on Windows with CUDA support, but production deployments are Linux-only (typically Ubuntu 22.04+ with CUDA 12.x).