vLLM serve Command Generator
Generate production-ready vllm serve commands without memorising every flag. Set the model path, GPU count, dtype, quantisation method, memory utilisation, and optional flags — the command rebuilds live and is ready to copy into your terminal or deployment script.
How to use the vLLM serve Command Generator
Fill in the fields and the command updates live:
- Model — a HuggingFace Hub ID (
meta-llama/Llama-3.1-8B-Instruct) or an absolute local path (/models/llama3). vLLM downloads Hub models automatically ifHUGGING_FACE_HUB_TOKENis set. - Tensor parallel — number of GPUs to shard the model across. Must divide the model's number of attention heads evenly. Use 1 for single-GPU, 2/4/8 for multi-GPU NVLink setups.
- dtype —
autofollows the model's config;bfloat16is preferred on Ampere+ GPUs;float16for older hardware. Mismatching dtype and quantisation (e.g. float16 + fp8) may error. - Quantisation — AWQ and GPTQ require a pre-quantised model; fp8 applies at runtime on Ada/Hopper hardware.
- Max model len — sets the KV-cache capacity in tokens. Reduce this if you hit OOM at startup. Must not exceed the model's trained context length.
- GPU mem utilisation — fraction of GPU VRAM to allocate for the KV-cache (the rest is used by model weights). 0.9 is a safe default; lower if you share the GPU.
- --enforce-eager — disables CUDA graph capture; slower throughput but useful for debugging or when graph capture crashes on your GPU.
- --trust-remote-code — required for models that ship custom model code (e.g., Qwen, Phi-3). Only enable for models you trust.
- --enable-chunked-prefill — allows the prefill of long prompts to be chunked, improving GPU utilisation when mixing long and short requests.
What vLLM is and why the flags matter
vLLM is a high-throughput, memory-efficient inference engine for large language models. It implements PagedAttention — a KV-cache management scheme that treats GPU memory like virtual memory pages, eliminating fragmentation and enabling continuous batching across requests with different sequence lengths. This makes vLLM the go-to serving stack when you need to serve many concurrent users from a single model rather than running separate per-user processes.
The vllm serve subcommand launches an OpenAI-compatible HTTP server. Because it implements the /v1/chat/completions and /v1/completions endpoints, any client that works with the OpenAI Python SDK can point at a vLLM server by changing only the base_url. The flags control how the engine allocates GPU resources: tensor parallelism splits the model across multiple GPUs using Megatron-style sharding; --gpu-memory-utilization controls how much VRAM the KV-cache can consume versus model weights; and --max-model-len caps the longest sequence the server will accept.
Getting these flags wrong is the primary source of OOM errors and silent performance regressions. Too-high GPU memory utilisation leaves no room for the KV-cache of long prompts; too-low wastes expensive VRAM. Tensor parallel degree that does not divide the attention head count cleanly causes an error at load time. This generator surfaces the dependencies so you can tune them interactively before running the server.
Common use cases
- Deploying open-weight models — generate the correct serve command for Llama 3, Qwen 2.5, Mistral, or any HuggingFace model without consulting the docs each time.
- Multi-GPU scaling — quickly set tensor-parallel size to match your node topology (2-, 4-, or 8-GPU setups) and verify the command before launching.
- Quantised model serving — combine AWQ/GPTQ quantised checkpoints with the right dtype flag and ensure the quantisation argument is present.
- CI/CD deployment scripts — copy the generated command directly into a Dockerfile
CMD, Kubernetesargs, or a Bash launch script. - Memory debugging — toggle --enforce-eager to disable CUDA graph capture when diagnosing OOM crashes on new GPU hardware.