Most LLMs need specific configuration to run properly on vLLM, so you need to know what settings your model expects for loading, tokenization, and generation. This guide explains how to configure your vLLM endpoints for different model families: how environment variables map to vLLM command-line flags, recommended configurations for popular models, and how to select the right GPU for your model.

Why is vLLM so hard to configure?

Without the right settings, your vLLM workers may fail to load, produce incorrect outputs, or miss key features. vLLM supports hundreds of models, but default settings only work out of the box for a subset of them. Different model architectures have different requirements for tokenization, attention mechanisms, and features like tool calling or reasoning. For example, Mistral models need their own tokenizer mode and config format, while reasoning models like DeepSeek-R1 need a reasoning parser enabled. When deploying a model, check its Hugging Face README and the vLLM documentation for required or recommended settings.

Mapping environment variables to vLLM CLI flags

When running vLLM with vllm serve, you configure the engine with command-line flags. On Runpod, you set these options with environment variables instead. Each vLLM command-line argument has a corresponding environment variable: drop the leading dashes, replace hyphens with underscores, and uppercase the result, so --tokenizer_mode becomes TOKENIZER_MODE, --enable-auto-tool-choice becomes ENABLE_AUTO_TOOL_CHOICE, and so on. For all available vLLM engine arguments, see the vLLM engine arguments documentation.

Example: Deploying Mistral

To launch a Mistral model using the vLLM CLI, you could run the following command:
vllm serve mistralai/Ministral-8B-Instruct-2410 \
  --tokenizer_mode mistral \
  --config_format mistral \
  --load_format mistral \
  --enable-auto-tool-choice \
  --tool-call-parser mistral
On Runpod, set these options as environment variables when configuring your endpoint:
Environment variable | Value
MODEL_NAME | mistralai/Ministral-8B-Instruct-2410
TOKENIZER_MODE | mistral
CONFIG_FORMAT | mistral
LOAD_FORMAT | mistral
ENABLE_AUTO_TOOL_CHOICE | true
TOOL_CALL_PARSER | mistral
This pattern applies to any vLLM command-line flag. Find the corresponding environment variable name and add it to your endpoint configuration.
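As a sanity check, you can derive the variable name with a small shell one-liner (a sketch for illustration only, not something the worker needs):
flag="--enable-auto-tool-choice"
echo "$flag" | sed 's/^--//' | tr 'a-z-' 'A-Z_'
# Prints ENABLE_AUTO_TOOL_CHOICE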

Model-specific configurations

The table below lists recommended environment variables for popular model families. These settings handle common requirements like tokenization modes, tool calling support, and reasoning capabilities. Not all models in a family require all settings. Check your model’s documentation for exact requirements.
Model family | Example model | Key environment variables | Notes
Qwen3 | Qwen/Qwen3-8B | ENABLE_AUTO_TOOL_CHOICE=true, TOOL_CALL_PARSER=hermes | Qwen models often ship in various quantization formats. If you are deploying an AWQ or GPTQ version, ensure QUANTIZATION is set correctly (e.g., awq).
OpenChat | openchat/openchat-3.5-0106 | None required | OpenChat relies heavily on specific chat templates. If the default templates produce poor results, use CUSTOM_CHAT_TEMPLATE to inject the precise Jinja2 template required for the OpenChat correction format.
Gemma | google/gemma-3-1b-it | None required | Gemma models require an active Hugging Face token. Ensure your HF_TOKEN is set as a secret. Gemma also performs best when DTYPE is explicitly set to bfloat16 to match its native training precision.
DeepSeek-R1 | deepseek-ai/DeepSeek-R1-Distill-Qwen-7B | REASONING_PARSER=deepseek_r1 | Enables reasoning mode for chain-of-thought outputs.
Phi-4 | microsoft/Phi-4-mini-instruct | None required | Phi models are compact but have specific architectural quirks. Setting ENFORCE_EAGER=true can sometimes resolve initialization issues with Phi models on older CUDA versions, though it may slightly reduce performance compared to CUDA graphs.
Llama 3 | meta-llama/Llama-3.2-3B-Instruct | TOOL_CALL_PARSER=llama3_json, ENABLE_AUTO_TOOL_CHOICE=true | Llama 3 models often require strict attention to context window limits. Use MAX_MODEL_LEN to prevent the KV cache from exceeding your GPU VRAM. If you are using a 24 GB GPU like a 4090, setting MAX_MODEL_LEN to 8192 or 16384 is a safe starting point.
Mistral | mistralai/Ministral-8B-Instruct-2410 | TOKENIZER_MODE=mistral, CONFIG_FORMAT=mistral, LOAD_FORMAT=mistral, TOOL_CALL_PARSER=mistral, ENABLE_AUTO_TOOL_CHOICE=true | Mistral models require their own tokenizer, config, and load formats to work properly.
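To see how a table row maps back to the vLLM CLI, the DeepSeek-R1 entry above roughly corresponds to the following command on recent vLLM versions (for illustration only; on Runpod you set the environment variables instead):
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
  --reasoning-parser deepseek_r1
On Runpod, this becomes MODEL_NAME=deepseek-ai/DeepSeek-R1-Distill-Qwen-7B and REASONING_PARSER=deepseek_r1.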

Selecting GPU size based on the model

Selecting the right GPU for vLLM is a balance between model size, quantization, and your required context length. Because vLLM pre-allocates memory for its KV (Key-Value) cache to enable high-throughput serving, you generally need more VRAM than the bare minimum required just to load the model.

VRAM estimation formula

A reliable rule of thumb for estimating the required VRAM for a model in vLLM is:
  • FP16/BF16 (unquantized): 2 bytes per parameter.
  • INT8 quantized: 1 byte per parameter.
  • INT4 (AWQ/GPTQ): 0.5 bytes per parameter.
  • KV cache buffer: vLLM typically reserves 10-30% of remaining VRAM for the KV cache to handle concurrent requests.
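For example, the back-of-the-envelope arithmetic for an 8B-parameter model served in FP16 (illustrative numbers only) looks like this:
PARAMS_B=8          # model size in billions of parameters
BYTES_PER_PARAM=2   # FP16/BF16; use 1 for INT8, 0.5 for INT4 (AWQ/GPTQ)
echo "$PARAMS_B * $BYTES_PER_PARAM * 1.3" | bc   # ~30% extra for the KV cache
# Prints 20.8, i.e. roughly 21 GB, which should fit a 24 GB card such as an RTX 4090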
Use the table below as a starting point to select a hardware configuration for your model.
Model size (parameters) | Recommended GPUs | VRAM
Small (<10B) | RTX 4090, A6000, L4 | 16–24 GB
Medium (10B–30B) | A6000, L40S | 32–48 GB
Large (30B–70B) | A100, H100, B200 | 80–180 GB

Key factors

Here are some key factors to consider when selecting the right GPU for your model:

Context window vs. VRAM

The more context you need (e.g., 32k or 128k tokens), the more VRAM the KV cache consumes. If you encounter Out-of-Memory (OOM) errors, use the MAX_MODEL_LEN environment variable to cap the context. For example, a 7B model that OOMs at 32k context on a 24 GB card will often run perfectly at 16k.
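For example (illustrative values), capping a 7B model to a 16k context is a single environment variable, equivalent to the --max-model-len flag:
MAX_MODEL_LEN=16384
# CLI equivalent: vllm serve <model> --max-model-len 16384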

GPU memory utilization

By default, vLLM attempts to use 90% of the available VRAM (GPU_MEMORY_UTILIZATION=0.90).
  • If you OOM during initialization: Lower this to 0.85.
  • If you have extra headroom: Increase it to 0.95 to allow for more concurrent requests.
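For example (an illustrative setting, not a universal recommendation):
GPU_MEMORY_UTILIZATION=0.85
# CLI equivalent: vllm serve <model> --gpu-memory-utilization 0.85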

Quantization (AWQ/GPTQ)

If you are limited by a single GPU, use a quantized version of the model (e.g., Meta-Llama-3-8B-Instruct-AWQ). This reduces the weight memory by 50-75% compared to FP16, allowing you to fit larger models on cards like the RTX 4090 (24 GB) or A4000 (16 GB).
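For example, an endpoint serving an AWQ build might use settings like these (the repository name is illustrative; check Hugging Face for an actual AWQ upload of your model):
MODEL_NAME=<org>/Meta-Llama-3-8B-Instruct-AWQ
QUANTIZATION=awq
# CLI equivalent: vllm serve <org>/Meta-Llama-3-8B-Instruct-AWQ --quantization awq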
For production workloads where high availability is key, always select multiple GPU types in your Serverless endpoint configuration. This allows the system to fall back to a different hardware tier if your primary choice is out of stock in a specific data center.

vLLM recipes

vLLM provides step-by-step recipes for common deployment scenarios, including deploying specific models, optimizing performance, and integrating with frameworks. Find the recipes at docs.vllm.ai/projects/recipes. They are community-maintained and updated regularly as vLLM evolves. You can often find further information in the documentation for the specific model you are deploying. For example: