LLM settings
These variables control the core language model configuration.

| Variable | Default | Type(s) | Description |
|---|---|---|---|
MODEL_NAME | facebook/opt-125m | str | Hugging Face repo ID or local filesystem path for the model weights. Change this to deploy a different model. |
MODEL_REVISION | main | str | The model revision to load. |
TOKENIZER | None | str | The name or path of the Hugging Face tokenizer to use. |
SKIP_TOKENIZER_INIT | False | bool | If True, skips the initialization of the tokenizer and detokenizer. |
TOKENIZER_MODE | auto | auto, slow | The tokenizer mode. |
TRUST_REMOTE_CODE | False | bool | If True, trusts remote code from Hugging Face. |
DOWNLOAD_DIR | None | str | The directory to download and load the model weights from. |
LOAD_FORMAT | auto | str | The format of the model weights to load. |
HF_TOKEN | - | str | Hugging Face token required to download gated/private models. Not needed for public models. Provide it via secrets. |
DTYPE | auto | auto, half, float16, bfloat16, float, float32 | The data type for model weights and activations. |
KV_CACHE_DTYPE | auto | auto, fp8 | The data type for KV cache storage. |
QUANTIZATION_PARAM_PATH | None | str | The path to the JSON file containing the KV cache scaling factors. |
MAX_MODEL_LEN | None | int | Maximum context length (tokens) the engine will allocate KV cache for. Lower it to reduce VRAM usage; raise it for long-context models if supported and you have VRAM. |
GUIDED_DECODING_BACKEND | outlines | outlines, lm-format-enforcer | The default engine for guided decoding. |
DISTRIBUTED_EXECUTOR_BACKEND | None | ray, mp | The backend to use for distributed serving. |
WORKER_USE_RAY | False | bool | Deprecated. Use DISTRIBUTED_EXECUTOR_BACKEND=ray instead. |
PIPELINE_PARALLEL_SIZE | 1 | int | The number of pipeline stages. |
TENSOR_PARALLEL_SIZE | 1 | int | Tensor parallel degree (number of GPUs to shard across). On multi-GPU machines, this worker auto-sets it to the number of visible GPUs. |
MAX_PARALLEL_LOADING_WORKERS | None | int | Loads the model sequentially in multiple batches to avoid RAM OOM when using tensor parallelism and large models. |
RAY_WORKERS_USE_NSIGHT | False | bool | If True, uses nsight to profile Ray workers. |
ENABLE_PREFIX_CACHING | False | bool | If True, enables automatic prefix caching. |
DISABLE_SLIDING_WINDOW | False | bool | If True, disables the sliding window, capping the maximum sequence length to the sliding window size. |
USE_V2_BLOCK_MANAGER | False | bool | If True, uses the BlockSpaceManagerV2. |
NUM_LOOKAHEAD_SLOTS | 0 | int | The number of lookahead slots, an experimental scheduling configuration for speculative decoding. |
SEED | 0 | int | The random seed for operations. |
NUM_GPU_BLOCKS_OVERRIDE | None | int | If specified, this value overrides the GPU profiling result for the number of GPU blocks. |
MAX_NUM_BATCHED_TOKENS | None | int | The maximum number of batched tokens per iteration. |
MAX_NUM_SEQS | 256 | int | Upper bound on sequences batched per iteration (affects throughput, VRAM, and tail latency). Higher can improve throughput for many concurrent short requests; lower reduces VRAM usage. |
MAX_LOGPROBS | 20 | int | The maximum number of log probabilities to return when logprobs is specified in SamplingParams. |
DISABLE_LOG_STATS | False | bool | If True, disables logging statistics. |
QUANTIZATION | None | awq, squeezellm, gptq, bitsandbytes | Quantization backend for loading quantized checkpoints (AWQ/GPTQ/…) or BitsAndBytes. Must match the checkpoint format. |
ROPE_SCALING | None | dict | The RoPE scaling configuration in JSON format. |
ROPE_THETA | None | float | The RoPE theta value. Use with ROPE_SCALING. |
TOKENIZER_POOL_SIZE | 0 | int | The size of the tokenizer pool for asynchronous tokenization. |
TOKENIZER_POOL_TYPE | ray | str | The type of the tokenizer pool for asynchronous tokenization. |
TOKENIZER_POOL_EXTRA_CONFIG | None | dict | Extra configuration for the tokenizer pool. |
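
Most of these variables correspond to vLLM engine arguments. The sketch below is illustrative only, assuming vLLM's `AsyncEngineArgs` interface; the worker's actual wiring (and the exact argument names in your vLLM version) may differ:

```python
import os

from vllm.engine.arg_utils import AsyncEngineArgs

# Illustrative mapping of a few worker environment variables onto vLLM engine
# arguments. Not the worker's actual code; argument names may vary by version.
engine_args = AsyncEngineArgs(
    model=os.getenv("MODEL_NAME", "facebook/opt-125m"),
    revision=os.getenv("MODEL_REVISION", "main"),
    tokenizer=os.getenv("TOKENIZER") or None,
    dtype=os.getenv("DTYPE", "auto"),
    kv_cache_dtype=os.getenv("KV_CACHE_DTYPE", "auto"),
    max_model_len=int(os.environ["MAX_MODEL_LEN"]) if os.getenv("MAX_MODEL_LEN") else None,
    tensor_parallel_size=int(os.getenv("TENSOR_PARALLEL_SIZE", "1")),
    trust_remote_code=os.getenv("TRUST_REMOTE_CODE", "False").lower() in ("1", "true"),
    gpu_memory_utilization=float(os.getenv("GPU_MEMORY_UTILIZATION", "0.95")),
)
```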
LoRA settings
Configure LoRA (Low-Rank Adaptation) adapters for your model.

| Variable | Default | Type | Description |
|---|---|---|---|
ENABLE_LORA | False | bool | If True, enables the handling of LoRA adapters. |
MAX_LORAS | 1 | int | The maximum number of LoRAs in a single batch. |
MAX_LORA_RANK | 16 | int | The maximum LoRA rank. |
LORA_EXTRA_VOCAB_SIZE | 256 | int | The maximum size of the extra vocabulary for LoRA adapters. |
LORA_DTYPE | auto | auto, float16, bfloat16, float32 | The data type for LoRA. |
LONG_LORA_SCALING_FACTORS | None | tuple | Specifies multiple scaling factors for LoRA adapters. |
MAX_CPU_LORAS | None | int | The maximum number of LoRAs to store in CPU memory. |
FULLY_SHARDED_LORAS | False | bool | If True, enables fully sharded LoRA layers. |
LORA_MODULES | [] | list[dict] | A list of LoRA adapters to add from Hugging Face. Example: [{"name": "adapter1", "path": "user/adapter1"}] |
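
The LORA_MODULES value is a JSON list of adapter entries. A minimal sketch of the expected shape, using the hypothetical adapter name and repo from the table above:

```python
import json
import os

# Hypothetical value, matching the format shown in the table above.
os.environ["LORA_MODULES"] = '[{"name": "adapter1", "path": "user/adapter1"}]'

adapters = json.loads(os.environ.get("LORA_MODULES", "[]"))
for adapter in adapters:
    # "name" identifies the adapter; "path" is the Hugging Face repo
    # (or local path) containing the LoRA weights.
    print(adapter["name"], "->", adapter["path"])
```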
Speculative decoding settings
Configure speculative decoding to improve inference performance.

| Variable | Default | Type(s) | Description |
|---|---|---|---|
SCHEDULER_DELAY_FACTOR | 0.0 | float | Applies a delay before scheduling the next prompt. |
ENABLE_CHUNKED_PREFILL | False | bool | If True, enables chunked prefill requests. |
SPECULATIVE_MODEL | None | str | The name of the draft model for speculative decoding. |
NUM_SPECULATIVE_TOKENS | None | int | The number of speculative tokens to sample from the draft model. |
SPECULATIVE_DRAFT_TENSOR_PARALLEL_SIZE | None | int | The number of tensor parallel replicas for the draft model. |
SPECULATIVE_MAX_MODEL_LEN | None | int | The maximum sequence length supported by the draft model. |
SPECULATIVE_DISABLE_BY_BATCH_SIZE | None | int | Disables speculative decoding if the number of enqueued requests is larger than this value. |
NGRAM_PROMPT_LOOKUP_MAX | None | int | The maximum window size for ngram prompt lookup in speculative decoding. |
NGRAM_PROMPT_LOOKUP_MIN | None | int | The minimum window size for ngram prompt lookup in speculative decoding. |
SPEC_DECODING_ACCEPTANCE_METHOD | rejection_sampler | rejection_sampler, typical_acceptance_sampler | The acceptance method for draft token verification in speculative decoding. |
TYPICAL_ACCEPTANCE_SAMPLER_POSTERIOR_THRESHOLD | None | float | Sets the lower bound threshold for the posterior probability of a token to be accepted. |
TYPICAL_ACCEPTANCE_SAMPLER_POSTERIOR_ALPHA | None | float | A scaling factor for the entropy-based threshold for token acceptance. |
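
As an illustration, ngram-based speculative decoding (where vLLM drafts tokens by matching n-grams from the prompt instead of running a separate draft model) combines several of these variables. The values below are a sketch, assuming they are passed straight through to vLLM; `[ngram]` is vLLM's sentinel value for prompt-lookup drafting:

```python
import os

# Illustrative values only (not defaults); tune for your model and workload.
os.environ.update({
    "SPECULATIVE_MODEL": "[ngram]",   # vLLM's sentinel for prompt n-gram lookup drafting
    "NUM_SPECULATIVE_TOKENS": "5",    # draft tokens proposed per step
    "NGRAM_PROMPT_LOOKUP_MAX": "4",   # largest n-gram window to match against the prompt
    "NGRAM_PROMPT_LOOKUP_MIN": "2",   # smallest n-gram window to match
})
```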
System performance settings
Configure GPU memory and system resource utilization.

| Variable | Default | Type(s) | Description |
|---|---|---|---|
GPU_MEMORY_UTILIZATION | 0.95 | float | Fraction of GPU VRAM vLLM is allowed to use for KV cache and runtime allocations. Lower if you hit CUDA OOM; raise if you have VRAM headroom. |
MAX_PARALLEL_LOADING_WORKERS | None | int | Loads the model sequentially in multiple batches to avoid RAM OOM when using tensor parallelism and large models. |
BLOCK_SIZE | 16 | 8, 16, 32 | The token block size for contiguous chunks of tokens. |
SWAP_SPACE | 4 | int | The CPU swap space size (in GiB) per GPU. |
ENFORCE_EAGER | False | bool | If True, always uses eager-mode PyTorch. If False, uses a hybrid of eager mode and CUDA graphs for maximal performance and flexibility. |
MAX_SEQ_LEN_TO_CAPTURE | 8192 | int | Maximum context length covered by CUDA graphs. When a sequence has context length larger than this, we fall back to eager mode. |
DISABLE_CUSTOM_ALL_REDUCE | 0 | int | If 0, enables custom all-reduce. If 1, disables it. |
ENABLE_EXPERT_PARALLEL | False | bool | Enable expert parallelism for Mixture of Experts (MoE) models. |
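
GPU_MEMORY_UTILIZATION, BLOCK_SIZE, and MAX_MODEL_LEN interact through the size of the KV cache. The back-of-the-envelope sketch below assumes a Llama-2-7B-like architecture in fp16 on a 24 GB GPU; real numbers depend on the model and on vLLM's own memory profiling:

```python
# Rough KV-cache sizing, assuming a Llama-2-7B-like model in fp16.
num_layers = 32
num_kv_heads = 32
head_dim = 128
dtype_bytes = 2          # fp16/bf16
block_size = 16          # BLOCK_SIZE: tokens per KV-cache block

# Both K and V are cached, hence the factor of 2.
bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
bytes_per_block = bytes_per_token * block_size

gpu_vram_bytes = 24 * 1024**3    # e.g. a 24 GiB GPU
weights_bytes = 14 * 1024**3     # roughly 14 GiB for 7B fp16 weights (2 bytes/param)
usable = 0.95 * gpu_vram_bytes - weights_bytes   # GPU_MEMORY_UTILIZATION = 0.95

print(f"~{bytes_per_token / 2**20:.2f} MiB of KV cache per token")
print(f"~{usable / bytes_per_block:.0f} KV-cache blocks available")
```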
Tokenizer settings
Customize tokenizer behavior and chat templates.

| Variable | Default | Type(s) | Description |
|---|---|---|---|
TOKENIZER_NAME | None | str | The tokenizer repository to use when you want a different tokenizer than the model's default. |
TOKENIZER_REVISION | None | str | The tokenizer revision to load. |
CUSTOM_CHAT_TEMPLATE | None | str of single-line jinja template | Override the model chat template (single-line Jinja2). Useful when sending messages to a base model without a built-in chat template. See the Hugging Face documentation for more information. |
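
For example, a minimal single-line Jinja template for a base model without a built-in chat template might look like the following. This is a hypothetical template shown as a Python string and rendered with jinja2 for illustration; match the prompt format your model was actually trained on:

```python
from jinja2 import Template

# A hypothetical single-line template suitable for CUSTOM_CHAT_TEMPLATE.
CUSTOM_CHAT_TEMPLATE = (
    "{% for message in messages %}{{ message['role'] }}: "
    "{{ message['content'] }}\n{% endfor %}assistant:"
)

rendered = Template(CUSTOM_CHAT_TEMPLATE).render(
    messages=[{"role": "user", "content": "What is the capital of France?"}]
)
print(rendered)
# user: What is the capital of France?
# assistant:
```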
Streaming and batch settings
These settings control how tokens are batched in HTTP responses when streaming. The batch size starts at DEFAULT_MIN_BATCH_SIZE and increases by a factor of DEFAULT_BATCH_SIZE_GROWTH_FACTOR with each request until it reaches DEFAULT_BATCH_SIZE.
For example, with default values, the batch sizes would be 1, 3, 9, 27, and then 50 for all subsequent requests. These settings do not affect vLLM’s internal batching.
| Variable | Default | Type(s) | Description |
|---|---|---|---|
DEFAULT_BATCH_SIZE | 50 | int | The default and maximum batch size for token streaming. |
DEFAULT_MIN_BATCH_SIZE | 1 | int | The initial batch size for the first request. |
DEFAULT_BATCH_SIZE_GROWTH_FACTOR | 3 | float | The growth factor for the dynamic batch size. |
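
The growth schedule from the example above can be reproduced with a few lines. This is a sketch of the arithmetic, not the worker's actual code:

```python
# Reproduces the example schedule: 1, 3, 9, 27, 50, 50, ...
batch_size = 1            # DEFAULT_MIN_BATCH_SIZE
growth_factor = 3         # DEFAULT_BATCH_SIZE_GROWTH_FACTOR
max_batch_size = 50       # DEFAULT_BATCH_SIZE

schedule = []
for _ in range(6):
    schedule.append(int(batch_size))
    batch_size = min(batch_size * growth_factor, max_batch_size)

print(schedule)  # [1, 3, 9, 27, 50, 50]
```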
OpenAI compatibility settings
Configure OpenAI API compatibility features.

| Variable | Default | Type(s) | Description |
|---|---|---|---|
RAW_OPENAI_OUTPUT | 1 | boolean as int | If 1, enables raw OpenAI SSE format string output when streaming. This is required for OpenAI compatibility. |
OPENAI_SERVED_MODEL_NAME_OVERRIDE | None | str | Exposes a custom model ID via /v1/models and accepts it as the model field in OpenAI requests (alias for the served model). |
OPENAI_RESPONSE_ROLE | assistant | str | The role of the LLM’s response in OpenAI chat completions. |
ENABLE_AUTO_TOOL_CHOICE | false | bool | Enables vLLM automatic tool selection for OpenAI Chat Completions. Only enable for tool-capable models. |
TOOL_CALL_PARSER | None | str | Tool-call parser that matches your model’s tool-call format (required for most tool-calling models). Supported values: mistral, hermes, llama3_json, llama4_json, llama4_pythonic, granite, granite-20b-fc, deepseek_v3, internlm, jamba, phi4_mini_json, pythonic. |
REASONING_PARSER | None | str | Parser for reasoning-capable models (enables reasoning mode). Examples: deepseek_r1, qwen3, granite, hunyuan_a13b. Leave unset to disable. |
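
As a usage sketch, requests can be sent with the standard openai Python client pointed at the endpoint's OpenAI-compatible base URL. The URL shape, endpoint ID placeholder, and model name below are assumptions; substitute your own values:

```python
from openai import OpenAI

# Assumed URL shape for a Runpod Serverless endpoint's OpenAI-compatible API;
# replace <ENDPOINT_ID> and the API key with your own values.
client = OpenAI(
    base_url="https://api.runpod.ai/v2/<ENDPOINT_ID>/openai/v1",
    api_key="<RUNPOD_API_KEY>",
)

response = client.chat.completions.create(
    model="facebook/opt-125m",  # or the OPENAI_SERVED_MODEL_NAME_OVERRIDE value, if set
    messages=[{"role": "user", "content": "Hello!"}],
    stream=False,
)
print(response.choices[0].message.content)
```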
TOOL_CALL_PARSER tells vLLM how to interpret a model’s tool-call output. If the parser doesn’t match the model’s format, tool calls may not be detected (or may error during parsing).

Serverless and concurrency settings
Configure concurrency and logging for Serverless deployments.

| Variable | Default | Type(s) | Description |
|---|---|---|---|
MAX_CONCURRENCY | 30 | int | Max concurrent requests per worker instance (Runpod-side). Not a vLLM engine arg; it controls Runpod worker concurrency and affects how requests are fed into vLLM (queueing/throughput/latency). |
DISABLE_LOG_STATS | False | bool | If True, disables vLLM stats logging. |
DISABLE_LOG_REQUESTS | False | bool | If True, disables vLLM request logging. |
Advanced settings
Additional configuration options for specialized use cases.

| Variable | Default | Type | Description |
|---|---|---|---|
MODEL_LOADER_EXTRA_CONFIG | None | dict | Extra configuration for the model loader. |
PREEMPTION_MODE | None | str | The preemption mode. If recompute, the engine performs preemption-aware recomputation. If save, the engine saves activations to CPU memory during preemption. |
PREEMPTION_CHECK_PERIOD | 1.0 | float | The frequency (in seconds) at which the engine checks for preemption. |
PREEMPTION_CPU_CAPACITY | 2 | float | The percentage of CPU memory to use for saved activations. |
DISABLE_LOGGING_REQUEST | False | bool | If True, disables logging requests. |
MAX_LOG_LEN | None | int | The maximum number of prompt characters or prompt ID numbers to print in the log. |
Docker build arguments
These variables are used when building custom Docker images with models baked in.

| Variable | Default | Type | Description |
|---|---|---|---|
BASE_PATH | /runpod-volume | str | The storage directory for the Hugging Face cache and model. |
WORKER_CUDA_VERSION | 12.1.0 | str | The CUDA version for the worker image. |
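
For example, these build arguments can be overridden at image-build time. The sketch below uses the docker Python SDK (passing --build-arg flags to docker build works equally well); the image tag is hypothetical, and the build context is assumed to contain the worker's Dockerfile:

```python
import docker

client = docker.from_env()

# Override the build arguments from the table above; the tag is hypothetical.
image, build_logs = client.images.build(
    path=".",  # directory containing the worker's Dockerfile
    tag="my-registry/worker-vllm:custom",
    buildargs={
        "WORKER_CUDA_VERSION": "12.1.0",
        "BASE_PATH": "/runpod-volume",
    },
)
print(image.id)
```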
Deprecated variables
| Old Variable | New Variable | Note |
|---|---|---|
MAX_CONTEXT_LEN_TO_CAPTURE | MAX_SEQ_LEN_TO_CAPTURE | Use the new variable name. |
kv_cache_dtype=fp8_e5m2 | kv_cache_dtype=fp8 | Simplified fp8 format. |