vLLM 0.20.x Adds DeepSeek V4, CUDA 13.0 Default

What Shipped

The three-release sequence spans a major feature drop and two targeted stabilization patches. Version 0.20.0, which drew 752 commits from 320 contributors (123 of them new), established the headline changes: initial DeepSeek V4 support, the CUDA 13.0 default, the PyTorch 2.11 dependency bump, FlashAttention 4 as the default MLA prefill backend, and a broadened model roster [3]. Version 0.20.1 followed as a patch focused almost entirely on DeepSeek V4 stabilization and performance, alongside a handful of general bug fixes [2]. Version 0.20.2, with contributions from six developers, then addressed a narrower set of issues: a persistent topk hang on Hopper hardware, a KV cache allocation failure in the V1 engine, and a Qwen3-VL boundary check error [1].

DeepSeek V4 Integration

DeepSeek V4 support arrived in v0.20.0 with an initial landing commit, accompanied by a token-leakage fix in DSV4/3.2, a DSA and MTP IMA fix, and a silu clamp limit on the shared expert [3]. The model’s complexity drove the bulk of v0.20.1’s work. That release added base model support, multi-stream pre-attention GEMM with a configurable knob and tuned default threshold, BF16 and MXFP8 all-to-all support for FlashInfer one-sided communication, and a PTX cvt instruction for faster FP32-to-FP4 conversion [2].

Sparse attention proved particularly problematic. Version 0.20.1 fixed a persistent topk cooperative deadlock at TopK=1024 and an inter-CTA initialization race on RadixRowState, temporarily disabling the persistent topk path as a workaround [2]. Version 0.20.2 re-enabled that path on Hopper hardware, ensuring the memset kernel runs at CUDA graph capture time regardless of max_seq_len, which resolved an MTP=1 hang [1]. The same patch release also corrected a “failure to allocate KV blocks” error in the V1 engine KV cache manager that had been introduced with the DeepSeek V4 changes [1].

CUDA 13.0 and PyTorch 2.11 Upgrade

Starting with v0.20.0, the default PyPI wheel and the vllm/vllm-openai:v0.20.0 Docker image ship against CUDA 13.0, specifically bumped to 13.0.2 to align with PyTorch 2.11.0 [3]. The project’s stated policy is to track PyTorch’s CUDA version choices going forward.

The PyTorch 2.11 upgrade is explicitly flagged as a breaking change for environment dependencies [3]. XPU support, previously pinned to PyTorch 2.10, also moves to 2.11 in this release. Operators whose environments remain on CUDA 12.9 are directed to install vLLM using uv with the --torch-backend=cu129 flag rather than the default wheel [3].
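As a rough illustration of the install choice described above, the sketch below builds the uv command line for the two cases the release notes cover. The helper name build_install_command is hypothetical, and the "uv pip install" subcommand form is an assumption; only the --torch-backend=cu129 flag comes from the release notes. CUDA versions other than 13.0 and 12.9 are rejected rather than guessed at.

```python
from typing import List


def build_install_command(cuda_version: str, package: str = "vllm") -> List[str]:
    """Return a uv command line for the given CUDA version string.

    Hypothetical helper. Only the two cases described in the release
    notes are handled: CUDA 13.0 (default wheel) and CUDA 12.9
    (explicit cu129 backend).
    """
    # Reduce "13.0.2" to "13.0" so patch-level versions match too.
    major_minor = ".".join(cuda_version.strip().split(".")[:2])
    command = ["uv", "pip", "install", package]
    if major_minor == "13.0":
        # The default wheel already targets CUDA 13.0.
        return command
    if major_minor == "12.9":
        return command + ["--torch-backend=cu129"]
    raise ValueError(f"no documented install path for CUDA {cuda_version!r}")


if __name__ == "__main__":
    print(" ".join(build_install_command("12.9")))
```

Returning the command as a list rather than a string keeps it safe to pass to subprocess.run without shell quoting.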

Additional Model and Framework Support

Version 0.20.0 expanded the supported model roster with Hunyuan v3 (Hy3) in preview form, including a dedicated HYV3 reasoning parser, and added Granite 4.1 Vision as a built-in multimodal model [3]. BailingMoE V2.5 received fixes in v0.20.1, which corrected a linear layer bug and an MLA RoPE rotation issue [2].

On the framework side, vLLM now runs against HuggingFace Transformers v5 (transformers>=5), with a vision-encoder torch.compile bypass and a series of v4/v5 compatibility fixes covering PaddleOCR-VL image processor parameters, a Mistral YaRN warning, and a Jina ColBERT rotary inverse-frequency recompute [3]. Python 3.14 was also added to the supported version list in v0.20.0 [3].

Inference Backend Changes

FlashAttention 4 becomes the default MLA prefill backend in v0.20.0, re-enabled after earlier caution, with head-dimension 512 and paged-KV support on SM90+ hardware [3]. The release also introduced TurboQuant 2-bit KV cache compression — described as delivering 4x capacity — with FA3 and FA4 prefill support added alongside [3]. A new online quantization frontend consolidates experts_int8 into the FP8 online path and adds MXFP8 online quantization [3]. In v0.20.2, a fix ensures MXFP4 works correctly under torch.compile for gpt-oss models by threading hidden_dim_unpadded through the moe_forward fake operation [1].
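For operators unsure whether a given GPU falls on the SM90+ side of that line, a minimal sketch of the hardware check follows. It compares only the compute capability against (9, 0) and does not reproduce vLLM's own backend selection logic; the function names is_sm90_or_newer and local_capability are hypothetical. The only library call used is PyTorch's torch.cuda.get_device_capability, imported lazily so the comparison can run without a GPU.

```python
from typing import Optional, Tuple


def is_sm90_or_newer(capability: Tuple[int, int]) -> bool:
    """True when a (major, minor) compute capability is at least SM90."""
    return tuple(capability) >= (9, 0)


def local_capability() -> Optional[Tuple[int, int]]:
    """Return the current GPU's compute capability, or None if unavailable."""
    try:
        import torch  # imported lazily; not needed for the comparison itself
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability()


if __name__ == "__main__":
    cap = local_capability()
    if cap is None:
        print("No CUDA device visible; cannot determine compute capability.")
    else:
        print(f"SM{cap[0]}{cap[1]}: SM90 or newer = {is_sm90_or_newer(cap)}")
```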

Upgrade Considerations

Operators planning to upgrade face two primary environment-level changes. The PyTorch 2.11 dependency is a declared breaking change, meaning existing virtual environments built against PyTorch 2.10 or earlier will require rebuilding [3]. The CUDA 13.0 default affects both pip installs and Docker-based deployments; teams on CUDA 12.9 must explicitly opt into the older backend via the --torch-backend=cu129 flag [3].
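One way to flag environments that need rebuilding ahead of the upgrade is to compare the installed PyTorch version against 2.11. The sketch below is an assumption-laden illustration rather than an official tool: the names needs_rebuild and installed_torch_version are hypothetical, and the check looks only at the major and minor version numbers.

```python
import re
from importlib import metadata
from typing import Optional, Tuple

MIN_TORCH = (2, 11)  # PyTorch version required by vLLM 0.20.x per the release notes


def parse_major_minor(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from strings such as '2.10.0+cu129'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def needs_rebuild(installed_torch: Optional[str]) -> bool:
    """True when an existing torch install is older than 2.11."""
    if installed_torch is None:
        return False  # nothing installed yet, so there is nothing to rebuild
    # Tuple comparison avoids the string-compare trap where "2.9" > "2.11".
    return parse_major_minor(installed_torch) < MIN_TORCH


def installed_torch_version() -> Optional[str]:
    """Return the installed torch version, or None if torch is absent."""
    try:
        return metadata.version("torch")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("rebuild needed:", needs_rebuild(installed_torch_version()))
```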

Beyond environment setup, the patch release history points to several categories of bugs that affected early v0.20.0 adopters running DeepSeek V4: sparse attention hangs on Hopper GPUs, KV cache allocation failures in the V1 engine, and missing type conversions for non-streaming tool calls in DSV3.2 and V4 [1][2]. Teams running Qwen3-VL under heavy load should also note the boundary check fix shipped in v0.20.2 [1]. Reviewing the full changelogs for v0.20.1 and v0.20.2 before deploying v0.20.0 in production is advisable for any operator running these model families.
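The fixes listed in this article can be turned into a simple pre-deployment check. The sketch below maps each fix to the release that shipped it and reports which ones a given vLLM version lacks. The table covers only items whose release is stated above, the labels are informal descriptions rather than upstream issue titles, and the names FIXED_IN and missing_fixes are hypothetical.

```python
import re
from typing import Dict, List, Tuple

# Fix -> first release containing it, as described in this article.
FIXED_IN: Dict[str, Tuple[int, int, int]] = {
    "persistent topk deadlock at TopK=1024": (0, 20, 1),
    "BailingMoE V2.5 linear layer and MLA RoPE fixes": (0, 20, 1),
    "persistent topk hang on Hopper": (0, 20, 2),
    "V1 engine KV block allocation failure": (0, 20, 2),
    "Qwen3-VL boundary check": (0, 20, 2),
}


def parse_version(version: str) -> Tuple[int, int, int]:
    """Extract (major, minor, patch) from a string such as '0.20.1'."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2)), int(match.group(3))


def missing_fixes(installed: str) -> List[str]:
    """Return the fixes from FIXED_IN that the installed version lacks."""
    current = parse_version(installed)
    return sorted(name for name, fixed in FIXED_IN.items() if current < fixed)


if __name__ == "__main__":
    for name in missing_fixes("0.20.0"):
        print("missing:", name)
```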

FAQ

Q. Is the CUDA 13.0 wheel compatible with CUDA 12.x driver installations? CUDA 13.0 runtime wheels generally require a driver that supports CUDA 13.0. The vLLM project explicitly recommends using uv with --torch-backend=cu129 for environments running CUDA 12.9 drivers rather than using the default PyPI wheel [3].

Q. What does the PyTorch 2.11 breaking change mean for existing deployments? The project labels the PyTorch 2.11 upgrade a breaking change for environment dependencies, meaning any environment pinned to PyTorch 2.10 or earlier must be rebuilt before installing vLLM 0.20.x [3]. XPU deployments previously pinned to 2.10 are also affected.

Q. Are the DeepSeek V4 sparse attention issues fully resolved in v0.20.2? Version 0.20.2 re-enables the persistent topk path on Hopper hardware, which v0.20.1 had temporarily disabled as a workaround, and in doing so resolves an MTP=1 hang [1][2]. The KV cache allocation error in the V1 engine is also patched in v0.20.2 [1], though operators should monitor upstream issue trackers for any further regressions.

Q. Does FlashAttention 4 as the default MLA prefill backend affect all GPU generations? The FA4 MLA prefill path with head-dimension 512 and paged-KV support is targeted at SM90+ hardware [3]. Deployments on older GPU architectures may follow different backend selection logic, and operators should verify backend selection in their specific hardware configurations.

Q. Which models beyond DeepSeek V4 received fixes in the patch releases? Version 0.20.1 addressed a linear layer bug and an MLA RoPE rotation issue in BailingMoE V2.5 [2]. Version 0.20.2 fixed an invalid boundary check in Qwen3-VL that could cause failures under heavy load, and corrected MXFP4 behavior under torch.compile for gpt-oss models [1].

Key takeaways

vLLM 0.20.0 ships initial DeepSeek V4 support; v0.20.1 adds multi-stream pre-attention GEMM, and v0.20.1 and v0.20.2 together resolve sparse attention hangs and KV cache allocation failures [1][2][3].
The default PyPI wheel and Docker image now target CUDA 13.0; operators on CUDA 12.9 must use --torch-backend=cu129 to avoid environment mismatches [3].
PyTorch 2.11 is a declared breaking dependency change, requiring environment rebuilds for any deployment previously on 2.10 or earlier [3].
FlashAttention 4 becomes the default MLA prefill backend on SM90+ hardware, and TurboQuant 2-bit KV cache compression gains FA3 and FA4 prefill support [3].
Hunyuan v3 and Granite 4.1 Vision join the model roster in v0.20.0, along with Transformers v5 compatibility and Python 3.14 support; BailingMoE V2.5 receives fixes in v0.20.1 [2][3].