vllm-nvfp4-kv-sm120

Python ★ 15 updated 14d ago

NVFP4 KV cache for vLLM on SM120 (RTX PRO 6000) via FlashInfer FA2 explicit-SF-stride patch — ~1.5x fp8 pool at ~95-104% speed

A patch for vLLM that enables NVFP4 KV cache compression on Blackwell-architecture GPUs like the RTX PRO 6000, fitting roughly 78 percent more context into the same GPU memory.

Python · vLLM · FlashInfer · Docker · CUDA · setup: hard · complexity 5/5

This repository is a patch for vLLM, an open-source tool used to run large AI language models at high throughput. The patch unlocks a specific memory-saving feature on one particular NVIDIA GPU, the RTX PRO 6000, which is built on NVIDIA's Blackwell architecture (SM120).

The feature in question is called NVFP4 KV cache. When a language model processes text, it stores intermediate data, the KV cache, which grows with the length of the conversation or document being processed. Storing that cache in a highly compressed 4-bit format (NVFP4) means more of it fits in GPU memory, which allows longer conversations or more simultaneous users. The issue is that vLLM's built-in support for NVFP4 KV cache depends on pre-compiled GPU code that NVIDIA has not released for this GPU family. The patch works around that by routing the NVFP4 computation through FlashInfer's more general FA2 code path, which does work on these GPUs.
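To make the memory argument concrete, here is a rough sketch of how KV cache size per token scales with the storage format. It assumes NVFP4 stores each value in 4 bits plus one 8-bit scale factor shared by every block of 16 values, and it ignores any per-tensor scales or allocator overhead. The model shape is hypothetical and is not taken from the repository.

```python
# Bytes per stored KV element under each format (assumed layouts, see text above).
BYTES_PER_ELEMENT = {
    "bf16": 2.0,
    "fp8": 1.0,
    # NVFP4: 4-bit value (0.5 byte) plus one 1-byte scale shared by 16 values.
    "nvfp4": 0.5 + 1.0 / 16.0,
}


def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int, fmt: str) -> float:
    """Return the KV cache bytes needed for one token across all layers."""
    # The factor of 2 covers the separate key and value tensors.
    elements = 2 * num_layers * num_kv_heads * head_dim
    return elements * BYTES_PER_ELEMENT[fmt]


if __name__ == "__main__":
    # Hypothetical model shape, chosen only for illustration.
    shape = dict(num_layers=48, num_kv_heads=8, head_dim=128)
    for fmt in BYTES_PER_ELEMENT:
        kib = kv_bytes_per_token(fmt=fmt, **shape) / 1024
        print(f"{fmt}: {kib:.1f} KiB per token")
```

Under these assumptions an NVFP4 element costs 0.5625 bytes against 1 byte for fp8, so the same memory pool holds more tokens; real deployments will differ once block padding and metadata are counted.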

The practical result, as measured by the author on a two-GPU setup running a 198-billion-parameter model, is that the patch fits about 1.78 times as many tokens in the KV cache as the previous best option, at roughly the same decoding speed (within a few percent). That means about 78 percent more context capacity, or more concurrent users, without buying more hardware.
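The step from "1.78 times as many tokens" to "78 percent more" is simple arithmetic, sketched below. The 1.78 ratio is the figure reported above; the baseline token budget is a made-up number used only to show the scaling, and both function names are illustrative.

```python
def extra_capacity_percent(ratio: float) -> float:
    """Convert a capacity ratio (new / old) into a percentage increase."""
    return (ratio - 1.0) * 100.0


def scaled_token_budget(baseline_tokens: int, ratio: float) -> int:
    """Estimate how many tokens fit after applying the capacity ratio."""
    return round(baseline_tokens * ratio)


if __name__ == "__main__":
    reported_ratio = 1.78       # figure reported by the author
    baseline_tokens = 100_000   # hypothetical pool size before the patch

    print(f"Extra capacity: {extra_capacity_percent(reported_ratio):.0f}%")
    print(f"Token budget: {baseline_tokens} -> "
          f"{scaled_token_budget(baseline_tokens, reported_ratio)}")
```

The same extra capacity can be spent either on longer contexts per request or on more requests at the original context length.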

The patch modifies four files in vLLM and FlashInfer (a GPU attention library). It is version-pinned to specific releases of both libraries. Installation is either through Docker, by building a patched container image, or by running a shell script that copies the modified files over the installed versions. The repository includes test scripts so you can verify the patch works correctly on your specific model shape before using it in production.
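Because the patch overwrites files in specific releases, it can help to confirm the installed versions before copying anything. The following is a minimal sketch using Python's standard `importlib.metadata`. The version strings are placeholders, not the repository's actual pins, and the distribution name `flashinfer-python` is an assumption about how FlashInfer is installed in your environment. The repository's own install script may perform a different check.

```python
from importlib import metadata

# Hypothetical pins: replace with the versions the repository actually names.
PINS = {
    "vllm": "0.0.0",
    "flashinfer-python": "0.0.0",
}


def find_pin_mismatches(pins, get_version=metadata.version):
    """Return {package: (expected, installed)} for every package off its pin.

    `installed` is None when the package is not installed at all.
    """
    mismatches = {}
    for package, expected in pins.items():
        try:
            installed = get_version(package)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != expected:
            mismatches[package] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    problems = find_pin_mismatches(PINS)
    for package, (expected, installed) in problems.items():
        print(f"{package}: expected {expected}, found {installed or 'not installed'}")
    if not problems:
        print("All pinned versions match.")
```

Running a check like this first avoids copying patched files over a release they were not written for, which is the main risk of a file-overwrite install.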

This is a low-level engineering patch intended for people already running vLLM on Blackwell GPUs and looking to increase how much context their deployment can hold. It is Apache-2.0 licensed.

Where it fits

Fit roughly 78 percent more context into GPU memory on an RTX PRO 6000 without buying additional hardware.
Deploy the patch via Docker to run vLLM with NVFP4 KV cache on a Blackwell GPU setup.
Verify the patch works on your specific model shape using the included test scripts before enabling it in production.
