gitmyhub

flashinfer

Python ★ 5.9k updated 44m ago

FlashInfer: Kernel Library for LLM Serving

FlashInfer is a Python library of highly optimized GPU kernels that makes AI language models run faster at inference time on Nvidia GPUs, reducing response latency and GPU costs for teams serving LLMs to users.

PythonCUDAC++setup: hardcomplexity 4/5

FlashInfer is a Python library that makes AI language models run faster on Nvidia GPUs. It does this by providing carefully optimized low-level code, called kernels, that handle the most computationally intense parts of running these models. The main operation it handles is called attention, a calculation that language models perform constantly to understand the relationships between words and tokens in a sequence.

When you run a large language model, the GPU spends a lot of time on attention calculations and on matrix multiplication. FlashInfer provides pre-written, highly tuned versions of these operations that run faster than default implementations. It supports Nvidia GPUs from the Turing generation (around 2018) through the latest Blackwell cards, and it automatically selects the best approach for your specific hardware.

The library is designed for people building production systems that serve AI models to users, not for those training models from scratch. It handles memory management techniques like paged KV-cache, which helps when you are serving many users at once with different conversation lengths. It also includes support for mixture-of-experts model architectures used by models like DeepSeek, and for speculative decoding, a technique that can increase output speed by generating and verifying multiple tokens in parallel.

Installation is done through pip. You can install the core package, which compiles or downloads the needed kernel code on first use, or install pre-compiled binaries to skip that step. The library also includes command-line tools for checking your setup, listing installed modules, and managing cached kernel files.

FlashInfer is suited for teams running inference infrastructure for large language models and looking to reduce latency or GPU costs. It sits at a lower level than frameworks like vLLM or TensorRT, and those frameworks sometimes use it underneath their own abstractions.

Where it fits