FlashMLA

C++ ★ 13k updated 1mo ago

FlashMLA: Efficient Multi-head Latent Attention Kernels

Highly optimized GPU attention kernels from DeepSeek that accelerate the most expensive part of running large language models, targeting NVIDIA H800 and B200 hardware specifically.

C++ · CUDA · Python · PyTorch · setup: hard · complexity 5/5

FlashMLA is a collection of highly optimized low-level computation routines released by DeepSeek, the company behind the DeepSeek-V3 series of AI models. These routines handle a specific operation called attention, which is one of the most computationally expensive parts of running or training large language models. The library is written to extract maximum performance from specific NVIDIA GPU hardware.

The library provides two broad categories of attention computation. Dense attention processes every token in a sequence, while sparse attention selectively processes only the most relevant tokens, reducing computation without sacrificing much accuracy. Both categories include variants optimized for the two main phases of AI model inference: the prefill phase, which processes the initial input prompt, and the decoding phase, which generates output tokens one at a time.
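To make the dense versus sparse distinction concrete, here is a minimal pure-Python sketch for a single query token. It is not FlashMLA's code or API: the function names and the `selected` index list are hypothetical, and how a real system chooses which tokens to keep is outside this sketch.

```python
import math


def dense_attention(query, keys, values):
    """Softmax attention of one query over every key/value pair."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    return [sum(w * value[d] for w, value in zip(weights, values)) for d in range(dim)]


def sparse_attention(query, keys, values, selected):
    """Same computation, restricted to the token positions in `selected`.

    `selected` is a hypothetical list of indices produced by some cheaper
    selection step; only those keys and values are read.
    """
    return dense_attention(
        query,
        [keys[i] for i in selected],
        [values[i] for i in selected],
    )


query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

dense_out = dense_attention(query, keys, values)          # uses all 3 tokens
sparse_out = sparse_attention(query, keys, values, [0, 1])  # skips token 2
```

The sparse variant does work proportional to the number of selected tokens rather than the full sequence length, which is where the savings come from when the selection is much smaller than the context.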

FlashMLA is intended for AI researchers and engineers who are running or building large language model inference systems, particularly those working with DeepSeek models. It is not a general-purpose library and requires specific high-end NVIDIA GPUs (the H800 or B200 class) along with recent versions of CUDA and PyTorch. The performance numbers cited in the documentation are measured in teraflops (TFLOPS), where one teraflop is one trillion floating-point operations per second; figures in the hundreds of teraflops give a sense of how specialized this code is.
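As a rough illustration of how a throughput figure for attention can be derived, the sketch below counts the floating-point operations in the two matrix products of attention (a multiply-add counted as 2 operations) and divides by elapsed time. The function names, the shapes, and the 1 ms timing are illustrative assumptions, not measurements or values taken from the FlashMLA documentation.

```python
def attention_flops(batch, heads, s_q, s_k, d_qk, d_v):
    """FLOPs for the two matmuls in attention.

    Q @ K^T costs 2 * s_q * s_k * d_qk per head, and P @ V costs
    2 * s_q * s_k * d_v per head. Softmax cost is ignored here.
    """
    return 2 * batch * heads * s_q * s_k * (d_qk + d_v)


def achieved_tflops(flops, seconds):
    """Convert an operation count and a wall-clock time to TFLOPS."""
    return flops / seconds / 1e12


# Hypothetical decode-style workload: one query token per sequence.
flops = attention_flops(batch=128, heads=128, s_q=1, s_k=4096, d_qk=576, d_v=512)

# Assumed timing of 1 millisecond, purely for the arithmetic.
print(f"{achieved_tflops(flops, 1e-3):.1f} TFLOPS")
```

Comparing such a number against the GPU's peak rate shows how close a kernel gets to the hardware limit, though decode-phase attention is often limited by memory bandwidth rather than arithmetic.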

Installation involves cloning the repository and running a standard Python package install command. Usage requires calling a small set of Python-facing functions that wrap the underlying GPU kernels.

Where it fits

Speed up large language model inference by replacing standard attention with FlashMLA's hardware-optimized GPU kernels.
Build a high-throughput DeepSeek-V3 inference pipeline using separate prefill and decode phase attention routines.
Benchmark attention operation throughput in teraflops on H800 or B200 GPUs to compare against baseline implementations.
Reduce per-token latency during the decode phase of LLM inference using sparse attention to skip less relevant tokens.
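The split between prefill and decode in the use cases above can be sketched with a toy key/value cache in pure Python. This is a conceptual model only, under the assumption of a simple list-based cache: `KVCache`, `prefill`, and `decode_step` are hypothetical names, not FlashMLA functions, and a real GPU kernel would process all prompt positions in parallel rather than in a loop.

```python
import math


def attend(query, keys, values):
    """Softmax attention of one query over the given keys and values."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    dim = len(values[0])
    return [sum(e / total * v[d] for e, v in zip(exps, values)) for d in range(dim)]


class KVCache:
    """Hypothetical cache holding the keys and values of all tokens seen so far."""

    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, key, value):
        self.keys.append(key)
        self.values.append(value)


def prefill(cache, queries, keys, values):
    """Process the whole prompt, filling the cache with causal attention."""
    outputs = []
    for q, k, v in zip(queries, keys, values):
        cache.append(k, v)  # position i only sees positions 0..i
        outputs.append(attend(q, cache.keys, cache.values))
    return outputs


def decode_step(cache, query, key, value):
    """Generate-time step: add one new token, attend over the full cache."""
    cache.append(key, value)
    return attend(query, cache.keys, cache.values)


prompt_q = [[1.0, 0.0], [0.0, 1.0]]
prompt_k = [[1.0, 0.0], [0.0, 1.0]]
prompt_v = [[1.0, 2.0], [3.0, 4.0]]

cache = KVCache()
prompt_out = prefill(cache, prompt_q, prompt_k, prompt_v)
next_out = decode_step(cache, [1.0, 1.0], [1.0, 1.0], [5.0, 6.0])
```

Prefill handles many query positions at once and tends to be compute-heavy, while each decode step has a single query reading the entire cache, which is why the two phases benefit from separately tuned kernels.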
