SageAttention

Cuda ★ 3.4k updated 5mo ago

[ICLR2025, ICML2025, NeurIPS2025 Spotlight] Quantized Attention achieves speedup of 2-5x compared to FlashAttention, without losing end-to-end metrics across language, image, and video models.

No plain-English explanation yet — one is being written right now. Check back in a minute.

Open on GitHub → Full breakdown on explaingit →