LeetCUDA

Cuda ★ 11k updated 4h ago

📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA Kernels, Tensor Cores, HGEMM, FA-2 MMA.🎉

A collection of 200+ working CUDA GPU programming examples organized by difficulty, covering high-performance matrix multiplication and Flash Attention, aimed at developers who want to learn GPU kernel development from first principles.

CUDA · C++ · Python · PyTorch · setup: hard · complexity 5/5

LeetCUDA is a collection of learning notes and working code examples for CUDA, a programming model used to run computations on NVIDIA graphics cards (GPUs). GPUs can process many operations in parallel, which makes them central to deep learning, scientific computing, and large-scale matrix calculations. Writing GPU code directly is considerably more complex than writing standard CPU code, and this repository is aimed at helping developers learn how to do it.

The collection includes more than 200 CUDA kernel implementations, organized by difficulty from easy through progressively harder levels. A kernel is a function that runs on the GPU. The examples range from basic operations to advanced techniques like matrix multiplication using Tensor Cores, which are specialized circuits on modern NVIDIA GPUs designed specifically to accelerate the kind of math used in neural networks.
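To make the idea of a kernel concrete, here is a minimal plain-Python sketch of the execution model, assuming the simplest possible case (element-wise vector addition). It is not code from the repository and it does not run on a GPU: the `launch` helper is a hypothetical stand-in that runs each "thread" one after another, whereas a real GPU runs them in parallel. The index arithmetic mirrors the usual CUDA expression `blockIdx.x * blockDim.x + threadIdx.x`.

```python
def vector_add_kernel(block_idx, thread_idx, block_dim, a, b, out):
    """Work done by one GPU thread: compute a single output element."""
    # Global element index, mirroring blockIdx.x * blockDim.x + threadIdx.x in CUDA.
    i = block_idx * block_dim + thread_idx
    # Bounds guard: the last block may contain more threads than remaining elements.
    if i < len(out):
        out[i] = a[i] + b[i]


def launch(kernel, grid_dim, block_dim, *args):
    """Hypothetical stand-in for a kernel launch; runs the threads sequentially."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, thread_idx, block_dim, *args)


n = 10
a = [float(i) for i in range(n)]
b = [10.0 * i for i in range(n)]
out = [0.0] * n

block_dim = 4                                 # threads per block
grid_dim = (n + block_dim - 1) // block_dim   # ceiling division: enough blocks to cover n
launch(vector_add_kernel, grid_dim, block_dim, a, b, out)
```

The point of the sketch is the division of labor: the kernel body describes what a single thread does, and the launch configuration (number of blocks and threads per block) decides how many copies of it run.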

Two major areas get extended treatment. The first is HGEMM, which stands for half-precision general matrix multiplication, a fundamental operation in training and running AI models. The implementations here reportedly reach 98 to 100 percent of the performance of NVIDIA's own cuBLAS library, which is the standard reference for GPU-accelerated linear algebra. The second area is Flash Attention, an algorithm that makes the attention mechanism in transformer models (the architecture behind most modern language models) faster and more memory-efficient.
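The memory savings in Flash Attention come from computing the softmax incrementally, so the full list of attention scores never has to be stored at once. The following is a minimal plain-Python sketch of that idea for a single query vector, under simplifying assumptions: it omits the usual scaling of scores by the square root of the head dimension, uses a hypothetical `block_size` parameter, and says nothing about how the repository's CUDA kernels lay out data on the GPU.

```python
import math


def attention_reference(q, keys, values):
    """Standard attention for one query: compute all scores, then softmax."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def attention_streaming(q, keys, values, block_size=2):
    """Same result, visiting keys/values in blocks with a running max and sum."""
    dim = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * dim  # unnormalized weighted sum of values seen so far
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new maximum.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [x * scale for x in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [x + w * vd for x, vd in zip(acc, v)]
        running_max = new_max
    return [x / running_sum for x in acc]


q = [1.0, 0.5]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 2.0], [0.5, 0.5]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]

ref = attention_reference(q, keys, values)
stream = attention_streaming(q, keys, values)
```

Both functions produce the same output up to floating-point rounding. The streaming version only ever holds one block of scores plus a few running values, which is the property that lets a GPU implementation keep its working set in fast on-chip memory.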

The repository also links to more than 100 blog posts covering related GPU programming topics. PyTorch, a widely used Python library for machine learning, appears throughout the examples as a companion tool. The intended audience is developers who already have programming experience and want to learn GPU kernel development from first principles.

Where it fits

Learn GPU kernel programming by working through 200+ examples that escalate from basic operations to advanced Tensor Core techniques.
Study high-performance HGEMM implementations that reportedly reach 98-100% of the performance of NVIDIA's cuBLAS library.
Implement Flash Attention in CUDA to make transformer model inference faster and more memory-efficient.
Use the linked blog posts alongside working code to understand GPU architecture concepts in practice.
