tiny-gpu

SystemVerilog ★ 13k updated 1y ago

A minimal GPU design in Verilog to learn how GPUs work from the ground up

A minimal GPU built in Verilog (a hardware description language) specifically to teach how GPUs work internally. It covers the core ideas it shares with the chips used for AI training, in under 15 readable files.

SystemVerilog · Verilog · setup: moderate · complexity 5/5

tiny-gpu is a minimal GPU implementation written in Verilog, a hardware description language used to design digital circuits. The project is explicitly built for learning: it strips away the complexity of real graphics cards to expose the core architectural ideas that all GPUs share, including the general-purpose accelerators used in AI training.

The README opens by noting that while there are many resources for learning how CPUs work at a hardware level, the GPU market is so competitive that low-level architectural details stay proprietary. This project fills that gap by building a simplified but functional GPU from scratch in under 15 well-commented files.

The architecture covers the main components found in a real GPU: a dispatcher that breaks work into thread groups (called blocks) and assigns them to compute cores; memory controllers that arbitrate the cores' access to bandwidth-limited external memory; a cache that stores recently fetched data to avoid redundant memory trips; and the compute cores themselves. Each core contains a scheduler, an instruction fetcher, a decoder, and per-thread resources: an ALU for arithmetic, an LSU for memory loads and stores, a program counter, and a register file. Because each thread's register file holds data specific to that thread, the same instruction can operate on different data across many threads in parallel.
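The dispatcher's job can be sketched as a small behavioural model in Python. This is an illustration rather than the project's Verilog: the function name `dispatch`, its parameters, and the simplification that all blocks in a round finish together are assumptions made for the example.

```python
from collections import deque

def dispatch(total_threads, threads_per_block, num_cores):
    """Split a kernel launch into blocks and assign them to cores.

    Hypothetical model: blocks are handed out in rounds, and every block
    in a round is assumed to finish before the next round starts.
    """
    # Ceiling division: the last block may be only partially full.
    num_blocks = (total_threads + threads_per_block - 1) // threads_per_block
    pending = deque(range(num_blocks))
    schedule = []  # one dict per round: core index -> (block id, thread count)

    while pending:
        assignments = {}
        for core in range(num_cores):
            if not pending:
                break
            block_id = pending.popleft()
            first_thread = block_id * threads_per_block
            count = min(threads_per_block, total_threads - first_thread)
            assignments[core] = (block_id, count)
        schedule.append(assignments)
    return schedule

if __name__ == "__main__":
    # 10 threads, 4 threads per block, 2 cores -> 3 blocks over 2 rounds.
    for round_number, assignments in enumerate(dispatch(10, 4, 2)):
        print(round_number, assignments)
```

A hardware dispatcher would more plausibly hand a new block to each core as soon as that core reports it is done, rather than waiting for a whole round. The round-based loop only keeps the sketch short.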

The project also includes a custom instruction set (ISA), working example kernels for matrix addition and matrix multiplication, and tooling to simulate kernel execution and view execution traces. The documentation explains not just how to use it, but why each design decision was made.
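To make the idea of a kernel concrete, here is a rough Python model of element-wise matrix addition in which every thread runs the same steps on its own registers. The memory layout (A, then B, then C), the register names, and the function `run_matadd` are hypothetical and do not reproduce the project's actual ISA or kernel source.

```python
def run_matadd(a, b, threads_per_block=4):
    """Simulate one kernel launch computing C[i] = A[i] + B[i].

    Each thread derives its element index from its block and thread
    indices, then runs the same load/add/store sequence on private registers.
    """
    n = len(a)
    assert len(b) == n, "inputs must have the same number of elements"

    # Assumed flat data memory layout: A, then B, then room for C.
    memory = list(a) + list(b) + [0] * n
    base_a, base_b, base_c = 0, n, 2 * n

    trace = []  # (block index, thread index, register snapshot) per thread
    num_blocks = (n + threads_per_block - 1) // threads_per_block

    for block_idx in range(num_blocks):
        for thread_idx in range(threads_per_block):
            i = block_idx * threads_per_block + thread_idx
            if i >= n:
                continue  # surplus thread in a partially filled last block
            regs = {"i": i}                       # per-thread register file
            regs["x"] = memory[base_a + i]        # load A[i]
            regs["y"] = memory[base_b + i]        # load B[i]
            regs["sum"] = regs["x"] + regs["y"]   # ALU add
            memory[base_c + i] = regs["sum"]      # store C[i]
            trace.append((block_idx, thread_idx, dict(regs)))

    return memory[base_c:], trace

if __name__ == "__main__":
    result, trace = run_matadd(list(range(8)), list(range(8)))
    print(result)
    for entry in trace:
        print(entry)
```

The loops here run the threads one after another. In hardware, the threads of a block would step through each instruction together, and the recorded trace stands in for the kind of per-thread state an execution trace would show.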

The repo notes areas where production GPUs go further, such as warp scheduling and pipelining, and points to those as next steps for anyone who wants to go deeper after working through the basics.

Where it fits

Study how a GPU dispatches threads and manages memory by reading a simplified but working implementation.
Run the included matrix addition and matrix multiplication kernels in simulation to see how parallel thread execution works step by step.
Use the execution trace tooling to visualize what happens inside the GPU when a kernel runs.
Extend the custom instruction set or add warp scheduling to go deeper after mastering the basics.
