llm.c

Cuda · ★ 30k · updated 11mo ago

LLM training in simple, raw C/CUDA

A minimal, from-scratch implementation of GPT-2/GPT-3 training in C and CUDA, cutting out framework overhead to show exactly how language model training works.

C · CUDA · Python · NVIDIA GPU · setup: hard · complexity 4/5

llm.c implements large language model (LLM) training in plain C and CUDA, without depending on large frameworks like PyTorch. Its goal is to train the same kind of language models (specifically, to reproduce GPT-2 and GPT-3 class models) using code that is small, direct, and easy to read.

Most AI training code relies on heavyweight frameworks whose installs run to hundreds of megabytes. This project cuts that away: the reference CPU implementation, in full (fp32) precision, fits in roughly 1,000 lines of C, and the optimized GPU version uses CUDA (NVIDIA's programming platform for its graphics cards), which the project reports as slightly faster than PyTorch. The repository also includes a parallel reference implementation in Python (PyTorch) for comparison and testing.
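To give a feel for what training without a framework involves, here is a minimal sketch in C of one hand-written training step: a forward pass, a manually derived backward pass, and a gradient-descent update. It fits a toy one-weight linear model rather than a GPT, and the names LinearModel, train_step, and train are hypothetical, chosen for illustration; none of this is taken from the llm.c source.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model: pred = w * x + b. Illustrative only, not llm.c code. */
typedef struct {
    float w;
    float b;
} LinearModel;

/* One training step: forward pass, hand-derived backward pass, SGD update.
   Returns the mean squared error measured before the update. */
float train_step(LinearModel *m, const float *x, const float *y,
                 size_t n, float lr) {
    assert(m != NULL && n > 0);
    float loss = 0.0f, dw = 0.0f, db = 0.0f;
    for (size_t i = 0; i < n; i++) {
        float pred = m->w * x[i] + m->b;  /* forward pass */
        float err = pred - y[i];
        loss += err * err;
        dw += 2.0f * err * x[i];          /* d(err^2)/dw */
        db += 2.0f * err;                 /* d(err^2)/db */
    }
    float inv_n = 1.0f / (float)n;
    m->w -= lr * dw * inv_n;              /* gradient-descent update */
    m->b -= lr * db * inv_n;
    return loss * inv_n;
}

/* Runs `steps` updates and returns the loss reported by the last step. */
float train(LinearModel *m, const float *x, const float *y,
            size_t n, float lr, int steps) {
    float loss = 0.0f;
    for (int s = 0; s < steps; s++) {
        loss = train_step(m, x, y, n, lr);
    }
    return loss;
}
```

A real language model repeats the same three phases with far larger tensors and many more layers; the point of the sketch is only that each phase can be written out explicitly rather than delegated to a framework's automatic differentiation.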

You can run it on a CPU alone (useful for learning but slow for serious training), or on one or more NVIDIA GPUs for real training speed. It supports training on small datasets like a Shakespeare text corpus, and comes with scripts to download and tokenize data automatically.
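Once text has been tokenized into a stream of integer ids, next-token training pairs each position with the token that follows it. The sketch below shows that slicing step in C under stated assumptions: fill_batch is a hypothetical helper, the tokens are assumed to be already loaded into memory as 16-bit ids (enough for GPT-2's vocabulary of about 50k entries), and the on-disk format used by the project's own scripts is not modeled here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fills one batch of B rows, each T tokens long, starting at `start`.
   inputs[i] is a token and targets[i] is the token that follows it,
   so the targets are the inputs shifted by one position.
   Returns 1 on success, or 0 if fewer than B*T + 1 tokens remain. */
int fill_batch(const uint16_t *tokens, size_t n_tokens, size_t start,
               size_t B, size_t T, int *inputs, int *targets) {
    assert(tokens != NULL && inputs != NULL && targets != NULL);
    size_t count = B * T;
    if (start + count + 1 > n_tokens) {
        return 0;  /* the last target would fall past the end of the stream */
    }
    for (size_t i = 0; i < count; i++) {
        inputs[i] = tokens[start + i];
        targets[i] = tokens[start + i + 1];
    }
    return 1;
}
```

A caller would typically advance `start` by B*T after each batch and wrap back to 0 when fill_batch returns 0; that policy is left out to keep the sketch short.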

This project suits readers who want to understand how LLM training works at a low level, without layers of abstraction hiding the details; systems programmers curious about GPU computing and AI; and anyone who wants fast training without framework overhead.

Where it fits

Train GPT-2 or GPT-3 style models on your own data (e.g., Shakespeare corpus) with minimal dependencies.
Understand exactly how language model training works by reading clean, direct C code without framework abstractions.
Run fast GPU-accelerated training on NVIDIA hardware without PyTorch or TensorFlow overhead.
