llama3-from-scratch
llama3 implementation one matrix multiplication at a time
A Jupyter Notebook that rebuilds Meta's Llama 3 language model from scratch in plain Python, walking through every matrix multiplication step so you can see exactly how it works inside.
This repository is a long, hand-walked tutorial that re-implements Meta's Llama 3 language model from scratch, one matrix multiplication at a time. Llama 3 is a large language model: software that takes some text and predicts the next piece of text. Rather than wrapping the model in a black-box library call, the author loads Meta's published Llama 3 weights file directly and reconstructs every step the model takes in plain Python, narrating what happens as the shapes of the tensors change.
The README walks through the pipeline a beginner needs in order to follow how such a model works. First it sets up a tokenizer (the piece that splits text into the numeric tokens the model actually processes), borrowing tiktoken to handle the byte-pair encoding rather than implementing a tokenizer itself. Then it reads the raw model file and inspects its config, which the file itself reports as 32 transformer layers, 32 attention heads, and a vocabulary of 128256 tokens. From there the notebook converts text to tokens, looks up token embeddings, applies RMS normalization, and goes layer by layer through the transformer block, building queries, keys, values, and outputs for each attention head by hand.
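The embedding lookup and RMS normalization steps described above can be sketched without any dependencies. What follows is a minimal illustration using pure Python lists rather than the PyTorch tensors the notebook works with. The toy embedding table, the token IDs, the all-ones weights, and the `eps` value are all assumptions made for the example, not values taken from the Llama 3 weights.

```python
import math


def rms_norm(vector, weights, eps=1e-5):
    """Divide `vector` by its root mean square, then scale by per-dimension weights.

    `eps` is an assumed small constant that guards against division by zero.
    """
    mean_square = sum(x * x for x in vector) / len(vector)
    scale = 1.0 / math.sqrt(mean_square + eps)
    return [w * x * scale for x, w in zip(vector, weights)]


# Toy embedding table: 4 hypothetical token IDs, each mapped to a 3-dimensional vector.
embedding_table = [
    [0.1, 0.2, 0.3],
    [1.0, -1.0, 0.5],
    [3.0, 4.0, 0.0],
    [0.0, 0.0, 2.0],
]

token_ids = [2, 0]  # made-up IDs standing in for real tokenizer output

# An embedding lookup is just row indexing into the table.
embeddings = [embedding_table[t] for t in token_ids]

# All-ones weights (an assumption) so that only the RMS scaling is visible.
norm_weights = [1.0, 1.0, 1.0]
normalized = [rms_norm(e, norm_weights) for e in embeddings]

for token_id, row in zip(token_ids, normalized):
    print(token_id, [round(x, 4) for x in row])
```

After normalization every row has a mean square close to 1, regardless of how large the original embedding was, which is the property the normalization step exists to provide.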
You would read this repository if you already use large language models and want to understand what is actually happening inside one, or if you find it easier to learn by reading numeric code than reading a research paper. It is presented as a Jupyter Notebook and depends on PyTorch and tiktoken, both named in the README. Running it requires downloading the official Llama 3 weights from Meta.
Where it fits
- Walk through every step of a transformer forward pass using real Llama 3 weights to understand how large language models actually work.
- Use it as a hands-on companion for studying attention heads, RMS normalization, and token embeddings, with running code alongside the explanation.
- Adapt the notebook to inspect how specific inputs flow through Llama 3 layers for research or debugging transformer internals.
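
For the attention heads mentioned above, the following is a rough, dependency-free sketch of how one head turns token vectors into queries, keys, values, and an output. It is not the notebook's code: the inputs and the identity weight matrices are made up, the causal mask is an assumption of this sketch, and positional encoding and other details of the real model are left out.

```python
import math


def matmul(a, b):
    """Multiply matrix a (n x k) by matrix b (k x m), both given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(row) for row in zip(*m)]


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_head(x, wq, wk, wv):
    """Single attention head with a causal mask (each token sees itself and earlier tokens)."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    head_dim = len(q[0])
    scores = matmul(q, transpose(k))  # one row of scores per query token
    weights = []
    for i, row in enumerate(scores):
        visible = [s / math.sqrt(head_dim) for s in row[: i + 1]]
        # Future positions get zero weight.
        weights.append(softmax(visible) + [0.0] * (len(row) - i - 1))
    return matmul(weights, v), weights


# Three toy token vectors of width 2, with identity projections (all hypothetical).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
wq, wk, wv = identity, identity, identity

output, weights = attention_head(x, wq, wk, wv)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row shows how much one token attends to every position. The first token can only attend to itself, so its row puts all of its weight on position 0.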