TinyZero
Minimal reproduction of DeepSeek R1-Zero
A minimal research project that recreates the DeepSeek R1-Zero experiment, in which a model learns to reason through reinforcement learning alone, on two math tasks. The authors report that it is reproducible for under $30 using two GPUs.
TinyZero is a research project that recreates a specific AI training experiment, DeepSeek R1-Zero, scaled down so it can run without a massive computing budget. The original DeepSeek R1-Zero showed that an AI model could teach itself to reason through problems step by step purely through trial and error, without first being shown examples of correct reasoning. TinyZero demonstrates the same effect on two math tasks: countdown (combining a given set of numbers with arithmetic to reach a target number) and multiplication.
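To make the countdown task concrete, here is a minimal brute-force sketch of what a valid solution looks like. It is an illustration only, not code from the repository. The function name `solve_countdown`, the rule that every number is used exactly once, and the restriction to left-to-right expressions are all assumptions made for this example.

```python
from fractions import Fraction
from itertools import permutations, product
import operator

# Hypothetical helper for illustration; not part of TinyZero or veRL.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}


def solve_countdown(numbers, target):
    """Return one expression that uses every number once and equals target, or None.

    Only left-to-right groupings such as ((a op b) op c) are searched,
    which keeps the sketch short but does not cover every possible expression.
    """
    for perm in permutations(numbers):
        for ops in product(OPS, repeat=len(perm) - 1):
            value = Fraction(perm[0])  # exact arithmetic avoids float rounding
            expr = str(perm[0])
            valid = True
            for op, n in zip(ops, perm[1:]):
                if op == "/" and n == 0:
                    valid = False  # skip division by zero
                    break
                value = OPS[op](value, Fraction(n))
                expr = f"({expr} {op} {n})"
            if valid and value == target:
                return expr
    return None


if __name__ == "__main__":
    # Prints an expression built from 3, 5 and 7 that evaluates to 22.
    print(solve_countdown([3, 5, 7], 22))
```

A trained model has no such exhaustive solver available; it has to find an expression like this by reasoning in text, which is what makes the task a useful probe of search and self-checking.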
The core idea is reinforcement learning: the model is given a problem, attempts a solution, and is rewarded when the answer is right. No step-by-step worked examples are provided. Over many training rounds, a 3-billion-parameter language model gradually develops what the authors describe as self-verification and search abilities, meaning it starts checking its own work and exploring different approaches before committing to an answer. The authors call the point at which these abilities emerge the "Aha moment," and they say you can reproduce it yourself for under $30 in cloud computing costs.
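The reward signal described above can be sketched as a small rule-based scoring function for the countdown task. This is a hedged illustration, not the repository's implementation: the function name `countdown_reward`, the `<answer>...</answer>` output convention, and the score values (1.0 for a correct answer, 0.1 for a well-formed but wrong one) are assumptions chosen for the example.

```python
import ast
import operator
import re

# Hypothetical reward function for illustration; names and score values are assumptions.
_BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def _evaluate(node, used):
    """Evaluate a parsed arithmetic expression, recording each integer literal in `used`."""
    if isinstance(node, ast.Expression):
        return _evaluate(node.body, used)
    if isinstance(node, ast.Constant) and type(node.value) is int:
        used.append(node.value)
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
        left = _evaluate(node.left, used)
        right = _evaluate(node.right, used)
        return _BIN_OPS[type(node.op)](left, right)
    # Anything else (function calls, names, unary operators) is rejected.
    raise ValueError("unsupported expression")


def countdown_reward(completion, numbers, target, format_score=0.1, full_score=1.0):
    """Score one model completion using only the final answer, not the reasoning."""
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match is None:
        return 0.0  # no parsable answer at all
    used = []
    try:
        tree = ast.parse(match.group(1).strip(), mode="eval")
        value = _evaluate(tree, used)
    except (SyntaxError, ValueError, ZeroDivisionError):
        return format_score  # answer tag present, expression invalid
    if sorted(used) != sorted(numbers):
        return format_score  # must use each given number exactly once
    return full_score if abs(value - target) < 1e-9 else format_score


if __name__ == "__main__":
    good = "<think>3*5=15, 15+7=22</think><answer>3 * 5 + 7</answer>"
    bad = "<think>maybe add them</think><answer>3 + 5 + 7</answer>"
    print(countdown_reward(good, [3, 5, 7], 22))
    print(countdown_reward(bad, [3, 5, 7], 22))
```

Note that the function never inspects the reasoning text. Only the final expression is scored, so any checking or backtracking that appears in the model's output is something it learned because it led to higher rewards, not something the reward asked for directly.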
The project is built on top of an existing training library called veRL and uses Qwen2.5 series base models. Setup involves installing Python dependencies and running shell scripts to prepare data and launch training. Single-GPU training works for smaller models up to 1.5 billion parameters, while larger 3-billion-parameter models require two GPUs and show more meaningful reasoning improvements. The README provides exact commands for both configurations.
Note that, as of the time of archival, the authors have deprecated this repository and recommend using the veRL library directly for new reinforcement learning experiments. TinyZero remains available for reference, and the full training logs from the original experiments are publicly accessible online.
If you want to understand how modern AI reasoning models are trained without reading dense research papers, TinyZero is a concrete, runnable example that walks through the full process from data preparation to training completion.
Where it fits
- Train a small language model to solve math problems through reinforcement learning without providing worked examples.
- Reproduce the DeepSeek R1-Zero self-verification experiment at a fraction of the original computing cost.
- Use as a concrete runnable reference when learning how reinforcement learning is applied to language model training.