slime

Python ★ 7.3k updated 20h ago

slime is an LLM post-training framework for RL Scaling.

A Python framework from Tsinghua University for post-training large language models with reinforcement learning. It powers the GLM model family and supports training Qwen, DeepSeek V3, and Llama 3.

Python · PyTorch · Megatron · SGLang · CUDA · setup: hard · complexity 5/5

slime is a Python framework for post-training large language models using reinforcement learning. Post-training refers to the step that happens after an AI model has been initially trained: you take that model and further improve its behavior using feedback signals, often to make it better at following instructions or reasoning through problems. Reinforcement learning, in this context, means the model is rewarded for producing good outputs and learns to do more of what works.

The framework comes from Tsinghua University and has powered several generations of the GLM model family, including GLM-5.1, GLM-5, and earlier versions. It also supports training Qwen models, DeepSeek V3 models, and Llama 3.

slime connects two underlying systems to do its work. The training side uses Megatron, a library for efficiently training large models across many GPUs. The inference side uses SGLang, a fast serving engine that generates text at scale. Between them sits a data buffer that manages what prompts and generated examples flow into training. This separation means the system can generate new training data and run model updates at the same time, which is more efficient than doing them sequentially.
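
The overlap between generation and training described above can be sketched as a producer and a consumer sharing a bounded queue. This is a toy illustration using only the Python standard library, not slime's actual code or API: `rollout_worker`, `training_worker`, and `run_pipeline` are hypothetical names, the "generation" step is a placeholder string operation standing in for SGLang, and the "update" step simply records each batch where Megatron would run a training step.

```python
import queue
import threading


def rollout_worker(prompts, buffer):
    """Stand-in for the inference side: produce one response per prompt."""
    for prompt in prompts:
        response = prompt.upper()  # placeholder for real text generation
        buffer.put((prompt, response))
    buffer.put(None)  # sentinel: no more samples will arrive


def training_worker(buffer, batch_size, batches_seen):
    """Stand-in for the training side: consume samples in fixed-size batches."""
    batch = []
    while True:
        item = buffer.get()
        if item is None:
            break
        batch.append(item)
        if len(batch) == batch_size:
            batches_seen.append(batch)  # placeholder for a model update
            batch = []
    if batch:
        batches_seen.append(batch)  # flush the final partial batch


def run_pipeline(prompts, batch_size=2):
    """Run generation and training concurrently through a shared buffer."""
    buffer = queue.Queue(maxsize=4)  # bounded, so generation cannot run far ahead
    batches_seen = []
    producer = threading.Thread(target=rollout_worker, args=(prompts, buffer))
    consumer = threading.Thread(
        target=training_worker, args=(buffer, batch_size, batches_seen)
    )
    producer.start()
    consumer.start()
    producer.join()
    consumer.join()
    return batches_seen
```

Calling `run_pipeline(["a", "b", "c"])` returns the batches in the order the training side consumed them. The bounded queue is one simple way to keep the generator from running arbitrarily far ahead of the trainer; a real system would also have to handle weight synchronization between the two sides, which this sketch omits.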

The framework also provides flexible interfaces for custom data generation workflows, so researchers can define their own reward signals or data pipelines without rewriting the core infrastructure.
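
As a rough illustration of what a pluggable reward signal might look like, the sketch below treats a reward as any function from a prompt and a response to a number. The `RewardFn` type, `exact_match_reward`, and `score_samples` are hypothetical names invented for this example and are not taken from slime's interfaces; the repository documentation describes the real extension points.

```python
from typing import Callable, Dict, List, Tuple

# Assumed shape of a reward signal: (prompt, response) -> score.
RewardFn = Callable[[str, str], float]


def exact_match_reward(reference_answers: Dict[str, str]) -> RewardFn:
    """Build a reward that pays 1.0 when the response equals the reference answer."""

    def reward(prompt: str, response: str) -> float:
        expected = reference_answers.get(prompt)
        if expected is None:
            return 0.0  # no reference available for this prompt
        return 1.0 if response.strip() == expected else 0.0

    return reward


def score_samples(
    samples: List[Tuple[str, str]], reward_fn: RewardFn
) -> List[Tuple[str, str, float]]:
    """Attach a reward to each (prompt, response) pair."""
    return [(prompt, response, reward_fn(prompt, response)) for prompt, response in samples]
```

Because the scoring step only depends on the function signature, swapping in a different reward (a unit-test runner, a learned reward model, a length penalty) would not require touching the code that generates or consumes samples.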

Several external projects have been built on top of slime, ranging from physics reasoning models trained entirely through reinforcement learning, to tools for generating optimized GPU kernels, to multi-modal agent training systems. The README links to each of these as examples of what the framework can support.

Documentation and a quick start guide are available in the repository, and contributions are welcome.

Where it fits

Fine-tune a large language model like Llama 3 or Qwen using reinforcement learning with custom reward signals
Build a data generation pipeline for RL-based post-training without rewriting the core training infrastructure
Reproduce or extend the training methodology used for Tsinghua's GLM model series
