accelerate

Python ★ 9.7k updated 1d ago

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support

Accelerate is a Hugging Face library that lets you run the same PyTorch training code on one CPU, multiple GPUs, or a multi-machine cluster by adding about five lines of setup.

PythonPyTorchDeepSpeedFSDPsetup: hardcomplexity 4/5

Accelerate is a Python library from Hugging Face that lets you run the same AI model training code across many different hardware setups without rewriting it. If you have written a training loop in PyTorch (the code that repeatedly feeds data to a model and updates its weights), Accelerate handles the extra complexity of spreading that work across multiple GPUs, multiple machines, or specialized chips called TPUs. You add roughly five lines to your existing code, and the same script then works on a single laptop CPU, a single GPU, or a cluster of machines.

The core idea is that all the tedious setup code for distributed training, like deciding which machine should print logs or how to combine results from multiple GPUs, gets handled behind the scenes. You keep your training loop exactly as you wrote it. Accelerate wraps the model, the optimizer, and the data loader, then takes care of coordination automatically.

It also includes a command-line tool called accelerate config that asks you a few questions about your hardware and saves a configuration file. After that, running accelerate launch my_script.py starts your training with the right settings applied. You do not need to memorize the flags for PyTorch's own distributed launcher.

Accelerate supports mixed precision training, which uses lower-precision numbers (like fp16 or bf16) to speed up computation and reduce memory use. It also works with two popular tools for training very large models: DeepSpeed and PyTorch FSDP. Both allow a single large model to be split across multiple GPUs when it would not otherwise fit in memory.

This library is aimed at developers who want direct control over their training code and do not want a higher-level framework making decisions for them. Several other well-known projects, including Hugging Face Transformers and the Stable Diffusion web UI, use Accelerate internally as their training backend.

Where it fits

Run the same PyTorch training script on a single GPU, multiple GPUs, or a multi-machine cluster without rewriting the code.
Speed up model training using mixed precision (fp16 or bf16) by adding a single configuration flag.
Train very large AI models that don't fit in one GPU's memory using DeepSpeed or PyTorch FSDP to split the model across GPUs.
Add distributed training to an existing PyTorch training loop by wrapping the model, optimizer, and data loader, about five lines of change.

Open on GitHub → Full breakdown on explaingit →