apex

Python ★ 9.0k updated 5d ago

A PyTorch Extension: Tools for easy mixed precision and distributed training in PyTorch

A Python library from NVIDIA that speeds up AI model training on NVIDIA GPUs by using 16-bit mixed-precision math and spreading computation across multiple GPUs or machines.

Python · PyTorch · CUDA · C++ · setup: hard · complexity 4/5

Apex is a collection of tools from NVIDIA that makes training AI models faster and more efficient when using PyTorch on NVIDIA GPUs. PyTorch is a widely used framework for building and training machine learning models, and Apex adds features that NVIDIA developed specifically to get better performance out of its own hardware.

The two main things Apex offers are mixed precision training and distributed training. Mixed precision means the model uses a combination of 16-bit and 32-bit numbers during training instead of always using 32-bit. This reduces memory usage and speeds up computation on modern NVIDIA GPUs, which have dedicated hardware for 16-bit math. Distributed training means spreading the work across multiple GPUs, or even multiple machines, so that large models can be trained faster by parallelizing the computation.
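To make the trade-off above concrete, here is a minimal sketch that uses only Python's standard library, not Apex or PyTorch. It rounds values to IEEE 754 half precision with `struct` to show that 16-bit storage takes half the bytes, and that a small weight update can vanish entirely when it is accumulated in 16-bit. Keeping a 32-bit copy of the weights is shown as one common reason for mixing the two widths. The helper names (`to_fp16`, `to_fp32`, `apply_updates`) and the numbers are illustrative assumptions, not Apex's API.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


# Storage cost per value: 16-bit uses half the bytes of 32-bit.
FP16_BYTES = struct.calcsize("<e")
FP32_BYTES = struct.calcsize("<f")


def apply_updates(weight: float, update: float, steps: int, cast) -> float:
    """Add `update` to `weight` repeatedly, rounding with `cast` after each step."""
    for _ in range(steps):
        weight = cast(weight + update)
    return weight


# Hypothetical scenario: 100 small updates of 0.0001 to a weight of 1.0.
fp16_only = apply_updates(1.0, 1e-4, 100, to_fp16)    # weight stored in 16-bit
fp32_master = apply_updates(1.0, 1e-4, 100, to_fp32)  # weight stored in 32-bit

print("bytes per value:", FP16_BYTES, "vs", FP32_BYTES)
print("weight kept in 16-bit:", fp16_only)
print("weight kept in 32-bit:", fp32_master)
```

In the 16-bit run the weight never moves off 1.0, because the gap between representable half-precision values near 1.0 is about 0.001, which is larger than the 0.0001 update. The 32-bit run ends close to 1.01, which is why mixed-precision schemes typically keep some quantities in 32-bit while doing the bulk of the arithmetic in 16-bit.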

NVIDIA maintains Apex as a place to release optimized utilities quickly, before they might eventually be folded into the main PyTorch project. Some of the code has already been incorporated into PyTorch itself, and more is planned to follow.

Installing Apex requires either a compatible NVIDIA GPU or access to NVIDIA's pre-built container images. The full-performance version compiles custom C++ and CUDA extensions during installation, which requires a working CUDA toolkit. A simpler Python-only install is also available but runs slower because it skips the low-level compiled components.
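Before attempting the compiled install, it can help to check whether the pieces mentioned above are present. The following is a rough preflight sketch, not part of Apex: `check_build_prerequisites` is a hypothetical helper, and looking for `nvcc` (the CUDA compiler) on the `PATH` is only a heuristic, since a toolkit can be installed without being on the `PATH`.

```python
import importlib.util
import shutil


def check_build_prerequisites() -> dict:
    """Report whether common prerequisites appear to be available.

    This only inspects the local environment; it does not build or import
    anything, and a "no" for nvcc does not prove the CUDA toolkit is absent.
    """
    return {
        # PyTorch must be importable for Apex to be of any use.
        "torch_installed": importlib.util.find_spec("torch") is not None,
        # nvcc ships with the CUDA toolkit and is needed to compile CUDA code.
        "nvcc_on_path": shutil.which("nvcc") is not None,
        # True if some version of Apex is already installed.
        "apex_installed": importlib.util.find_spec("apex") is not None,
    }


report = check_build_prerequisites()
for name, ok in report.items():
    print(f"{name}: {'yes' if ok else 'no'}")
```

If `nvcc_on_path` comes back as "no", the Python-only install or NVIDIA's pre-built containers described above are the likelier routes.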

This is a developer-facing library used during model training, not an end-user application. It is primarily useful for research teams or engineers who are training large neural networks and want to reduce training time or train models that would not otherwise fit in GPU memory.

Where it fits

Reduce memory usage and shorten training time for a large PyTorch model by switching to mixed precision without rewriting your training loop.
Train a neural network across multiple GPUs or machines to finish in hours instead of days.
Fit a model that was previously too large to train at full 32-bit precision into GPU memory.
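
The multi-GPU use case above is commonly implemented as data parallelism: each worker computes gradients on its own slice of a batch, and the results are averaged before the weights are updated. The sketch below simulates that idea in plain Python, assuming a one-parameter linear model and equally sized shards. `grad_mse` and `shard` are hypothetical helpers written for this illustration; no real GPUs or Apex calls are involved.

```python
def grad_mse(w: float, xs: list, ys: list) -> float:
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def shard(seq: list, n_workers: int) -> list:
    """Split `seq` into `n_workers` equal slices (length must divide evenly)."""
    size = len(seq) // n_workers
    return [seq[i * size:(i + 1) * size] for i in range(n_workers)]


# Toy data following y = 2x, and a deliberately wrong starting weight.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x for x in xs]
w = 0.5

# Single-worker baseline: gradient over the whole batch.
full_gradient = grad_mse(w, xs, ys)

# Simulated four-worker run: each worker sees a quarter of the batch.
local_gradients = [
    grad_mse(w, shard_x, shard_y)
    for shard_x, shard_y in zip(shard(xs, 4), shard(ys, 4))
]

# Averaging the per-worker gradients stands in for the cross-GPU reduction.
averaged_gradient = sum(local_gradients) / len(local_gradients)

print("full-batch gradient:", full_gradient)
print("averaged worker gradient:", averaged_gradient)
```

With equally sized shards the averaged gradient matches the full-batch gradient, so each worker only has to process a fraction of the data per step. In a real multi-GPU run the averaging step also costs communication time, which this sketch ignores.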
