Megatron-LM

Python ★ 17k updated 15h ago

Ongoing research training transformer models at scale

NVIDIA's Python library for training very large AI language models, from 2 billion to hundreds of billions of parameters, across thousands of GPUs simultaneously, using advanced parallelism built for research and production scale.

Python · PyTorch · CUDA · setup: hard · complexity 5/5

Megatron-LM is a GPU-optimized Python library from NVIDIA for training very large transformer models, the architecture behind modern large language models. It is aimed at research teams and ML engineers who need to train models ranging from 2 billion to hundreds of billions of parameters across thousands of GPUs at once.

The repository contains two main components. Megatron-LM is the higher-level reference implementation with pre-configured training scripts, useful for learning or experimentation. Megatron Core is the lower-level, composable library that framework developers can use to build custom training pipelines.

The core technical challenge it solves is distributing model training efficiently across many GPUs. It does this by combining several parallelism strategies: tensor parallelism (splitting individual operations across GPUs), pipeline parallelism (splitting model layers across GPUs), and data parallelism (running copies of the same model on different data batches in parallel). It also supports mixed-precision training, which uses lower-precision number formats such as BF16 and FP8 to speed up computation. According to the benchmarks reported in the README, it reaches up to 47% Model FLOP Utilization (MFU, the share of the hardware's peak compute actually spent on model math) on H100 GPU clusters, in tests that go up to a 462-billion-parameter model on 6,144 GPUs.
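To make the three parallelism strategies concrete, the sketch below shows how a fixed pool of GPUs can be factored into tensor-, pipeline-, and data-parallel groups. It is plain Python arithmetic, not Megatron-LM's API: the function names are hypothetical, and the rank ordering (tensor-parallel index varying fastest) is an illustrative assumption rather than a statement of how Megatron-LM lays out its process groups.

```python
def data_parallel_size(world_size: int, tensor_parallel: int, pipeline_parallel: int) -> int:
    """Return how many model replicas fit on world_size GPUs.

    Each replica occupies tensor_parallel * pipeline_parallel GPUs,
    so the remaining factor of world_size is the data-parallel size.
    """
    gpus_per_replica = tensor_parallel * pipeline_parallel
    if world_size % gpus_per_replica != 0:
        raise ValueError(
            f"world_size={world_size} is not divisible by "
            f"tensor_parallel * pipeline_parallel = {gpus_per_replica}"
        )
    return world_size // gpus_per_replica


def rank_coordinates(rank: int, tensor_parallel: int, pipeline_parallel: int) -> tuple[int, int, int]:
    """Map a flat GPU rank to (data, pipeline, tensor) indices.

    Assumed ordering (illustrative only): the tensor-parallel index varies
    fastest, then the pipeline stage, then the data-parallel replica.
    """
    tp = rank % tensor_parallel
    pp = (rank // tensor_parallel) % pipeline_parallel
    dp = rank // (tensor_parallel * pipeline_parallel)
    return dp, pp, tp


if __name__ == "__main__":
    # Hypothetical cluster: 16 GPUs, each layer split 2 ways, layers split into 4 stages.
    dp_size = data_parallel_size(world_size=16, tensor_parallel=2, pipeline_parallel=4)
    print("data-parallel replicas:", dp_size)
    for rank in range(16):
        print(rank, rank_coordinates(rank, tensor_parallel=2, pipeline_parallel=4))
```

In this hypothetical 16-GPU layout, each replica of the model spans 8 GPUs (2 tensor-parallel shards in each of 4 pipeline stages), which leaves room for 2 replicas that process different data batches.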

You would use Megatron-LM if you are training or fine-tuning large language models at research or production scale and need tooling designed to work across large GPU clusters.

Where it fits

Train a custom large language model with hundreds of billions of parameters across a multi-GPU cluster
Build a custom training or fine-tuning pipeline on top of Megatron Core, the composable library aimed at framework developers
Benchmark GPU cluster efficiency for LLM training using tensor, pipeline, and data parallelism together
Test FP8 and BF16 mixed-precision training to speed up compute on H100 GPU hardware
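For the benchmarking use case, Model FLOP Utilization can be estimated from measured training throughput. The sketch below is a rough approximation, not Megatron-LM's own accounting: it assumes the common rule of thumb of about 6 FLOPs per parameter per trained token (forward plus backward pass, ignoring attention-specific cost), and it takes the per-GPU peak throughput as an input that you supply from your hardware's datasheet. The function name and all numbers in the usage example are hypothetical.

```python
def model_flops_utilization(
    num_params: float,
    tokens_per_second: float,
    num_gpus: int,
    peak_flops_per_gpu: float,
) -> float:
    """Estimate MFU as achieved model FLOP/s divided by the cluster's peak FLOP/s.

    Assumption: training costs roughly 6 FLOPs per parameter per token
    (about 2 for the forward pass and 4 for the backward pass). Attention
    FLOPs and any recomputation are ignored.
    """
    if num_gpus <= 0 or peak_flops_per_gpu <= 0:
        raise ValueError("num_gpus and peak_flops_per_gpu must be positive")
    achieved_flops_per_second = 6.0 * num_params * tokens_per_second
    cluster_peak_flops_per_second = num_gpus * peak_flops_per_gpu
    return achieved_flops_per_second / cluster_peak_flops_per_second


if __name__ == "__main__":
    # Hypothetical measurement: a 1B-parameter model training at 100k tokens/s
    # on 8 GPUs, each with an assumed peak of 1e15 FLOP/s.
    mfu = model_flops_utilization(
        num_params=1e9,
        tokens_per_second=100_000,
        num_gpus=8,
        peak_flops_per_gpu=1e15,
    )
    print(f"estimated MFU: {mfu:.1%}")
```

Because the estimate counts only the FLOPs the model mathematically requires, extra work such as activation recomputation lowers the measured tokens per second without raising the numerator, which is what makes MFU useful for comparing parallelism and precision settings on the same cluster.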
