DiffusionOPD

Python ★ 107 updated 23d ago

DiffusionOPD: A Unified Perspective of On-Policy Distillation in Diffusion Models

DiffusionOPD is a research codebase from Alibaba's visual AI lab that introduces a new training method for image generation models. Specifically, it addresses the challenge of making a single image generation model perform well across multiple quality criteria at the same time, such as visual aesthetics, text accuracy in generated images, and whether generated scenes contain the right objects.

The core problem it solves is interference between goals: training an image model to improve on one metric can hurt its performance on another. DiffusionOPD handles this by splitting training into two stages. In the first stage, specialist teacher models are trained independently, each focused on one quality dimension. In the second stage, a single student model learns from all of those teachers at once. The student generates images using its own current weights, and at each step of the generation process its prediction is compared against what each teacher would have predicted at that same step.

The technical contribution is adapting on-policy distillation, a training idea from language model research in which a model learns from its own outputs rather than from fixed examples, to diffusion models. Diffusion models generate images through a gradual denoising process rather than producing tokens one at a time, so the idea does not carry over directly. The paper derives a mathematical objective for this setting that avoids some of the noise and instability common in other reinforcement learning approaches for image generation.
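
To make the second stage concrete, here is a minimal toy sketch of on-policy distillation from several teachers. It is not the DiffusionOPD objective or code, which this summary does not spell out. Everything in it is an assumption made for illustration: the "models" are scalar linear predictors, the per-step loss is a plain squared difference averaged over teachers, and the names `velocity`, `rollout`, `distill_step`, and `train` are hypothetical.

```python
import random

def velocity(params, x):
    """Toy stand-in for a denoiser: a linear prediction w * x + b."""
    w, b = params
    return w * x + b

def rollout(student, x0, num_steps):
    """Generate a trajectory with the student's own predictions (on-policy)."""
    dt = 1.0 / num_steps
    states = []
    x = x0
    for _ in range(num_steps):
        states.append(x)
        x = x + dt * velocity(student, x)
    return states

def distill_step(student, teachers, x0, num_steps=4, lr=0.05):
    """One update: match the student to every teacher on student-visited states."""
    states = rollout(student, x0, num_steps)
    w, b = student
    grad_w = 0.0
    grad_b = 0.0
    n = len(states) * len(teachers)
    # Visited states are treated as constants: no gradient flows through the rollout.
    for x in states:
        v_student = velocity(student, x)
        for teacher in teachers:
            diff = v_student - velocity(teacher, x)
            # Gradient of the assumed squared-difference loss, averaged over n terms.
            grad_w += 2.0 * diff * x / n
            grad_b += 2.0 * diff / n
    return (w - lr * grad_w, b - lr * grad_b)

def train(teachers, steps=2000, seed=0):
    """Repeat: sample a start point, roll out the student, take one distillation step."""
    rng = random.Random(seed)
    student = (0.0, 0.0)
    for _ in range(steps):
        x0 = rng.uniform(-2.0, 2.0)  # stand-in for initial noise
        student = distill_step(student, teachers, x0)
    return student

if __name__ == "__main__":
    # Two hypothetical "specialist" teachers that disagree with each other.
    teachers = [(-1.0, 0.5), (-0.5, -0.5)]
    print(train(teachers))
```

In this toy setup the teachers disagree on every input, so the student settles on a compromise between them. The point it illustrates is structural: the states used for supervision come from the student's own rollouts, not from a fixed dataset or from the teachers' trajectories.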

In the results shown in the README, the method outperforms baselines that either try to optimize multiple rewards simultaneously from the start or chain multiple training stages together, across evaluations for aesthetics, text rendering in images, and compositional image generation.

Setting up the project requires downloading several pretrained models including Stable Diffusion 3.5 and three teacher models hosted on HuggingFace, along with a collection of reward model checkpoints. The environment setup is substantial, involving multiple packages for different reward functions. Training is designed for multi-GPU setups with 8 GPUs as the default configuration. The code is based on an earlier research codebase called DiffusionNFT.

This is a research project accompanying an academic paper and is aimed at machine learning researchers working on image generation alignment.
