diffusion-bench

Python ★ 33 updated 17d ago

Towards Holistic Evaluation of Generative Diffusion Transformers!

A research toolkit with a single consistent interface for training and evaluating diffusion transformer image-generation models on ImageNet class generation and text-to-image tasks.

Python · PyTorch · diffusion transformers · VAE · setup: hard · complexity 4/5

DiffusionBench is a research toolkit for training and evaluating AI image-generation models, specifically a category of models called diffusion transformers. The term describes both halves of the design: "diffusion" means the model generates an image by starting from noise and gradually refining it, and "transformer" refers to the architecture, which is borrowed from language models. If you have ever seen tools like Stable Diffusion or FLUX generate an image from a text prompt, those are the kinds of models this repository is built to study.

The codebase provides a single, consistent interface for running experiments across two broad tasks. The first is ImageNet generation, where the model learns to produce images belonging to specific categories (dogs, chairs, etc.) given a class label. The second is text-to-image generation, where the model takes a written description and produces a matching image. Having both tasks in one place means researchers can swap a configuration file and run the same training or evaluation code on either task without rewriting anything.
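
The config-swap idea can be sketched in a few lines of Python. This is only an illustration of the pattern, not the repository's actual interface: `ExperimentConfig`, `CONFIGS`, `describe_conditioning`, and `run` are hypothetical names, and the metric lists simply mirror the ones named on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentConfig:
    """Hypothetical experiment config; field names are assumptions."""
    task: str        # "imagenet" or "text_to_image"
    encoder: str     # which representation encoder to use
    metrics: tuple   # metrics to report for this task


# Two configs sharing one code path; only the data in them differs.
CONFIGS = {
    "imagenet": ExperimentConfig("imagenet", "vae", ("fid", "is")),
    "text_to_image": ExperimentConfig("text_to_image", "vae", ("geneval", "dpgbench")),
}


def describe_conditioning(config, sample):
    """Pick the conditioning signal the task calls for."""
    if config.task == "imagenet":
        return f"class label {sample['label']}"
    if config.task == "text_to_image":
        return f"prompt {sample['prompt']!r}"
    raise ValueError(f"unknown task: {config.task}")


def run(config_name, sample):
    """Same entry point for both tasks; the config decides the behavior."""
    config = CONFIGS[config_name]
    return {
        "task": config.task,
        "conditioning": describe_conditioning(config, sample),
        "metrics": list(config.metrics),
    }


print(run("imagenet", {"label": 207}))
print(run("text_to_image", {"prompt": "a red chair"}))
```

The point of the design is that `run` never changes: switching tasks means choosing a different config entry, which is the property the toolkit is described as offering.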

Training happens in two stages. The first stage trains a component called an RAE tokenizer, which compresses images into a compact representation that the main model can work with more efficiently. The second stage trains the actual diffusion model on top of that representation, or on an alternative representation such as VAE latents. The repository supports over 30 different representation encoders and a range of transport and prediction methods, giving researchers many combinations to compare.
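
A toy version of the two stages may help make the split concrete. Everything here is a stand-in chosen for brevity: a PCA projection plays the role of the tokenizer, and the second stage only builds training pairs using a linear noise-to-data path with a velocity target, which is one common transport and prediction choice and not necessarily the one the repository uses. The function names `fit_tokenizer` and `make_training_batch` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)


def fit_tokenizer(images, latent_dim):
    """Stage 1 stand-in: a PCA projection acting as the tokenizer."""
    mean = images.mean(axis=0)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(images - mean, full_matrices=False)
    basis = vt[:latent_dim]

    def encode(x):
        return (x - mean) @ basis.T

    def decode(z):
        return z @ basis + mean

    return encode, decode


def make_training_batch(latents, rng):
    """Stage 2 stand-in: noisy inputs and velocity targets in latent space."""
    noise = rng.standard_normal(latents.shape)
    t = rng.uniform(size=(latents.shape[0], 1))   # one timestep per sample
    noisy = (1.0 - t) * latents + t * noise       # linear path from data to noise
    target = noise - latents                      # velocity along that path
    return noisy, t, target


images = rng.standard_normal((64, 48))            # 64 fake images, flattened
encode, decode = fit_tokenizer(images, latent_dim=8)
latents = encode(images)                          # compact representation
noisy, t, target = make_training_batch(latents, rng)
print(latents.shape, noisy.shape, target.shape)
```

A real diffusion model would be trained to predict `target` from `noisy` and `t`; the sketch stops at constructing the batch because that is where the two stages meet.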

Evaluation is also built in. During training, quality metrics are computed automatically. For standalone testing of a released checkpoint, a separate set of configuration files handles the setup so researchers do not need to manually wire the weights to the evaluation scripts. The metrics used vary by task: FID (Fréchet Inception Distance) and IS (Inception Score) for ImageNet, and benchmarks like GenEval and DPGBench for text-to-image.
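
For readers unfamiliar with FID, the core computation is the Fréchet distance between two Gaussians fitted to feature vectors: FID = ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2 (S_r S_g)^(1/2)), where mu and S are the mean and covariance of real and generated features. The sketch below implements that formula with NumPy on random arrays. It is not the repository's evaluation code, and real FID uses features from an Inception network, which this example omits; `frechet_distance` is a hypothetical name.

```python
import numpy as np


def frechet_distance(feats_a, feats_b):
    """Fréchet distance between Gaussians fitted to two feature sets."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((cov_a cov_b)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of the product, which are real and non-negative for
    # covariance matrices; clip guards against tiny negative round-off.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = np.sum((mu_a - mu_b) ** 2)
    return float(mean_term + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)


rng = np.random.default_rng(0)
real = rng.standard_normal((500, 4))   # stand-in for real-image features
shifted = real + 1.0                   # same spread, mean moved by 1 per dimension

print(frechet_distance(real, real))
print(frechet_distance(real, shifted))
```

Because `shifted` has the same covariance as `real`, only the mean term contributes, so the second distance is the squared length of the shift. Lower values mean the generated features are statistically closer to the real ones.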

The project is designed to be extended and welcomes outside contributions. It notes compatibility with coding agents and with an AutoResearch workflow on a separate branch, suggesting the authors intend it as a shared platform for the research community rather than a finished product.

Where it fits

Train a diffusion transformer on ImageNet generation using the two-stage RAE tokenizer and diffusion model pipeline.
Evaluate an existing text-to-image checkpoint on GenEval and DPGBench without rewriting evaluation scripts.
Swap a configuration file to compare over 30 different representation encoders under identical training conditions.
