latent-diffusion
High-Resolution Image Synthesis with Latent Diffusion Models
The original 2021 research code for Latent Diffusion Models, the technique that became Stable Diffusion. It covers high-quality text-to-image generation, inpainting, and super-resolution.
Latent Diffusion Models (LDM) is a research repository from 2021-2022 that introduced the core technique behind Stable Diffusion: generating high-resolution images by running the diffusion process in a compressed latent space rather than directly on pixels. An autoencoder first compresses each image into a much smaller representation, so the diffusion model works on far fewer values and can produce detailed images more efficiently than earlier pixel-space diffusion approaches.
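To make the efficiency argument concrete, the sketch below counts how many values a model has to process in pixel space versus latent space. The downsampling factor of 8 and the 4 latent channels are illustrative assumptions rather than values stated in this description, and the function names are hypothetical.

```python
def latent_shape(height, width, downsample_factor=8, latent_channels=4):
    """Return (channels, height, width) of the latent for a given image size.

    downsample_factor and latent_channels are assumed example values.
    """
    if height % downsample_factor or width % downsample_factor:
        raise ValueError("image sides must be divisible by the downsampling factor")
    return (latent_channels, height // downsample_factor, width // downsample_factor)


def compression_ratio(height, width, image_channels=3, **latent_kwargs):
    """Ratio of pixel-space values to latent-space values."""
    channels, lat_h, lat_w = latent_shape(height, width, **latent_kwargs)
    return (image_channels * height * width) / (channels * lat_h * lat_w)


shape = latent_shape(256, 256)       # (4, 32, 32)
ratio = compression_ratio(256, 256)  # 196608 / 4096 = 48.0
print(shape, ratio)
```

Under these assumptions, a 256x256 RGB image shrinks to a 4x32x32 latent, so each diffusion step touches 48 times fewer values than it would on raw pixels.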
The repository contains the research code and pre-trained weights from the paper "High-Resolution Image Synthesis with Latent Diffusion Models" by researchers at Ludwig Maximilian University of Munich and Heidelberg University. It supports several tasks: text-to-image generation (type a prompt, get an image), class-conditional image synthesis (generate images of specific ImageNet categories), image inpainting (fill in masked regions of a photo), super-resolution, and image-to-image translation.
The largest pre-trained model available has 1.45 billion parameters and was trained on LAION-400M, a large collection of image-text pairs scraped from the web. A web demo of this model was made available on Hugging Face Spaces. The repository also includes a variant called Retrieval-Augmented Diffusion Models (RDMs), which conditions image generation on visually similar images retrieved from a database such as OpenImages or ArtBench, in addition to a text prompt.
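The retrieval step in RDMs can be pictured as a nearest-neighbor lookup over image embeddings. The following is a toy sketch only: the three-dimensional vectors, the database entries, and the function names are all made up for illustration, and the repository's actual embedding model and search index are not described here.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve_neighbors(query, database, k=2):
    """Return the names of the k database entries most similar to the query."""
    ranked = sorted(
        database,
        key=lambda name: cosine_similarity(query, database[name]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical embeddings standing in for a real image database.
database = {
    "sunset_beach": [0.9, 0.1, 0.0],
    "city_night": [0.1, 0.9, 0.2],
    "sunset_hills": [0.8, 0.2, 0.1],
}

neighbors = retrieve_neighbors([1.0, 0.0, 0.0], database, k=2)
print(neighbors)
```

In a retrieval-augmented setup, the embeddings of the returned neighbors would then be passed to the generator as extra conditioning alongside the text prompt.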
Setup requires a conda environment and model checkpoint files that are downloaded separately. Several Python scripts handle the different generation tasks, and Jupyter notebooks are included as examples. Sampling speed and image quality can be tuned through flags such as ddim_steps (the number of DDIM sampling steps) and scale (the guidance scale). The repository was published alongside the academic paper and includes a BibTeX citation for use in research. This code predates Stable Diffusion, which is a later refinement of the same underlying technique.
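The effect of the ddim_steps and scale flags can be sketched in a few lines. The code below is a simplified illustration under two assumptions: that scale acts as a classifier-free guidance weight, and that ddim_steps picks an evenly spaced subset of the training timesteps. The function names are hypothetical and this is not the repository's sampler.

```python
def guided_noise(eps_uncond, eps_cond, scale):
    """Blend unconditional and prompt-conditioned noise predictions.

    scale = 1.0 returns the conditional prediction unchanged; larger
    values push the result further toward the prompt.
    """
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def uniform_timesteps(num_train_steps, num_sample_steps):
    """Pick evenly spaced timesteps out of the full training schedule.

    When num_train_steps is not divisible by num_sample_steps, the
    result can contain slightly more entries than requested.
    """
    if not 1 <= num_sample_steps <= num_train_steps:
        raise ValueError("num_sample_steps must be between 1 and num_train_steps")
    stride = num_train_steps // num_sample_steps
    return list(range(0, num_train_steps, stride))


steps = uniform_timesteps(1000, 50)                   # 50 timesteps: 0, 20, ..., 980
blended = guided_noise([0.0, 1.0], [1.0, 3.0], 2.0)   # [2.0, 5.0]
print(len(steps), steps[:3], blended)
```

Fewer sampling steps mean fewer network evaluations and faster generation, usually at some cost in detail. A higher scale tends to follow the prompt more closely while reducing variety between samples.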
Where it fits
- Generate images from text prompts using the 1.45-billion-parameter pre-trained model
- Fill in masked regions of a photo using the inpainting model
- Upscale low-resolution images to higher resolution using the super-resolution model
- Run retrieval-augmented image generation by conditioning on visually similar images from a database