stable-diffusion
A latent text-to-image diffusion model
The original research code for Stable Diffusion, an AI model that generates images from text prompts using latent diffusion, built for researchers and developers, not casual end users.
Given a written prompt such as "a photograph of an astronaut riding a horse", the model produces a realistic or artistic image that matches the description. The core problem it solves is turning natural language into visual output, which is useful in art, design, prototyping, and creative exploration.
The model uses a technique called latent diffusion. Rather than operating on full-size pixel images, it uses an autoencoder to compress images into a smaller representation, called a latent, and runs the diffusion process in that compressed space. At generation time, diffusion starts from random noise and refines it step by step, guided by a text encoder (CLIP ViT-L/14) that turns the written prompt into embeddings the denoising network is conditioned on. The final latent is then decoded back into a 512x512 pixel image. Working in latent space is more computationally efficient than working on raw pixels, which is why the model can run on consumer GPUs with at least 10GB of video memory.
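The size reduction can be made concrete with a little arithmetic. The sketch below assumes the configuration commonly described for Stable Diffusion v1: an autoencoder with a downsampling factor of 8 and 4 latent channels. Those two numbers, and the helper names `latent_shape` and `compression_ratio`, are illustrative assumptions rather than values read from this repository's code.

```python
def latent_shape(height, width, factor=8, channels=4):
    """Return the (channels, height, width) shape of the latent for an image.

    `factor` and `channels` are assumed defaults for Stable Diffusion v1.
    """
    if height % factor or width % factor:
        raise ValueError("image sides must be divisible by the downsampling factor")
    return (channels, height // factor, width // factor)


def compression_ratio(height, width, factor=8, channels=4, image_channels=3):
    """Ratio of values in the RGB image to values in its latent."""
    c, h, w = latent_shape(height, width, factor, channels)
    return (image_channels * height * width) / (c * h * w)


shape = latent_shape(512, 512)
ratio = compression_ratio(512, 512)
print(shape)  # (4, 64, 64)
print(ratio)  # 48.0
```

Under these assumptions the denoising network processes about 48 times fewer values per step than it would on the raw 512x512 RGB image, which is where most of the efficiency gain comes from.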
This repository is aimed at researchers and technically experienced developers who want to run text-to-image generation locally, experiment with the model weights, or study how latent diffusion models work. It is not a polished user-facing application; it is a research artifact made up of Python command-line scripts and Jupyter notebooks, built on PyTorch and CLIP. End users looking for a friendlier experience would typically use the model through a tool like Hugging Face Diffusers instead. Model weights are distributed separately via Hugging Face under a license that permits commercial use but includes responsible-use conditions.
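For readers who take the Diffusers route rather than this repository's scripts, a minimal sketch might look like the following. It assumes the `diffusers` and `torch` packages are installed, that the weights are available on Hugging Face under the model ID `CompVis/stable-diffusion-v1-4`, and that a CUDA GPU is present; the function name `generate_image` is hypothetical and not part of either project.

```python
def generate_image(prompt, model_id="CompVis/stable-diffusion-v1-4", device="cuda"):
    """Generate one image for `prompt` using Hugging Face Diffusers.

    The model ID and device are assumptions; adjust them to your setup.
    """
    # Imported lazily so that defining this function does not require
    # the heavy dependencies to be installed.
    import torch
    from diffusers import StableDiffusionPipeline

    # Half precision reduces GPU memory use; fall back to float32 elsewhere.
    dtype = torch.float16 if device == "cuda" else torch.float32
    pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=dtype)
    pipe = pipe.to(device)

    # The pipeline output exposes a list of PIL images under `.images`.
    return pipe(prompt).images[0]
```

Calling `generate_image("a photograph of an astronaut riding a horse").save("astronaut.png")` would download the weights on first use and write the result to disk. The research scripts in this repository expose more of the sampling internals, which is the main reason to prefer them for study.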
Where it fits
- Run text-to-image generation locally on a GPU to produce images from written prompts for research or creative experiments.
- Study how latent diffusion models work by reading and modifying the sampling and training code directly.
- Experiment with the pretrained model weights to understand how text prompts influence image generation output.