DALLE2-pytorch
Implementation of DALL-E 2, OpenAI's updated text-to-image synthesis neural network, in PyTorch
DALL-E 2 is OpenAI's system for generating images from text descriptions: you type a sentence like "a red fox sitting on a stack of books" and the model produces a picture matching that description. This repository is an independent open-source reimplementation of that system using PyTorch, a widely used machine learning framework. It is not affiliated with OpenAI; it was created by an independent researcher and built out through contributions from the LAION community, a nonprofit group working on open datasets and models.
Training the full system happens in three sequential stages. First, you either train or reuse a CLIP model, which learns a shared embedding space for text and images from large numbers of image-caption pairs. Second, you train a diffusion prior network that sits between the text encoder and the image generator. It takes the CLIP text embedding of a prompt and predicts the corresponding CLIP image embedding, a vector that describes the target image before any pixels are drawn. Third, a decoder network learns to convert that image embedding into actual pixels. Optional upsampler networks can then raise the output to a higher resolution.
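The way these stages chain together at generation time can be sketched with a toy example. This is only an illustration of the data flow, not the library's API: every function name here (clip_text_encoder, diffusion_prior, decoder, upsampler, generate) and the EMBED_DIM constant are hypothetical stand-ins, and each stage is replaced by a trivial deterministic transform instead of a trained network.

```python
import hashlib
import random
from typing import List

EMBED_DIM = 8  # hypothetical embedding size, kept tiny for readability

Vector = List[float]
Image = List[List[float]]


def _seeded_vector(seed_text: str, dim: int) -> Vector:
    """Deterministic pseudo-random vector so the sketch is reproducible."""
    digest = hashlib.sha256(seed_text.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def clip_text_encoder(prompt: str) -> Vector:
    """Stage 1 stand-in: map a prompt to a text embedding."""
    return _seeded_vector(prompt, EMBED_DIM)


def diffusion_prior(text_embed: Vector) -> Vector:
    """Stage 2 stand-in: predict an image embedding from a text embedding.

    A real prior runs iterative denoising; a fixed scaling marks the step here.
    """
    return [0.5 * x for x in text_embed]


def decoder(image_embed: Vector, size: int = 4) -> Image:
    """Stage 3 stand-in: turn an image embedding into a size x size pixel grid."""
    n = len(image_embed)
    return [[image_embed[(r * size + c) % n] for c in range(size)]
            for r in range(size)]


def upsampler(image: Image, factor: int = 2) -> Image:
    """Optional stand-in: nearest-neighbour upsampling by an integer factor."""
    height, width = len(image), len(image[0])
    return [[image[r // factor][c // factor] for c in range(width * factor)]
            for r in range(height * factor)]


def generate(prompt: str, upsample: bool = True) -> Image:
    """Chain the stages: text -> text embedding -> image embedding -> pixels."""
    text_embed = clip_text_encoder(prompt)
    image_embed = diffusion_prior(text_embed)
    image = decoder(image_embed)
    return upsampler(image) if upsample else image


if __name__ == "__main__":
    image = generate("a red fox sitting on a stack of books")
    print(len(image), "x", len(image[0]))
```

The point of the sketch is the interface between stages: the prior consumes and produces embeddings only, so the prior and the decoder can be trained and swapped independently, which is why a pre-trained prior can be reused without retraining the decoder.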
Pre-trained checkpoints for the prior are available from the LAION community on Hugging Face, so you do not have to run all three training stages from scratch. The LAION group has confirmed scaling training to 800 GPUs using the scripts in this repository, and several independent researchers have verified that both the prior and decoder components work correctly.
This is a research tool rather than a consumer application. Using it requires writing Python code, access to GPU hardware, and familiarity with machine learning training workflows. The library installs via pip (pip install dalle2-pytorch), and the README contains detailed code examples for each training stage.