imagen-pytorch

Python ★ 8.4k updated 1y ago

Implementation of Imagen, Google's Text-to-Image Neural Network, in Pytorch

Imagen is a Google research project that can generate images from written text descriptions. You type something like "a whale breaching from afar" and the system produces a matching image. This repository is an open-source Python implementation of that system, built using PyTorch, a popular framework for machine learning.

The underlying approach works by starting with a text description, converting it into numerical representations using a large language model called T5, and then using those representations to guide a noise-removal process that gradually builds a photorealistic image. The system uses multiple image-generating networks chained together: the first creates a small, rough image, and later ones increase its resolution and add fine detail.
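The cascade described above can be sketched in a few lines of plain Python. This is a toy illustration only, not the repository's code: `toy_text_embedding` is a hypothetical stand-in for the T5 encoder, `toy_denoise` stands in for a trained denoising network, and the image sizes are made up for the example. It shows the control flow: encode the text, denoise a small image from noise, then upsample and denoise again at each higher resolution.

```python
import random


def toy_text_embedding(prompt, dim=4):
    # Hypothetical stand-in for a T5 encoder: deterministic numbers in [0, 1]
    # derived from the prompt's characters. A real encoder is a neural network.
    return [(sum(ord(c) for c in prompt[i::dim]) % 256) / 255.0 for i in range(dim)]


def upsample_nearest(image, factor):
    # Repeat every pixel `factor` times horizontally and vertically.
    return [
        [px for px in row for _ in range(factor)]
        for row in image
        for _ in range(factor)
    ]


def toy_denoise(size, target, steps, rng):
    # Stand-in for a trained denoiser: start from Gaussian noise and, at each
    # step, remove an equal share of the remaining gap to the conditioning target.
    image = [[rng.gauss(0.0, 1.0) for _ in range(size)] for _ in range(size)]
    for step in range(steps):
        remaining = steps - step
        image = [
            [px + (t - px) / remaining for px, t in zip(row, target_row)]
            for row, target_row in zip(image, target)
        ]
    return image


def run_cascade(prompt, image_sizes=(8, 16, 32), steps=10, seed=0):
    rng = random.Random(seed)
    embedding = toy_text_embedding(prompt)
    brightness = sum(embedding) / len(embedding)

    # Stage 1: the base stage builds a small image guided only by the text.
    base_size = image_sizes[0]
    base_target = [[brightness] * base_size for _ in range(base_size)]
    image = toy_denoise(base_size, base_target, steps, rng)
    outputs = [image]

    # Later stages: upsample the previous output and denoise at the new size,
    # conditioned on that upsampled low-resolution image.
    for size in image_sizes[1:]:
        conditioning = upsample_nearest(image, size // len(image))
        image = toy_denoise(size, conditioning, steps, rng)
        outputs.append(image)
    return outputs


if __name__ == "__main__":
    stages = run_cascade("a whale breaching from afar")
    print([f"{len(s)}x{len(s[0])}" for s in stages])
```

In a real cascade each stage is a separately trained network, and the super-resolution stages add detail that the low-resolution input lacks; the toy denoiser here only reproduces its conditioning signal.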

The README includes working Python code examples showing how to set up the networks, connect them in a cascade, feed in images and text captions during training, and then sample new images from text prompts. A helper class called ImagenTrainer handles bookkeeping tasks such as maintaining exponential moving averages of the model weights across training steps. For larger training runs, the project relies on a separate library, Hugging Face Accelerate, to distribute work across multiple GPUs and machines.
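The moving averages mentioned above are, in most training code, exponential moving averages (EMA) of the model weights: a smoothed copy of the parameters that is often used for sampling instead of the raw, noisier training weights. The sketch below shows the general technique in plain Python. The class name `ToyEMA`, the dictionary of named scalar parameters, and the decay value are assumptions made for illustration; they are not the ImagenTrainer API.

```python
class ToyEMA:
    """Keeps a smoothed 'shadow' copy of named scalar parameters."""

    def __init__(self, params, decay=0.999):
        self.decay = decay
        # Start the shadow copy at the current parameter values.
        self.shadow = dict(params)

    def update(self, params):
        # shadow <- decay * shadow + (1 - decay) * current
        for name, value in params.items():
            self.shadow[name] = self.decay * self.shadow[name] + (1.0 - self.decay) * value


if __name__ == "__main__":
    weights = {"w": 0.0}
    ema = ToyEMA(weights, decay=0.5)
    for step in range(3):
        weights["w"] = 1.0  # pretend an optimizer step changed the weight
        ema.update(weights)
        print(step, ema.shadow["w"])
```

A decay close to 1 makes the shadow copy change slowly, which smooths out step-to-step fluctuations at the cost of lagging behind the live weights.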

The project was sponsored by StabilityAI and builds on tools from Hugging Face, including the transformers library it uses for text encoding. Several community contributors helped find bugs and test the code. There is also experimental support for generating video, not just still images, from text.

This is a research implementation intended for people who want to train or experiment with text-to-image models on their own hardware. It requires significant computing resources and machine learning experience to use effectively.
