DALLE-pytorch

Python ★ 5.6k updated 2y ago

Implementation / replication of DALL-E, OpenAI's Text to Image Transformer, in Pytorch

DALL-E in Pytorch is a Python library that recreates OpenAI's original DALL-E model, which takes a text description and generates a matching image. This is a community-built implementation, not an official OpenAI release.

The original DALL-E system works in two stages. First, a component called a discrete VAE (Variational Autoencoder) is trained to compress each image into a compact sequence of visual tokens, where each token is an index into a learned codebook, much as a word is an entry in a vocabulary. Then, a transformer model is trained to take a sequence of text tokens and predict the corresponding sequence of visual tokens. At generation time, the transformer produces visual tokens from a written description and the VAE's decoder turns those tokens back into pixels.
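The two stages can be illustrated with a toy sketch in plain Python. This is not the library's code: it swaps the trained discrete VAE for a simple nearest-neighbour lookup in a hand-written codebook, and the names (nearest_code, encode_image, build_sequence) and the id-offset scheme are hypothetical, chosen only to show how an image becomes a token sequence that can sit next to text tokens.

```python
from typing import List, Sequence


def nearest_code(patch: Sequence[float], codebook: Sequence[Sequence[float]]) -> int:
    """Return the index of the codebook vector closest to `patch` (squared L2)."""
    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(codebook)), key=lambda i: sq_dist(patch, codebook[i]))


def encode_image(patches: Sequence[Sequence[float]],
                 codebook: Sequence[Sequence[float]]) -> List[int]:
    """Stage 1 (toy stand-in): map each image patch to one visual token id."""
    return [nearest_code(p, codebook) for p in patches]


def decode_image(tokens: Sequence[int],
                 codebook: Sequence[Sequence[float]]) -> List[List[float]]:
    """Toy decoder: look each visual token up in the codebook again."""
    return [list(codebook[t]) for t in tokens]


def build_sequence(text_tokens: Sequence[int], image_tokens: Sequence[int],
                   num_text_tokens: int) -> List[int]:
    """Stage 2 input: text tokens followed by visual tokens.

    Visual ids are shifted by `num_text_tokens` so the two vocabularies
    do not collide (an illustrative assumption, not the library's layout).
    """
    return list(text_tokens) + [num_text_tokens + t for t in image_tokens]


# Made-up data: a 3-entry codebook and three 2-dimensional "patches".
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
patches = [[0.1, -0.1], [0.9, 1.2], [0.2, 0.8]]

image_tokens = encode_image(patches, codebook)
sequence = build_sequence(text_tokens=[5, 7], image_tokens=image_tokens,
                          num_text_tokens=10)

print(image_tokens)  # [0, 1, 2]
print(sequence)      # [5, 7, 10, 11, 12]
```

In the real system both the codebook and the encoder are learned, and the transformer is trained to predict the visual part of such a sequence one token at a time given the text part.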

The library provides both pieces as installable Python classes. You can train your own VAE from scratch, use the VAE that OpenAI released alongside the original paper, or use a third-party VAE from a related project called Taming Transformers. The code also supports DeepSpeed, a library for training large models more efficiently across multiple GPUs. Sparse attention, which reduces memory usage in the transformer, is available through an optional Triton back-end.

Community members have trained small versions on datasets ranging from 2,000 landscape photos to 150,000 layout images, and the README shows results from those experiments along with links to checkpoint files others have shared. A Colab notebook lets anyone try inference without setting up a local environment.

This implementation covers the first DALL-E paper only. The author has since moved on to DALL-E 2, which lives in a separate repository. The library is installed with pip and is written in Python on top of the PyTorch framework.
