
VAR

Jupyter Notebook · ★ 8.7k · updated 7mo ago

[NeurIPS 2024 Best Paper Award][GPT beats diffusion🔥] [scaling laws in visual generation📈] Official impl. of "Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction". An *ultra-simple, user-friendly yet state-of-the-art* codebase for autoregressive image generation!

VAR is an AI image generation project whose paper won the NeurIPS 2024 Best Paper Award. It builds images coarse-to-fine across scales using next-scale prediction and outperforms diffusion models on several benchmarks. Pretrained models from 310M to 2.3B parameters are available on Hugging Face.

Python · PyTorch · Jupyter Notebook · Hugging Face · setup: hard · complexity 4/5

VAR, short for Visual Autoregressive Modeling, is a 2024 research project that introduced a new way to generate images with artificial intelligence. Its paper won the Best Paper Award at NeurIPS 2024, one of the most prestigious AI research conferences. The paper argues for an approach to image generation that competes with, and on some benchmarks outperforms, the diffusion-based methods (like Stable Diffusion) that have dominated AI image generation in recent years.

The core idea is a shift in how image generation is framed. Most autoregressive image models generate an image pixel-by-pixel or token-by-token in a left-to-right, top-to-bottom order, similar to how a language model writes text one word at a time. VAR instead generates images coarse-to-fine: it first predicts a very low-resolution version of the whole image, then progressively refines it at increasing resolutions until the final image is complete. The paper calls this "next-scale prediction" rather than "next-token prediction."
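The coarse-to-fine loop can be illustrated with a toy sketch. This is not the repository's code: `toy_predictor`, `generate_next_scale`, the scale schedule `(1, 2, 4, 8)`, and the choice to upsample each scale and add it to a running canvas are all assumptions made for illustration, and the "model" here just returns random numbers. The point is only the control flow: one sequential step per scale, each producing a whole map, instead of one step per token.

```python
import random


def upsample_nearest(grid, out_side):
    """Nearest-neighbour upsampling of a square 2D list to out_side x out_side."""
    in_side = len(grid)
    return [
        [grid[r * in_side // out_side][c * in_side // out_side] for c in range(out_side)]
        for r in range(out_side)
    ]


def toy_predictor(canvas, side, rng):
    """Hypothetical stand-in for the trained model.

    A real model would condition on `canvas` (everything generated at coarser
    scales); this toy ignores it and returns random values so the loop runs.
    """
    return [[rng.random() for _ in range(side)] for _ in range(side)]


def generate_next_scale(scales=(1, 2, 4, 8), seed=0):
    """Build a final_side x final_side map in len(scales) sequential steps."""
    rng = random.Random(seed)
    final_side = scales[-1]
    canvas = [[0.0] * final_side for _ in range(final_side)]
    steps = 0
    for side in scales:
        # One step predicts the entire side x side map at this scale.
        scale_map = toy_predictor(canvas, side, rng)
        # Assumption of this sketch: each scale is upsampled to full size
        # and added to the canvas as a refinement.
        refinement = upsample_nearest(scale_map, final_side)
        canvas = [
            [canvas[r][c] + refinement[r][c] for c in range(final_side)]
            for r in range(final_side)
        ]
        steps += 1
    return canvas, steps


def raster_order_steps(side):
    """Sequential steps a token-by-token model needs for a side x side map."""
    return side * side


canvas, steps = generate_next_scale()
print(steps, raster_order_steps(8))
```

With the default schedule, the loop reaches an 8×8 map in 4 sequential steps, where a raster-order model would need 64.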

The repository provides pretrained models of several sizes, ranging from 310 million to 2.3 billion parameters, available for download from Hugging Face. A Jupyter notebook is included so you can load a model and generate images without writing much code yourself. Larger models produce better results as measured by FID (Fréchet Inception Distance), a standard image-quality metric where lower is better. The paper documents that these improvements follow predictable scaling laws, similar to those observed in large language models.
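A scaling law of this kind is usually written as a power law, y = a · x^b, which becomes a straight line on log-log axes. The sketch below shows one common way to fit such a curve with ordinary least squares in log space. The function name `fit_power_law` is made up here, and the data points are synthetic, generated from a known curve purely to show that the fit recovers it. They are not numbers from the paper.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**b in log-log space; returns (a, b)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    # Slope of the best-fit line through (log x, log y) is the exponent b.
    s_xx = sum((u - mean_x) ** 2 for u in log_x)
    s_xy = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    b = s_xy / s_xx
    # The intercept of that line is log(a).
    a = math.exp(mean_y - b * mean_x)
    return a, b


# Synthetic example: model sizes and an error-like metric drawn from
# y = 5 * x**-0.2. These values are illustrative only.
sizes = [1e8, 1e9, 1e10]
metric = [5.0 * x ** -0.2 for x in sizes]

a, b = fit_power_law(sizes, metric)
print(f"a = {a:.3f}, b = {b:.3f}")
```

A negative exponent `b` means the metric falls steadily as model size grows, which is the pattern the paragraph above describes.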

Training the model from scratch requires the ImageNet dataset and substantial compute. The README includes training scripts and configuration details for researchers who want to reproduce or extend the work. For most people, the pretrained weights and the demo notebook are the practical entry point.

A follow-on project called Infinity, also linked from this repository, extends the VAR approach to text-to-image generation and was accepted at CVPR 2025.

Where it fits