Sana

Python ★ 8.3k updated 3d ago

SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformer

SANA is an AI research codebase for generating high-resolution images and videos from text descriptions. You type a prompt — a written description of what you want to see — and the model produces an image or video matching that description. It comes from NVIDIA Labs and has been published at major AI research venues.

The problem it solves: generating high-quality images at large sizes (like 4K) with existing AI models is slow and demands enormous amounts of GPU memory. SANA uses a more efficient architecture called a Linear Diffusion Transformer that achieves comparable quality with lower computational cost, making it practical to run on less hardware.

How it works: diffusion models work by starting with random visual noise and gradually refining it into an image guided by your text prompt. SANA's Linear Transformer variant processes this more efficiently than standard approaches. The codebase covers multiple related models: the original image generator, a faster one-step version called Sprint, a video generation model, and a world-modeling variant (SANA-WM) for generating 720p video with camera control. Training code, inference code, and pre-trained weights are all included. It integrates with PyTorch and can run on tools like ComfyUI (a visual interface for AI image generation).

You would use SANA if you're an AI researcher or developer wanting to generate images or video from text at high resolution, fine-tune the models on your own data, or experiment with reinforcement learning for post-training. It's a Python project designed for GPU hardware. The full README is longer than what was provided.

Open on GitHub → Full breakdown on explaingit →