generative-models
Generative Models by Stability AI
Research models from Stability AI that use diffusion to generate novel-view videos and multi-view imagery from a single image or video.
This repository, Generative Models by Stability AI, is the home for a series of research models that generate visual content from images and short videos. The README walks through the releases by date. Stable Video 4D 2.0 (SV4D 2.0) is described as a video-to-4D diffusion model: it takes a short input video of a moving object and produces novel-view videos that look like the same scene filmed from other camera angles. The earlier Stable Video 4D and Stable Video 3D models are also documented; SV3D is described as an image-to-video model that generates multiple synthetic views from a single picture. All of these are diffusion models, the family of generative AI systems that produce images or videos by gradually refining noise into a coherent output, guided by an input.
The README gives practical numbers for SV4D 2.0: from a 12-frame input it generates 48 frames (12 video frames for each of 4 camera views) at 576-by-576 resolution. The input is ideally clean, white-background footage of a single moving object. Longer outputs are produced by running the model in steps and feeding earlier results back in. The sampling scripts accept a gif or mp4 file, a folder of frames, or a filename pattern; they download weights from Hugging Face and write the generated frames to an output folder. Options cover the number of sampling steps, camera elevation, background removal, and running on GPUs with less memory.
The repository suits research on synthesizing new views of objects from limited footage, for example multi-view content generation or 4D asset creation. The README marks the releases as intended for research purposes. The code is written in Python and uses PyTorch with CUDA.
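The three accepted input forms (a gif or mp4 file, a folder of frames, or a filename pattern) can be told apart with a few path checks. The sketch below is only an illustration of that idea and is not the repository's code: the names `classify_input` and `list_frame_files` are hypothetical, and the set of image suffixes is an assumption.

```python
import glob
from pathlib import Path

# The summary names gif and mp4 as the accepted video formats.
VIDEO_SUFFIXES = {".gif", ".mp4"}
# Assumption: frame images use one of these common suffixes.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def classify_input(input_path: str) -> str:
    """Return 'frame_folder', 'frame_pattern', or 'video_file' (hypothetical helper)."""
    path = Path(input_path)
    if path.is_dir():
        return "frame_folder"
    # Glob wildcards mark the path as a filename pattern.
    if any(ch in input_path for ch in "*?["):
        return "frame_pattern"
    if path.suffix.lower() in VIDEO_SUFFIXES:
        return "video_file"
    raise ValueError(f"Unsupported input: {input_path!r}")


def list_frame_files(input_path: str) -> list[str]:
    """List frame image files, sorted by name, for a folder or pattern input."""
    kind = classify_input(input_path)
    if kind == "frame_folder":
        files = [
            str(p)
            for p in Path(input_path).iterdir()
            if p.suffix.lower() in IMAGE_SUFFIXES
        ]
        return sorted(files)
    if kind == "frame_pattern":
        return sorted(glob.glob(input_path))
    # Video files need a decoder, which is outside the scope of this sketch.
    raise ValueError("Video inputs must be decoded into frames first")
```

For instance, `classify_input("clip.mp4")` returns `"video_file"` and `classify_input("frames/*.png")` returns `"frame_pattern"`. Sorting by name assumes the frame files are numbered so that lexical order matches temporal order.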
Where it fits
- Generate novel views of a moving object from a single video, as if it had been filmed from several camera angles.
- Create 360-degree orbital videos around objects captured in a single still image.
- Build creative applications that turn 2D images or videos into multi-view 3D-like experiences.