vggt-omega

Python ★ 3.1k updated 1mo ago

[CVPR 2026 Oral] VGGT Omega

VGGT-Omega is a research model from Meta AI and the University of Oxford's Visual Geometry Group, presented at CVPR 2026. Given a set of photos, it reconstructs the 3D structure of the scene: the position and orientation of the camera for each photo (camera pose estimation) and the distance from the camera to the point seen at each pixel (depth estimation).

The repository's code example shows the core workflow: you load a batch of images, pass them through the model, and get back camera parameters along with a depth map for each image. The camera parameters come in two parts: extrinsics (where the camera was and which way it pointed) and intrinsics (lens properties such as focal length). The model comes in two versions: a 512-resolution version for high-quality output, and a 256-resolution version that has been aligned with text descriptions, so you can also extract an embedding that relates the visual content to language. A Gradio-based interactive demo is included: you upload images or a video, and it shows a 3D point cloud visualization of the reconstructed scene.
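To make the relationship between these outputs concrete, here is a minimal sketch of how a depth map plus camera intrinsics and extrinsics can be lifted into the kind of 3D point cloud the demo displays. It is not the repository's code and does not use its API. The function name `unproject_depth` is hypothetical, and the sketch assumes a distortion-free pinhole camera, depth measured along the camera z-axis, and world-to-camera extrinsics (X_cam = R X_world + t). The model's actual conventions may differ.

```python
def unproject_depth(depth, K, R, t):
    """Lift a depth map to world-space 3D points (hypothetical helper).

    depth: H x W nested list of distances along the camera z-axis.
    K:     3x3 intrinsics matrix [[fx, 0, cx], [0, fy, cy], [0, 0, 1]].
    R, t:  assumed world-to-camera rotation (3x3) and translation (length 3).
    """
    fx, fy = K[0][0], K[1][1]
    cx, cy = K[0][2], K[1][2]
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            # Pixel (u, v) with depth z -> camera coordinates (pinhole model).
            cam = ((u - cx) * z / fx, (v - cy) * z / fy, z)
            # Camera -> world by inverting the extrinsics: X_world = R^T (X_cam - t).
            d = [cam[i] - t[i] for i in range(3)]
            world = tuple(sum(R[k][i] * d[k] for k in range(3)) for i in range(3))
            points.append(world)
    return points


# Toy example: a 2x2 depth map seen by a camera at the world origin.
K = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5], [0.0, 0.0, 1.0]]
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [0.0, 0.0, 0.0]
depth = [[2.0, 2.0], [2.0, 2.0]]
points = unproject_depth(depth, K, R, t)
```

Each pixel yields one 3D point, so running this over every frame and merging the results gives a combined point cloud for the scene.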

Running the model requires a GPU with sufficient memory: on an NVIDIA A100, processing a single image takes about 6 GB, and processing 100 frames at once takes around 13 GB. Pretrained model weights are hosted on Hugging Face, but downloading them requires an approved access request.
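For rough planning, the two reported figures can be turned into a back-of-the-envelope estimate for other frame counts. The sketch below assumes memory grows linearly between the two reported points. That is an assumption, not something the source states, and actual usage may differ with resolution, hardware, and model version. The function name `estimate_memory_gb` is hypothetical.

```python
def estimate_memory_gb(num_frames):
    """Rough GPU memory estimate from the two reported A100 figures.

    Assumes linear scaling between 1 frame (~6 GB) and 100 frames (~13 GB);
    this is an illustrative assumption, not a measured curve.
    """
    if num_frames < 1:
        raise ValueError("num_frames must be at least 1")
    per_extra_frame = (13.0 - 6.0) / (100 - 1)  # about 0.07 GB per added frame
    return 6.0 + per_extra_frame * (num_frames - 1)


for n in (1, 25, 50, 100):
    print(f"{n:>3} frames: ~{estimate_memory_gb(n):.1f} GB")
```

Treat the result as a starting point for choosing a batch size, and confirm against actual measurements on your own hardware.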
