GLD
Official implementation of "Repurposing Geometric Foundation Models for Multi-view Diffusion"
GLD (Geometric Latent Diffusion) is a research implementation of a technique for generating new viewpoints of a scene from just a few input photographs. The problem it addresses is novel view synthesis: given images of an object or scene taken from some angles, produce realistic images from angles that were not photographed, along with accurate depth and 3D structure.
The approach works differently from common video-generation methods. Instead of running diffusion in the latent space of a general-purpose image compression layer, GLD operates inside the feature space of models that already understand geometry, specifically models trained to estimate depth from images. A diffusion model, which works by gradually denoising random patterns into coherent outputs, learns to generate the features of new viewpoints directly in this geometry-aware space. The pre-trained geometry models then decode these features into both rendered images and 3D depth maps without additional training. According to the authors, this design converges 4.4 times faster during training than standard approaches.
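To make "gradually denoising random patterns" concrete, here is a minimal, generic sketch of deterministic DDIM-style sampling over a small feature vector. It is not GLD's code: the noise schedule, the sampler, and the names `make_alpha_bars`, `make_oracle`, and `ddim_sample` are all assumptions made for illustration, and the "oracle" denoiser simply stands in for a trained network that would predict noise in the geometry model's feature space.

```python
import math
import random


def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (assumed)."""
    alpha_bars, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / max(num_steps - 1, 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def make_oracle(target, alpha_bars):
    """Hypothetical stand-in for a trained denoiser: returns the exact noise in x_t."""
    def eps_model(x_t, t):
        a = alpha_bars[t]
        return [(x - math.sqrt(a) * z) / math.sqrt(1.0 - a)
                for x, z in zip(x_t, target)]
    return eps_model


def ddim_sample(eps_model, alpha_bars, dim, rng):
    """Deterministic reverse pass (DDIM with eta = 0), starting from pure noise."""
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for t in reversed(range(len(alpha_bars))):
        a = alpha_bars[t]
        eps_hat = eps_model(x, t)
        # Estimate the clean features implied by the current noise prediction.
        x0_hat = [(v - math.sqrt(1.0 - a) * e) / math.sqrt(a)
                  for v, e in zip(x, eps_hat)]
        if t == 0:
            return x0_hat
        # Step to the previous (less noisy) level using the same noise estimate.
        a_prev = alpha_bars[t - 1]
        x = [math.sqrt(a_prev) * z + math.sqrt(1.0 - a_prev) * e
             for z, e in zip(x0_hat, eps_hat)]
    return x


if __name__ == "__main__":
    target_features = [0.5, -1.0, 2.0]  # toy "geometry features" for one view
    schedule = make_alpha_bars(50)
    sample = ddim_sample(make_oracle(target_features, schedule), schedule,
                         dim=3, rng=random.Random(0))
    print([round(v, 4) for v in sample])
```

Because the oracle always reports the true noise, the sampler recovers the target features up to floating-point error. In a real system the oracle would be replaced by a learned network, and the resulting features would be passed to the frozen geometry model's decoder to obtain images and depth.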
GLD is aimed at computer vision researchers working on 3D reconstruction, novel view synthesis, or scene understanding. Running the demo requires a GPU with at least 48 GB of memory, such as an A100 or A6000. The codebase is written in Python, and pre-trained model weights can be downloaded from Hugging Face. The demo generates 3D scene reconstructions from the included sample images.
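Before downloading the weights, it can be useful to check whether a local GPU meets the memory requirement. The sketch below is not part of the GLD repository; it shells out to `nvidia-smi` and compares the reported total memory against a threshold. The helper names and the 48,000 MiB threshold are assumptions: the threshold is set slightly below 48 × 1024 MiB on the assumption that cards sold as 48 GB may report somewhat less.

```python
import subprocess

MIN_MIB = 48_000  # assumed threshold with some slack; not taken from the GLD repo


def parse_memory_totals(smi_output):
    """Parse one total-memory value (MiB) per non-empty line; non-numeric lines count as 0."""
    totals = []
    for line in smi_output.splitlines():
        value = line.strip()
        if value:
            totals.append(int(value) if value.isdigit() else 0)
    return totals


def gpus_meeting_requirement(totals_mib, min_mib=MIN_MIB):
    """Return the indices of GPUs whose total memory is at least min_mib."""
    return [i for i, mib in enumerate(totals_mib) if mib >= min_mib]


def query_gpu_memory():
    """Ask nvidia-smi for total memory per GPU, in MiB, without headers or units."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_memory_totals(result.stdout)


if __name__ == "__main__":
    try:
        totals = query_gpu_memory()
    except (FileNotFoundError, subprocess.CalledProcessError):
        totals = []  # nvidia-smi is missing or failed; treat as no usable GPU
    suitable = gpus_meeting_requirement(totals)
    if suitable:
        print(f"GPUs with enough memory: {suitable}")
    else:
        print("No GPU with at least ~48 GB of memory was detected.")
```

The parsing is kept separate from the `nvidia-smi` call so the threshold logic can be exercised on machines without a GPU.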