dvlt
Official implementation of Déjà View: Looping Transformers for Multi-View 3D Reconstruction
DVLT, short for Déjà View Looping Transformer, is the official code release accompanying the paper of the same name, from NVIDIA Research and collaborators at universities in Italy, Canada, and Switzerland. The model takes a set of photos of a scene and reconstructs that scene in 3D: it estimates the depth at each pixel, recovers where each camera was positioned when the photos were taken, and builds a point cloud representing the 3D structure.
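To make the relationship between the three outputs concrete, here is a minimal sketch of how a per-pixel depth map and a camera pose combine into a point cloud. It assumes a standard pinhole camera model and a camera-to-world pose matrix; the function name `unproject_depth` and its parameters are hypothetical and are not taken from the DVLT code, whose output conventions may differ.

```python
def unproject_depth(depth, fx, fy, cx, cy, cam_to_world):
    """Lift a depth map to world-space 3D points (pinhole model, assumed).

    depth:        list of rows; depth[v][u] is the z-distance at pixel (u, v)
    fx, fy:       focal lengths in pixels
    cx, cy:       principal point in pixels
    cam_to_world: 3x4 matrix [R | t] as nested lists
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                # Skip pixels with no valid depth.
                continue
            # Back-project the pixel into camera coordinates.
            cam = ((u - cx) / fx * z, (v - cy) / fy * z, z)
            # Apply rotation and translation to reach world coordinates.
            world = tuple(
                sum(cam_to_world[i][j] * cam[j] for j in range(3))
                + cam_to_world[i][3]
                for i in range(3)
            )
            points.append(world)
    return points


# Hypothetical 2x2 depth map; the camera sits at x = 1 with no rotation.
pose = [[1.0, 0.0, 0.0, 1.0],
        [0.0, 1.0, 0.0, 0.0],
        [0.0, 0.0, 1.0, 0.0]]
cloud = unproject_depth([[2.0, 2.0], [2.0, 0.0]], 1.0, 1.0, 0.0, 0.0, pose)
```

Running this over every input view and concatenating the results gives a single point cloud in a shared world frame, which is the kind of output the model description refers to.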
What sets DVLT apart from similar models is its design. Instead of stacking many large processing blocks that each run once, it applies a smaller shared block of computations repeatedly. Each loop refines the 3D reconstruction further. The number of loops can be adjusted at inference time: more loops cost more compute and give better results, while fewer loops run faster. This lets a relatively small model match the quality of larger models that process everything in one pass.
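The looping idea can be illustrated with a deliberately tiny toy. This is not the DVLT architecture: the real shared block is a transformer, whereas the hypothetical `shared_block` below is a single fixed update rule. The sketch only shows the two properties described above, namely that the same parameters are reused on every loop and that the loop count is chosen when the model is run.

```python
def shared_block(state, observation, step=0.5):
    """One refinement step. The same parameter (step) is reused on every loop."""
    return [s + step * (o - s) for s, o in zip(state, observation)]


def run_looped(observation, num_loops):
    """Apply the shared block num_loops times, starting from a zero state."""
    state = [0.0] * len(observation)
    for _ in range(num_loops):
        state = shared_block(state, observation)
    return state


def max_error(estimate, target):
    """Largest absolute difference between estimate and target."""
    return max(abs(e - t) for e, t in zip(estimate, target))


# Hypothetical target values standing in for the quantity being reconstructed.
target = [4.0, -2.0]
errors = [max_error(run_looped(target, n), target) for n in (1, 2, 4, 8)]
```

In this toy the error halves with each additional loop, so `errors` shrinks as the loop count grows while the parameter count stays fixed. The real model's quality curve will not follow such a simple rule, but the compute-for-quality trade is the same in spirit.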
The repository includes the model code, pre-trained weights available through Hugging Face, training scripts, and evaluation code for several standard 3D reconstruction benchmark datasets, including DTU, ETH3D, 7Scenes, ScanNet++, and nuScenes. It also includes a browser-based interactive demo built with Gradio, where you can upload photos or a video clip and view the predicted depth maps, camera positions, and 3D point cloud. Alongside the DVLT model, the repository includes wrappers for several other publicly available 3D reconstruction models so that their results can be compared on the same benchmarks.
Training uses PyTorch with GPU support and Hydra, a configuration framework that organizes settings into composable config files. The repository is set up for both single-GPU and multi-GPU training and evaluation using the Hugging Face Accelerate library.