gitmyhub

3DConsistency-metrics

Python, ★ 15, updated 28d ago

Official Code for "Can These Views Be One Scene?"

Research code from Johns Hopkins that benchmarks the reliability of 3D reconstruction models, shows that popular models confidently produce 3D geometry from random noise, and provides COLMAP-based metrics that correlate with human judgment up to 4x better than the existing learned approach.

Python, CUDA, Gradio, Hugging Face, COLMAP; setup: hard; complexity: 4/5

This repository contains the code and benchmark data from a Johns Hopkins University research project that asks a specific question: when AI models reconstruct a 3D scene from multiple photos, can their output be trusted? The paper's short answer is that often it cannot, and this code exists to measure and expose that problem.

The core finding is that several widely-used 3D reconstruction models, including VGGT, MASt3R, DUSt3R, and Fast3R, will confidently produce 3D geometry even when fed pure random noise as input. This is a serious reliability problem. Evaluation tools built on top of these models inherit the flaw, meaning they can report that a set of images looks like a consistent 3D scene when it is, in fact, nonsense.

To study this, the researchers built SysCON3D, a controlled benchmark dataset with different categories of broken input: pure Gaussian noise, mixed scenes that combine unrelated images, single outlier frames, and patched corruptions, alongside clean working scenes as a baseline. The dataset is hosted on Hugging Face and the code here downloads and evaluates it. There is also a human evaluation site where people rated scene consistency, giving the researchers a way to check whether automated metrics agree with human perception.
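The corruption categories described above can be illustrated with a few lines of NumPy. This is a minimal sketch, not the repository's actual generation code: the function names, the noise parameters (mean 0.5, standard deviation 0.25), the half-and-half mixing rule, and the one-patch-per-view rule are all assumptions made for illustration. A scene is assumed to be a float array of shape (views, height, width, 3) with values in [0, 1].

```python
import numpy as np

# All names and parameters below are hypothetical, chosen only to
# illustrate the four categories of broken input.

def gaussian_noise_views(n_views, height, width, rng):
    """Return n_views RGB images of pure Gaussian noise, clipped to [0, 1]."""
    noise = rng.normal(loc=0.5, scale=0.25, size=(n_views, height, width, 3))
    return np.clip(noise, 0.0, 1.0).astype(np.float32)

def mixed_scene(scene_a, scene_b):
    """First half of the views from scene_a, the rest from scene_b."""
    half = len(scene_a) // 2
    return np.concatenate([scene_a[:half], scene_b[half:]], axis=0)

def single_outlier(scene, outlier_view, index):
    """Replace exactly one view with an unrelated frame."""
    corrupted = scene.copy()
    corrupted[index] = outlier_view
    return corrupted

def patch_corrupt(scene, patch_size, rng):
    """Overwrite one random square patch in every view with Gaussian noise."""
    corrupted = scene.copy()
    n_views, height, width, _ = corrupted.shape
    for i in range(n_views):
        top = rng.integers(0, height - patch_size + 1)
        left = rng.integers(0, width - patch_size + 1)
        patch = rng.normal(0.5, 0.25, size=(patch_size, patch_size, 3))
        corrupted[i, top:top + patch_size, left:left + patch_size] = np.clip(patch, 0.0, 1.0)
    return corrupted

# Toy stand-ins for two clean scenes (all black, all white).
rng = np.random.default_rng(0)
clean_a = np.zeros((4, 32, 32, 3), dtype=np.float32)
clean_b = np.ones((4, 32, 32, 3), dtype=np.float32)

noise = gaussian_noise_views(4, 32, 32, rng)
mixed = mixed_scene(clean_a, clean_b)
outlier = single_outlier(clean_a, noise[0], index=2)
patched = patch_corrupt(clean_a, patch_size=8, rng=rng)
```

A reliable consistency metric should score `clean_a` well and each of the four corrupted sets poorly, which is the behaviour the benchmark is designed to test.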

As an alternative to the flawed learned metrics, the code also provides COLMAP-based evaluation. COLMAP is a classical structure-from-motion pipeline that relies on feature matching and geometric reconstruction rather than learned neural networks, and the paper shows that these classical metrics correlate with human judgments up to four times better than the existing learned approach.
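To make "correlates with human judgments" concrete, the sketch below computes a Spearman rank correlation between a metric's scores and human ratings over the same scenes. The choice of Spearman correlation is an assumption; this summary does not say which correlation measure the paper uses. The function names are hypothetical and the numbers are toy values for illustration, not results from the benchmark.

```python
import math

def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (norm_x * norm_y)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))

# Toy values only: human consistency ratings and two hypothetical metrics.
toy_human = [5, 4, 4, 2, 1]
toy_metric_a = [0.9, 0.7, 0.8, 0.3, 0.1]   # roughly follows the human order
toy_metric_b = [0.5, 0.9, 0.1, 0.8, 0.6]   # mostly unrelated to it

print(spearman(toy_human, toy_metric_a))
print(spearman(toy_human, toy_metric_b))
```

A rank correlation is a natural fit here because human ratings and metric scores live on different scales, so only their ordering of scenes can be compared directly.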

Practically, the repository includes scripts for running the interactive comparison demo (a Gradio web app where you can upload images and see how different models reconstruct them), generating benchmark assets, and running the full suite of metrics on any folder of images. It requires Python 3.10 or 3.11 plus a GPU with the appropriate CUDA setup. Model checkpoints are not bundled but download automatically from Hugging Face at runtime.

Where it fits