Dream.exe

★ 27 · updated 16d ago

Dream.exe: Can Video Generation Models Dream Executable Robot Manipulation?

A research benchmark that tests whether AI-generated robot videos depict physically valid actions by converting them into robot movement plans and running them in a physics simulator.

setup: hard · complexity: 4/5

This is a research project from the Show Lab at the National University of Singapore, in collaboration with Oxford and Tencent. It asks whether video generation models can produce robot videos whose depicted actions would actually work if a robot tried to follow them.

The typical way to judge a video generation model is by asking whether the video looks realistic. Dream.exe takes a different approach: it converts the motion shown in a generated video into an actual robot movement plan, runs that plan in a physics simulator, and checks whether the task gets completed. A video can look convincing but still fail this test if the robot movement it depicts is physically impossible or poorly timed.
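The three steps described above (video to plan, plan to simulation, simulation to success check) can be sketched as follows. This is only an illustrative outline: the project's code has not been released, so every name here (`evaluate_video`, `extract_plan`, `simulate`, `EpisodeResult`) is hypothetical, and the plan extractor and simulator are passed in as plain callables rather than tied to any real library.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Hypothetical types: the real Dream.exe interfaces are not public.
Video = Sequence[object]  # frames of a generated video
Plan = Sequence[object]   # robot actions recovered from those frames


@dataclass
class EpisodeResult:
    task_id: str
    success: bool


def evaluate_video(
    task_id: str,
    video: Video,
    extract_plan: Callable[[Video], Plan],
    simulate: Callable[[str, Plan], bool],
) -> EpisodeResult:
    """Score one generated video by executing the motion it depicts."""
    # Step 1: turn the depicted motion into a robot movement plan.
    plan = extract_plan(video)
    # Step 2 and 3: run the plan in a simulator, which reports task completion.
    success = simulate(task_id, plan)
    return EpisodeResult(task_id=task_id, success=success)
```

The point of the design is that visual quality never enters the score: a video passes only if the simulator reports the task as completed.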

The project includes a benchmark of 101 tasks drawn from a robotics dataset called RoboCasa. The tasks are organized into three difficulty levels: simple single-object manipulation (pick something up, put it down), multi-object interactions where the robot needs to reason about how objects relate to each other, and multi-stage tasks that require the robot to complete several steps in the right order. Eight different video generation models were tested under this benchmark, including both open-source and closed-source systems.
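One way to report results over the three difficulty levels is a success rate per tier. The sketch below assumes each evaluated episode is recorded as a (tier, success) pair; the tier labels and the function name `success_rate_by_tier` are made up for illustration and are not taken from the project.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical labels for the three difficulty levels described above.
TIERS = ("single_object", "multi_object", "multi_stage")


def success_rate_by_tier(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Return the fraction of successful episodes for each tier that has data."""
    totals = defaultdict(int)
    wins = defaultdict(int)
    for tier, success in results:
        if tier not in TIERS:
            raise ValueError(f"unknown tier: {tier}")
        totals[tier] += 1
        wins[tier] += int(success)
    # Tiers with no episodes are omitted rather than reported as 0.0.
    return {t: wins[t] / totals[t] for t in TIERS if totals.get(t, 0) > 0}
```

Reporting per tier rather than one overall number keeps a model's performance on simple pick-and-place tasks from hiding failures on multi-stage ones.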

The findings from the paper suggest that AI models trained on general internet video already carry some understanding of physical cause and effect, since several models achieved measurable success at completing tasks despite no robot-specific training. The research also found that how visually polished a video looks is a poor indicator of whether the robot actions it depicts would actually work.

At the time of writing, the repository is a placeholder. The code, benchmark data, and evaluation tools are listed as coming soon; only the research overview and citation information are present.

Where it fits

Evaluate whether a video generation model produces physically plausible robot motions for a given manipulation task.
Benchmark multiple video generation models on a standardized set of 101 robot tasks across three difficulty levels.
