DreamVideo-Omni

Python · ★ 14 · updated 23d ago

DreamVideo-Omni: Omni-Motion Controlled Multi-Subject Video Customization with Latent Identity Reinforcement Learning

A research tool from Alibaba that generates AI videos with fine-grained control over multiple people and objects, letting you specify reference images, motion paths, bounding boxes, and camera movements all at once.

Python · DiffSynth-Studio · Wan2.1 · setup: hard · complexity 4/5

DreamVideo-Omni is a research project from Alibaba's Tongyi Lab and several partner universities that generates AI videos with fine-grained control over multiple people or objects and how they move. The core challenge it addresses is that existing video generation tools struggle when you want to specify both who or what appears in a video and exactly how each subject should move independently of the others.

The system accepts reference images of the subjects you want to appear, a text description, and optional motion cues: paths drawn on frames, bounding boxes that specify where each subject should be, and camera movement instructions. It can handle all three types of motion control at once, hence the name "Omni." To keep each subject recognizable throughout the video, the authors added a training step that rewards the model when the generated faces and appearances match the references, a technique they call latent identity reinforcement learning.

In practice, generating a video requires downloading the model weights (about 2.8 GB) plus a base model called Wan2.1 (fetched automatically on first run). You then run a Python script called infer.py and point it at a folder containing your reference images and a metadata file with your caption and motion instructions. The README includes three example cases: one using two reference images with no motion paths, one using motion tracks with no reference images, and one combining both.
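As a rough illustration of what preparing one such case folder might look like, the sketch below writes a metadata file holding a caption, reference image names, and an optional motion track. The file name metadata.json, every field name, and the helper functions write_case and linear_track are hypothetical, chosen only for this example; the actual layout expected by infer.py is defined in the project's README and may differ.

```python
import json
from pathlib import Path


def linear_track(start, end, num_frames):
    """Return one (x, y) point per frame, moving in a straight line.

    Hypothetical helper: a simple stand-in for a hand-drawn motion path.
    """
    if num_frames < 1:
        raise ValueError("num_frames must be at least 1")
    if num_frames == 1:
        return [[float(start[0]), float(start[1])]]
    points = []
    for i in range(num_frames):
        t = i / (num_frames - 1)  # 0.0 at the first frame, 1.0 at the last
        points.append([
            start[0] + t * (end[0] - start[0]),
            start[1] + t * (end[1] - start[1]),
        ])
    return points


def write_case(case_dir, caption, reference_images=(), tracks=None):
    """Write a metadata file for one generation case and return its path.

    The schema here (caption / reference_images / tracks) is an assumption
    made for illustration, not the format used by DreamVideo-Omni.
    """
    case_dir = Path(case_dir)
    case_dir.mkdir(parents=True, exist_ok=True)
    metadata = {
        "caption": caption,
        "reference_images": list(reference_images),
    }
    # Motion cues are optional, so only record them when supplied.
    if tracks is not None:
        metadata["tracks"] = tracks
    path = case_dir / "metadata.json"
    path.write_text(json.dumps(metadata, indent=2))
    return path
```

A case that combines references with motion could then be described with something like write_case("cases/demo", "two people walking", ["a.png", "b.png"], tracks=[linear_track((0, 0), (100, 50), 16)]), while the reference-only and track-only cases simply omit the other argument.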

The project was published as an academic paper in March 2026, and the inference code and trained weights were released in May 2026. It is built on two existing open-source projects: DiffSynth-Studio and Wan2.1. This is a research release aimed at developers and researchers who want to experiment with controllable video generation, not a consumer product with a graphical interface.

Where it fits

Generate a video of two specific people each following different drawn motion paths
Create a video where subjects stay in bounding-box regions you define while the camera pans
Experiment with identity-preserving video generation using your own reference face photos
Test combined motion control and text guidance in a research setting
