echomimic_v2

Python ★ 4.6k updated 4mo ago

[CVPR 2025] EchoMimicV2: Towards Striking, Simplified, and Semi-Body Human Animation

An AI research tool that generates a video of a person speaking from just a still photo and an audio clip, animating the lips, head, and upper body in sync with the audio. Supports English and Chinese.

Python, PyTorch, Gradio, ComfyUI, Jupyter
setup: hard
complexity: 4/5

EchoMimicV2 is an AI system developed by researchers at Ant Group (the company behind Alipay) that generates a video of a person talking from a single still photo and an audio clip. You provide a reference image and a speech recording, and the system produces a video in which the person in the image appears to speak, with lips, head, and upper body moving in sync with the audio. It goes beyond the face: it animates the upper half of the body, including shoulder and hand movement, which the researchers call semi-body animation. The work was accepted at CVPR 2025, one of the top computer vision research conferences.

The system supports both English and Chinese audio input. Standard inference takes roughly 7 minutes to produce 120 frames of video; an accelerated version released in January 2025 cuts that to about 50 seconds on a high-end A100 GPU, roughly a 9x speedup. A Gradio web interface lets users try it without writing Python code, and a ComfyUI integration is available for those who prefer that visual workflow tool.
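The timing figures quoted above can be sanity-checked with simple arithmetic. The sketch below uses only the numbers stated in the text (120 frames, about 7 minutes, about 50 seconds); the function names are illustrative and not part of the EchoMimicV2 code base.

```python
# Rough check of the quoted inference times. All inputs are the
# approximate figures from the text, so the results are approximate too.
FRAMES = 120
STANDARD_SECONDS = 7 * 60      # "roughly 7 minutes"
ACCELERATED_SECONDS = 50       # "about 50 seconds" on an A100


def seconds_per_frame(total_seconds: float, frames: int) -> float:
    """Average wall-clock time spent generating one frame."""
    return total_seconds / frames


def speedup(baseline_seconds: float, faster_seconds: float) -> float:
    """How many times faster the second timing is than the first."""
    return baseline_seconds / faster_seconds


if __name__ == "__main__":
    print(f"standard:    {seconds_per_frame(STANDARD_SECONDS, FRAMES):.2f} s/frame")
    print(f"accelerated: {seconds_per_frame(ACCELERATED_SECONDS, FRAMES):.2f} s/frame")
    print(f"speedup:     {speedup(STANDARD_SECONDS, ACCELERATED_SECONDS):.1f}x")
```

With the rounded inputs, the ratio comes out somewhat below 9, which is consistent with "roughly 7 minutes" and "about 50 seconds" both being approximations.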

Internally, the pipeline aligns the reference image with pose information extracted from a driving video, then generates the final animated output. The repository includes model weights hosted on Hugging Face and ModelScope, inference scripts, a Jupyter notebook demo, and the training dataset list along with its processing scripts.

This is a research release aimed at people working in AI video generation, digital avatar creation, or related areas. It is not a simple consumer application: setup requires installing several Python dependencies and downloading multi-gigabyte model weights. The README links to installation tutorials and a community discussion thread covering common setup problems.
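Because setup involves several Python dependencies and multi-gigabyte downloads, a quick preflight check before starting can save time. The following is a minimal sketch using only the standard library. The module list (`torch`, `gradio`) is inferred from the tags on this page, and the 20 GB free-space threshold is a made-up placeholder; neither is taken from the project's installation instructions.

```python
import importlib.util
import shutil

# Assumptions for illustration only: adjust to the project's real requirements.
REQUIRED_MODULES = ("torch", "gradio")
MIN_FREE_GB = 20.0  # hypothetical headroom for model weights


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def free_gb(path="."):
    """Free disk space, in GiB, on the filesystem containing `path`."""
    return shutil.disk_usage(path).free / 1024**3


def preflight(path=".", modules=REQUIRED_MODULES, min_free_gb=MIN_FREE_GB):
    """Collect human-readable problems; an empty list means no issue was found."""
    problems = []
    for name in missing_modules(modules):
        problems.append(f"Python module not installed: {name}")
    available = free_gb(path)
    if available < min_free_gb:
        problems.append(
            f"Only {available:.1f} GiB free at {path!r}; want at least {min_free_gb:.1f}"
        )
    return problems


if __name__ == "__main__":
    issues = preflight()
    print("\n".join(issues) if issues else "Preflight checks passed.")
```

This only checks that modules are importable and that disk space is available; it does not verify GPU availability or package versions, which the linked installation tutorials cover.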

Where it fits

Generate a talking-head video from a single photo and a speech audio file for a digital avatar or presentation.
Create animated spokespersons for video content without filming real people, using just a photo and script audio.
Test EchoMimicV2 through the Gradio web interface without writing any Python code.
Integrate EchoMimicV2 into a ComfyUI visual workflow for automated talking-head video production.
