echomimic_v2
[CVPR 2025] EchoMimicV2: Towards Striking, Simplified, and Semi-Body Human Animation
An AI research tool that generates a video of a person speaking from just a still photo and an audio clip, animating the lips, head, and upper body in sync with the audio. Supports English and Chinese.
EchoMimicV2 is an AI system developed by researchers at Ant Group (the company behind Alipay) that turns a still photo and an audio clip into a video of that person talking. You provide a reference image and a speech recording, and the system produces a video in which the person in the image appears to speak, with lips, head, and upper body moving in sync with the audio. It goes beyond the face: it animates the upper half of the body, including shoulder and hand movement, which the researchers call semi-body animation. The work was accepted at CVPR 2025, one of the top computer vision research conferences.
The system supports both English and Chinese audio input. Standard inference takes roughly 7 minutes to produce 120 frames of video; an accelerated version released in January 2025 cuts that to about 50 seconds on a high-end A100 GPU, a speedup quoted as 9x (the rounded timings work out to about 8.4x). A Gradio web interface lets users try it without writing Python code, and a ComfyUI integration is available for those who prefer that visual workflow tool.
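The timing figures above can be turned into throughput numbers with a few lines of arithmetic. The sketch below is only a back-of-the-envelope check: it assumes the rounded values quoted here (7 minutes and 50 seconds for 120 frames), and the variable and function names are illustrative, not part of the EchoMimicV2 codebase.

```python
# Rounded timings quoted in the description (assumed, not measured here).
FRAMES = 120
STANDARD_SECONDS = 7 * 60      # "roughly 7 minutes"
ACCELERATED_SECONDS = 50       # "about 50 seconds" on an A100


def frames_per_second(frames: int, seconds: float) -> float:
    """Generation throughput: frames produced per second of compute."""
    return frames / seconds


standard_fps = frames_per_second(FRAMES, STANDARD_SECONDS)
accelerated_fps = frames_per_second(FRAMES, ACCELERATED_SECONDS)

# Speedup is the ratio of the two wall-clock times for the same frame count.
speedup = STANDARD_SECONDS / ACCELERATED_SECONDS

print(f"standard:    {standard_fps:.2f} frames/s")
print(f"accelerated: {accelerated_fps:.2f} frames/s")
print(f"speedup:     {speedup:.1f}x")
```

With these rounded inputs the ratio comes out at 8.4x, so the quoted 9x figure is best read as an approximation based on less heavily rounded timings.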
Internally, the pipeline aligns the reference image with pose information extracted from a driving video, then generates the final animated output. The repository includes model weights hosted on Hugging Face and ModelScope, inference scripts, a Jupyter notebook demo, and the training dataset list along with its processing scripts.
This is a research release aimed at people working in AI video generation, digital avatar creation, or related areas. It is not a consumer application: setup requires installing several Python dependencies and downloading multi-gigabyte model weights. The README links to installation tutorials and a community discussion thread covering common setup problems.
Where it fits
- Generate a talking-head video from a single photo and a speech audio file for a digital avatar or presentation.
- Create animated spokespersons for video content without filming real people, using just a photo and script audio.
- Test EchoMimicV2 through the Gradio web interface without writing any Python code.
- Integrate EchoMimicV2 into a ComfyUI visual workflow for automated talking-head video production.