video-retalking

Python · ★ 7.3k · updated 1y ago

[SIGGRAPH Asia 2022] VideoReTalking: Audio-based Lip Synchronization for Talking Head Video Editing In the Wild

VideoReTalking is a research tool that replaces the lip movements in a talking-head video to match a new audio track, making a person appear to say completely different words.

Python · PyTorch · CUDA · setup: hard · complexity 4/5

VideoReTalking takes an existing video of a person talking and replaces the lip movements so that they match a different audio track. In the resulting video, the person appears to say whatever the new audio contains, even if the original speech was completely different. The method was published as a research paper at SIGGRAPH Asia 2022, a major computer graphics conference.

The tool works in three steps that run one after another without requiring manual work between them. First, it adjusts the facial expressions in each frame of the video to a neutral baseline so that the lip-sync step has a consistent starting point. Second, it uses a separate model that takes that normalized video along with your new audio and generates new lip movements that match the sounds. Third, a final step cleans up the result, sharpening the face region and making it look more photorealistic.
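The three-step flow can be illustrated with a toy pipeline. This is only a sketch of the control flow, not the repository's code: the `Clip` class and the stage names `neutralize_expression`, `lip_sync`, and `enhance_face` are hypothetical, and string labels stand in for real image frames and neural networks.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Clip:
    """Toy stand-in for a video: frame labels instead of image arrays."""
    frames: List[str]
    stages_applied: List[str] = field(default_factory=list)


def neutralize_expression(clip: Clip, audio: str) -> Clip:
    # Step 1: bring every frame to a neutral expression (the audio is not used here).
    frames = [f"neutral({f})" for f in clip.frames]
    return Clip(frames, clip.stages_applied + ["neutralize_expression"])


def lip_sync(clip: Clip, audio: str) -> Clip:
    # Step 2: generate new lip movements from the normalized frames and the new audio.
    frames = [f"sync({f}, {audio})" for f in clip.frames]
    return Clip(frames, clip.stages_applied + ["lip_sync"])


def enhance_face(clip: Clip, audio: str) -> Clip:
    # Step 3: clean up and sharpen the face region of each synced frame.
    frames = [f"enhance({f})" for f in clip.frames]
    return Clip(frames, clip.stages_applied + ["enhance_face"])


# Hypothetical stage order, mirroring the description above.
STAGES: List[Callable[[Clip, str], Clip]] = [
    neutralize_expression,
    lip_sync,
    enhance_face,
]


def run_pipeline(frames: List[str], audio: str) -> Clip:
    """Run all stages back to back, with no manual work in between."""
    clip = Clip(list(frames))
    for stage in STAGES:
        clip = stage(clip, audio)
    return clip


result = run_pipeline(["f0", "f1"], "new.wav")
print(result.stages_applied)
print(result.frames[0])
```

Each stage consumes the previous stage's output, which is why the expression is normalized before the lip-sync model ever sees the frames.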

To use it, you provide a video of a face and an audio file, and the tool produces a new video with the lips resynced. You can also influence the expression of the output by choosing templates like neutral or smile, or by modifying the upper face region with options like surprised or angry.
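As a rough sketch, an invocation could be assembled as shown here. The script name `inference.py` and the flags `--face`, `--audio`, `--outfile`, `--exp_img`, and `--up_face` are assumptions based on the upstream README and should be checked against the repository before use. The file names are hypothetical placeholders.

```python
import shlex
from typing import List, Optional


def build_command(
    face: str,
    audio: str,
    outfile: str,
    exp_img: Optional[str] = None,
    up_face: Optional[str] = None,
) -> List[str]:
    """Build an argument list for the lip-sync script.

    All flag names are assumptions; verify them against the repository's README.
    """
    cmd = [
        "python3", "inference.py",
        "--face", face,        # input video containing a face
        "--audio", audio,      # new audio track to sync the lips to
        "--outfile", outfile,  # where the resynced video is written
    ]
    if exp_img is not None:
        cmd += ["--exp_img", exp_img]  # expression template, e.g. "neutral" or "smile"
    if up_face is not None:
        cmd += ["--up_face", up_face]  # upper-face modification option
    return cmd


# Hypothetical file names, for illustration only.
cmd = build_command("input.mp4", "speech.wav", "output.mp4", exp_img="smile")
print(shlex.join(cmd))
```

Building the command as a list rather than a single string makes it safe to pass to `subprocess.run` without shell quoting problems.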

Setup requires Python, PyTorch, and a CUDA-capable graphics card. The instructions are written for CUDA 11.1, and you also need to download pre-trained model files separately before running. A Google Colab notebook is available if you want to try it without setting up a local environment.
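Before a local install, it can help to confirm that PyTorch is present and can see a CUDA device. The check below is a generic sketch and not part of the repository. The helper name `check_environment` is hypothetical, while `torch.cuda.is_available()` and `torch.version.cuda` are standard PyTorch attributes.

```python
import importlib.util
import sys


def check_environment() -> dict:
    """Report the Python version and, if PyTorch is installed, its CUDA status."""
    report = {
        "python": sys.version.split()[0],
        "torch_installed": False,
        "cuda_available": False,
        "cuda_version": None,
    }
    # Skip the import entirely when PyTorch is missing, so the check never crashes.
    if importlib.util.find_spec("torch") is None:
        return report

    import torch

    report["torch_installed"] = True
    report["cuda_available"] = torch.cuda.is_available()
    # CUDA version PyTorch was built against; None for CPU-only builds.
    report["cuda_version"] = torch.version.cuda
    return report


print(check_environment())
```

If `cuda_version` does not match the CUDA release the instructions target, the installed PyTorch build is the first thing to revisit.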

The code was produced by researchers at Xidian University and Tencent AI Lab. It runs entirely offline and does not send data anywhere. The repository also points to several related projects that work on similar problems, such as generating talking head animations from a single still image.

Where it fits

Dub a talking-head video into a different language by syncing new audio to the speaker's lip movements
Make a recorded speaker appear to deliver different words than what they originally said
Research and benchmark lip-sync quality by testing different audio inputs against the same source video
