SadTalker

Python · ★ 14k · updated 2y ago

[CVPR 2023] SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation

Takes a still photo of a face and an audio clip and generates a realistic video of that person speaking in sync with the sound, using AI trained on 3D facial motion.

Python · Anaconda · Gradio · PyTorch · setup: hard · complexity 3/5

SadTalker is a research tool that turns a single still photograph of a face and an audio recording into a short video of that face speaking in sync with the audio. One image and one audio clip are the only required inputs.

The technique was presented at CVPR 2023, a major academic computer vision conference, and was developed by researchers from Xi'an Jiaotong University, Tencent AI Lab, and Ant Group. It works by predicting 3D motion coefficients from the audio and using them to animate the face, so that lip movement, head motion, and facial expression follow the rhythm of the speech.

To use it locally, you install the project using Anaconda (a Python environment tool), download a set of pre-trained model files, and run it from the command line or through an optional browser-based interface built with Gradio. Installation guides exist for Linux, Windows, and macOS, and there is a Colab notebook so you can try it without installing anything on your own computer. A Discord server lets you send files and receive generated videos directly, which is the simplest no-setup option.
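The local command-line route could be scripted roughly as follows. This is a minimal sketch, not the project's documented interface: it assumes the entry point is a script named inference.py that accepts --source_image, --driven_audio, and --result_dir, names recalled from the project README rather than stated in this summary, so check them against the repository before relying on them. The input file names are hypothetical.

```python
import shlex
from pathlib import Path


def build_sadtalker_command(image, audio, result_dir="results"):
    """Return the argument list for one local SadTalker run.

    Assumption: the entry point is `inference.py` and the flag names below
    match the repository's README; verify both before use.
    """
    image, audio = Path(image), Path(audio)
    return [
        "python", "inference.py",
        "--source_image", str(image),
        "--driven_audio", str(audio),
        "--result_dir", str(result_dir),
    ]


# Hypothetical input files; replace with your own portrait and voiceover.
command = build_sadtalker_command("portrait.png", "voiceover.wav")
print(shlex.join(command))
```

To actually run it, the list can be passed to subprocess.run(command, check=True) from inside the cloned repository, with the Anaconda environment active and the pre-trained model files already downloaded.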

Several modes are available: one that animates a full-body image rather than just a face crop, a still mode that limits head movement, and a reference mode that uses a separate video to guide expression style. The project is now licensed under Apache 2.0, which removes the earlier non-commercial restriction. Pre-trained model weights can be downloaded from Google Drive, Baidu, or via a provided download script.
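One way to picture the modes is as extra command-line flags added to a basic run. The mapping below is a sketch under assumptions: the flag names --still, --preprocess full, and --ref_eyeblink are recalled from the project README and are not confirmed by this summary, and tying the reference mode to --ref_eyeblink in particular is a guess at which option is meant. The helper name mode_flags and the mode labels are hypothetical.

```python
def mode_flags(mode, reference_video=None):
    """Map a mode name to extra command-line flags.

    Assumption: every flag name here is recalled from the SadTalker README
    and should be verified against the repository before use.
    """
    if mode == "default":
        return []
    if mode == "still":
        # Limits head movement.
        return ["--still"]
    if mode == "full":
        # Animates the whole image rather than a face crop.
        return ["--still", "--preprocess", "full"]
    if mode == "reference":
        # A separate video guides the motion style.
        if reference_video is None:
            raise ValueError("reference mode needs a reference video path")
        return ["--ref_eyeblink", str(reference_video)]
    raise ValueError(f"unknown mode: {mode!r}")


# Hypothetical reference clip name.
print(mode_flags("full"))
print(mode_flags("reference", "reference_clip.mp4"))
```

The returned list is meant to be appended to the arguments of a basic invocation, which keeps the choice of mode separate from the choice of input files.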

Where it fits

Generate a talking-head video from any portrait photo and a voiceover audio file.
Create lip-synced video content for social media posts using a static headshot.
Animate a full-body image to produce a speaking character video.
Try video generation experiments in Google Colab without a local GPU.
