LatentSync

Python ★ 5.8k updated 1y ago

Taming Stable Diffusion for Lip Sync!

LatentSync is a research project from ByteDance that automatically makes a person's lip movements in a video match a different audio track. You give it a video of someone talking and a new audio clip, and it rewrites the mouth area so the lips appear to be saying the new words. This is sometimes called lip-sync or dubbing automation.

The system combines two existing AI components. It uses Whisper, a speech recognition model, to convert the audio into embeddings that carry timing and phonetic information. Those embeddings are then fed into a modified version of Stable Diffusion, a popular image generation model, which regenerates the face frame by frame so the mouth matches the audio. Generation happens end to end in a single stage rather than in two separate steps, which the authors say reduces certain visual artifacts.
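
To make the audio-to-frame pairing concrete, here is a minimal sketch of how a sequence of audio embeddings could be grouped around each video frame. The rate of 50 embeddings per second matches Whisper's encoder output (1,500 positions for a 30-second window). The 25 fps video rate, the context size, and the function name `audio_window_for_frame` are assumptions made for illustration and are not taken from the LatentSync code.

```python
# Illustrative sketch only: names, the 25 fps rate, and the context size are assumptions.
WHISPER_FEATURES_PER_SECOND = 50  # Whisper encoder: 1500 positions per 30 s of audio
VIDEO_FPS = 25                    # assumed video frame rate


def audio_window_for_frame(frame_idx, num_features, context_frames=2,
                           fps=VIDEO_FPS, rate=WHISPER_FEATURES_PER_SECOND):
    """Return the (start, end) slice of audio-feature indices around one video frame.

    The window spans `context_frames` video frames on each side of the target
    frame and is clamped to the valid range [0, num_features].
    """
    if rate % fps != 0:
        raise ValueError("this sketch assumes a whole number of features per frame")
    per_frame = rate // fps  # audio features per video frame (2 at 50 Hz / 25 fps)
    start = (frame_idx - context_frames) * per_frame
    end = (frame_idx + context_frames + 1) * per_frame
    return max(start, 0), min(end, num_features)


# One second of audio (50 features) paired with 25 video frames.
windows = [audio_window_for_frame(i, num_features=50) for i in range(25)]
print(windows[0], windows[12], windows[24])
```

Each frame sees a short stretch of audio on either side of it, so the generated mouth shape can reflect the sounds just before and after that moment. Windows at the start and end of the clip are simply shorter in this sketch; a real pipeline might pad them instead.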

To run it, you need a computer with a dedicated GPU. The lighter version (1.5) requires at least 8 GB of video memory, while the higher-resolution version (1.6, which produces 512×512-pixel output) requires at least 18 GB. You can run inference either through a browser-based interface built with Gradio or from the command line. A setup script downloads the required model checkpoints automatically.
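
As a quick way to reason about the memory figures quoted above, the following sketch picks the most capable version a given GPU could run for inference. The thresholds are the ones stated in this summary; the helper `pick_latentsync_version` is hypothetical and not part of the repository.

```python
# Thresholds are the inference figures from this summary; the helper is hypothetical.
VRAM_REQUIREMENTS_GB = {"1.6": 18, "1.5": 8}  # ordered from most to least demanding


def pick_latentsync_version(available_vram_gb):
    """Return the highest version whose VRAM requirement is met, or None."""
    for version, required_gb in VRAM_REQUIREMENTS_GB.items():
        if available_vram_gb >= required_gb:
            return version
    return None


for vram in (6, 8, 12, 24):
    print(vram, "GB ->", pick_latentsync_version(vram))
```

On a machine with PyTorch and CUDA installed, the total memory of the first GPU can be read in bytes with `torch.cuda.get_device_properties(0).total_memory`. Note that total memory is an upper bound: other processes may already be using part of it.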

The repository also includes the full training pipeline for researchers who want to train their own version. This covers data preparation steps such as video segmentation, face alignment, audio resampling, and quality filtering. Training the model from scratch requires substantially more GPU memory, ranging from 23 GB to 55 GB depending on the configuration.
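
To illustrate the first data preparation step, video segmentation, here is a minimal sketch that computes clip boundaries for a long video. The 5-second clip length, the 2-second minimum, and the function name `segment_bounds` are assumptions chosen for the example; the repository's actual scripts and settings may differ.

```python
# Illustrative sketch only: clip lengths and the function name are assumptions.
def segment_bounds(duration_s, segment_s=5.0, min_segment_s=2.0):
    """Split [0, duration_s] into consecutive clips of at most `segment_s` seconds.

    A trailing clip shorter than `min_segment_s` is dropped.
    """
    if segment_s <= 0:
        raise ValueError("segment_s must be positive")
    bounds = []
    start = 0.0
    while start < duration_s:
        end = min(start + segment_s, duration_s)
        if end - start >= min_segment_s:
            bounds.append((start, end))
        start = end
    return bounds


print(segment_bounds(12.0))  # three clips; the last one is 2 s long
print(segment_bounds(11.0))  # two clips; the 1 s remainder is dropped
```

The resulting `(start, end)` pairs could then be passed to a tool such as ffmpeg to cut the clips before the later alignment and filtering steps.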

LatentSync is released as an open-source research tool. It is intended for research and creative experimentation rather than production deployment.
