hallo

Python · ★ 8.7k · updated 1y ago

Hallo: Hierarchical Audio-Driven Visual Synthesis for Portrait Image Animation

A research AI tool that takes a still portrait photo and an audio clip and generates a short video of that face speaking or singing in sync with the audio. It runs on a GPU with pretrained models from Hugging Face.

Python · PyTorch · CUDA · ffmpeg · HuggingFace · setup: hard · complexity 5/5

Hallo is a research project from Fudan University and collaborators that takes a still portrait photo and an audio clip, then generates a video of that person's face moving and speaking in sync with the audio. The name stands for Hierarchical Audio-Driven Visual Synthesis. The core idea is that you supply one image of a face and one recording of speech or singing, and the system produces a short animated video where the portrait appears to talk or perform the audio naturally.

The project ships pretrained model weights that you download from Hugging Face. Once the weights are in place, you run a single Python inference script, pointing it at your image and audio files. It requires a Linux machine with a compatible NVIDIA GPU and CUDA installed: the README lists Ubuntu 20.04 or 22.04 and CUDA 12.1, with testing done on A100 GPUs. Setup involves creating a Python environment, installing the listed packages, and installing ffmpeg for video processing.
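The requirements above can be checked before attempting a run. The following is a minimal sketch, not the project's own tooling: it uses only the Python standard library to look for ffmpeg, an NVIDIA driver, and the two input files, then assembles an inference command. The script path `scripts/inference.py` and the flag names `--source_image` and `--driving_audio` are assumptions made for illustration, as are the file names `portrait.jpg` and `speech.wav`; confirm the real invocation in the README.

```python
import shutil
import sys
from pathlib import Path


def preflight(image: Path, audio: Path) -> list[str]:
    """Return a list of human-readable problems; an empty list means nothing obvious is missing."""
    problems = []
    if not sys.platform.startswith("linux"):
        problems.append(f"platform is {sys.platform!r}; the README lists Ubuntu 20.04 or 22.04")
    if shutil.which("ffmpeg") is None:
        problems.append("ffmpeg not found on PATH")
    if shutil.which("nvidia-smi") is None:
        problems.append("nvidia-smi not found; the NVIDIA driver may be missing")
    for label, path in (("image", image), ("audio", audio)):
        if not path.is_file():
            problems.append(f"{label} file not found: {path}")
    return problems


def build_command(image: Path, audio: Path,
                  script: str = "scripts/inference.py") -> list[str]:
    """Assemble the inference command as an argument list.

    ASSUMPTION: the script path and both flag names are illustrative
    placeholders; check the README for the actual ones.
    """
    return [
        sys.executable, script,
        "--source_image", str(image),
        "--driving_audio", str(audio),
    ]


if __name__ == "__main__":
    image, audio = Path("portrait.jpg"), Path("speech.wav")  # hypothetical inputs
    for problem in preflight(image, audio):
        print("warning:", problem)
    print(" ".join(build_command(image, audio)))
```

Returning the command as a list rather than a single string means it can be passed directly to `subprocess.run` without shell quoting issues if the file paths contain spaces.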

Beyond basic inference, the team later released training code as well, so users with their own image and audio data can attempt to train or fine-tune models themselves. The community has built several wrappers around the core code, including a Windows port, a Docker image, a ComfyUI integration, and a web-based interface, all linked from the README. There is also a Hugging Face hosted demo where you can try the tool in a browser without installing anything locally.

The repository targets researchers and developers who want to experiment with audio-driven face animation. Non-technical users looking for a quick browser demo can use the Hugging Face space. Anyone who wants to run it locally will need some comfort with command-line setup, GPU hardware, and downloading large model files. The README walks through each step including model download, data preparation, and the inference command.
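For the model download step, one possible approach is the `snapshot_download` function from the `huggingface_hub` package, which copies a whole model repository into a local folder. This is a hedged sketch rather than the README's procedure: the repository id `fudan-generative-ai/hallo` and the target folder `pretrained_models` are assumptions, so take the real values from the README. The `downloader` parameter is a hypothetical hook added here only so the function can be exercised without network access.

```python
from pathlib import Path


def fetch_weights(repo_id: str,
                  target_dir: str = "pretrained_models",
                  downloader=None) -> Path:
    """Download a Hugging Face model repository into target_dir and return that path.

    ASSUMPTION: the default folder name is illustrative; the README
    defines where the inference script expects to find the weights.
    """
    if downloader is None:
        # Imported lazily so the module loads even without the package installed.
        from huggingface_hub import snapshot_download  # pip install huggingface_hub
        downloader = snapshot_download
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    downloader(repo_id=repo_id, local_dir=str(target))
    return target


if __name__ == "__main__":
    # ASSUMPTION: repository id used for illustration; verify it in the README.
    print(fetch_weights("fudan-generative-ai/hallo"))
```

The weights are large, so expect the first download to take a while and to need substantial free disk space.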

Where it fits

Animate a still portrait photo to speak or sing by providing an audio clip and running the inference script
Create talking-head videos for presentations or demos without recording real video footage
Try portrait animation in a browser using the hosted Hugging Face demo without installing anything locally
Train or fine-tune the model on custom image and audio data using the released training code
