hallo2
[ICLR 2025] Hallo2: Long-Duration and High-Resolution Audio-driven Portrait Image Animation
Hallo2 is a research tool from Fudan University that turns a still photo of a person into a talking-head video, driven by an audio recording. You provide one portrait image and one audio clip, and the system generates a video where the person appears to speak, with head movements and facial expressions that follow the rhythm and tone of the audio. The output can run at 4K resolution and stay consistent for videos up to an hour in length, which is longer than most similar tools manage before visual quality starts to drift.
The project was accepted at ICLR 2025, a major international conference on machine learning research. The showcase on the project page includes examples like a Taylor Swift speech at NYU (23 minutes, 4K) and a Stanford lecture (up to 1 hour), all animated from a single portrait image. The system is designed to maintain stable identity, consistent lighting, and natural motion across those long durations without the face warping or flickering that earlier approaches tend to produce.
Setting it up requires a Linux machine with a capable GPU. The documentation lists Ubuntu 20.04 or 22.04 and CUDA 11.8, and the authors tested on an A100 GPU. You install Python dependencies through conda and pip, then download a set of pretrained model weights from HuggingFace. Several component models are involved: one for separating vocals from audio, one for detecting and tracking facial landmarks, one for motion generation, and the core animation model itself.
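Before installing anything, it can help to confirm that the machine meets the basic requirements listed above. The following is a minimal sketch using only the Python standard library. It is not part of Hallo2. The function name `preflight` is hypothetical, and the checks (Linux, an `nvidia-smi` binary, a `conda` binary on the PATH) are assumptions about what a working setup usually looks like, not a test of the CUDA version or the GPU model.

```python
import platform
import shutil


def preflight(system=None, which=shutil.which):
    """Return a list of problems found; an empty list means nothing obvious is missing.

    `system` and `which` are injectable so the check can be exercised without
    depending on the machine it runs on.
    """
    system = system or platform.system()
    problems = []

    # The documentation lists Ubuntu, so anything other than Linux is flagged.
    if system != "Linux":
        problems.append(f"expected Linux, found {system}")

    # Assumption: an NVIDIA driver install exposes nvidia-smi, and conda is on PATH.
    for tool in ("nvidia-smi", "conda"):
        if which(tool) is None:
            problems.append(f"{tool} not found on PATH")

    return problems


if __name__ == "__main__":
    for problem in preflight():
        print("warning:", problem)
```

An empty result only means the obvious prerequisites are present; the CUDA version and available GPU memory still need to be checked against the repository's documentation.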
Once set up, you run inference by pointing the script at your portrait image and your audio file. The README includes example commands and links to a hosted demo on OpenBayes for trying the system without installing anything locally.
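As a rough illustration of what "pointing the script at your inputs" might look like, the sketch below assembles an inference command from an image path and an audio path. Everything specific in it is an assumption to verify against the README's example commands: the script path `scripts/inference_long.py`, the config path `configs/inference/long.yaml`, the flag names `--config`, `--source_image`, and `--driving_audio`, and the restriction to WAV audio. The helper `build_command` is hypothetical and only builds the argument list; it does not run anything.

```python
from pathlib import Path

# Assumed names, not confirmed by the text above; check the README before use.
SCRIPT = "scripts/inference_long.py"
CONFIG = "configs/inference/long.yaml"
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def build_command(image, audio, script=SCRIPT, config=CONFIG):
    """Return an argument list for one inference run, after basic input checks."""
    image, audio = Path(image), Path(audio)

    # Reject inputs that are clearly not a still image.
    if image.suffix.lower() not in IMAGE_SUFFIXES:
        raise ValueError(f"unexpected image type: {image.suffix!r}")

    # Assumption: the driving audio is expected as a WAV file.
    if audio.suffix.lower() != ".wav":
        raise ValueError(f"unexpected audio type: {audio.suffix!r}")

    return [
        "python", script,
        "--config", config,
        "--source_image", str(image),
        "--driving_audio", str(audio),
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so it can be reviewed first.
    print(" ".join(build_command("portrait.png", "speech.wav")))
```

Returning a list rather than a single string means the result can be passed directly to `subprocess.run` without shell quoting problems once the names have been confirmed.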
This is a research release aimed at people who want to study or build on the underlying technique. It is not a polished consumer product, and getting it running requires familiarity with Python environments and GPU computing. The code and pretrained weights are publicly available under the terms described in the repository.