REALM
REALM: A Coarse-to-Fine Generative Framework for Embodied Reactive Listening (Under Review)
A research system that generates realistic head movements and facial expressions for a listener in response to a speaker's audio, including experiments on a physical humanoid robot.
REALM, short for Reactive Embodied Audio-driven Listening Model, is a research project that generates realistic listener behavior in response to what a speaker is saying. Given only the speaker's audio, the system produces head movements and facial expressions that a listener would naturally show, timed to match the rhythm and content of the speech.
The approach handles two aspects of motion separately. A coarse stage predicts smooth, slow head motion. A finer stage then adds quick facial micro-expressions on top of that, using a small amount of controlled randomness to avoid the flat, averaged-out appearance that arises when generative models predict movement without any stochastic component. A gating mechanism models the natural delay a listener has before visibly reacting to what they hear, preventing the output from jumping in response to sounds before a human listener plausibly would.
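The interaction of the three pieces described above (slow coarse motion, noisy fine detail, and a reaction delay) can be illustrated with a toy signal-level sketch. This is not the authors' architecture, which is a learned model. Here an exponential moving average stands in for the coarse stage, gated Gaussian noise stands in for the fine stage, and a fixed frame delay stands in for the gating mechanism. All function names, parameters, and default values are hypothetical.

```python
import random


def smooth(signal, alpha=0.1):
    """Exponential moving average; a stand-in for the slow, coarse head-motion stage."""
    if not signal:
        return []
    out, state = [], signal[0]
    for x in signal:
        state = (1 - alpha) * state + alpha * x
        out.append(state)
    return out


def delay_gate(energy, delay_frames=5, threshold=0.5):
    """Gate is open at frame t only if the speaker was active delay_frames earlier."""
    gate = []
    for t in range(len(energy)):
        src = t - delay_frames
        gate.append(1.0 if src >= 0 and energy[src] > threshold else 0.0)
    return gate


def generate_listener_motion(energy, noise_scale=0.05, delay_frames=5, seed=0):
    """Combine gated coarse motion with gated stochastic fine detail (toy example)."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    gate = delay_gate(energy, delay_frames)
    coarse = smooth([g * e for g, e in zip(gate, energy)])
    fine = [g * rng.gauss(0.0, noise_scale) for g in gate]
    return [c + f for c, f in zip(coarse, fine)]


# Speaker is silent for 10 frames, then active for 10 frames.
energy = [0.0] * 10 + [1.0] * 10
motion = generate_listener_motion(energy, delay_frames=5)
```

In this sketch the listener stays still until five frames after the speaker starts, then begins a slow drift with small random variation layered on top. Setting `noise_scale` to zero leaves only the smooth component, which is the flat look the fine stage is meant to avoid.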
The project includes experiments deploying the generated motions onto a physical humanoid robot called Ameca, translating the output coefficients into hardware control values through an inverse kinematics mapping.
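The general shape of such a coefficient-to-hardware step can be sketched as follows. This is only an illustration: a simple linear range mapping with clamping stands in for the project's inverse kinematics mapping, and the joint names, angle ranges, and the normalized 0 to 1 control scale are all hypothetical rather than taken from the repository or from Ameca's control interface.

```python
# Hypothetical joint limits in degrees; real hardware limits will differ.
HYPOTHETICAL_JOINT_RANGES = {
    "head_yaw": (-45.0, 45.0),
    "head_pitch": (-30.0, 30.0),
    "head_roll": (-20.0, 20.0),
}


def to_control_values(pose_degrees, joint_ranges=HYPOTHETICAL_JOINT_RANGES):
    """Map predicted head angles to normalized control values in [0, 1]."""
    controls = {}
    for joint, angle in pose_degrees.items():
        low, high = joint_ranges[joint]
        clamped = min(max(angle, low), high)  # never command past a joint limit
        controls[joint] = (clamped - low) / (high - low)
    return controls


controls = to_control_values(
    {"head_yaw": 0.0, "head_pitch": 60.0, "head_roll": -20.0}
)
```

Clamping before normalizing means an out-of-range prediction saturates at the joint limit instead of producing a control value the hardware cannot reach.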
The accompanying paper is currently under review for academic publication. To comply with double-blind review requirements, the authors have withheld pre-trained model weights and disabled git clone access. You can download the source as a ZIP from the GitHub page. Training from scratch requires Python 3.10 and PyTorch 2.0 or later, and data preparation follows the process described in the ViCo Challenge Baseline repository. Training and inference scripts are provided, but without training your own model, inference is not practical until the weights are released after the review process concludes.
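Before attempting to train, it can help to confirm that the environment matches the stated requirements. The helper below is a minimal sketch and not part of the repository. It assumes Python 3.10 is a minimum rather than an exact pin, and the function names are hypothetical.

```python
import re
import sys


def version_tuple(version):
    """Extract leading major.minor numbers, e.g. '2.1.0+cu118' -> (2, 1)."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def check_environment(torch_version, python_version=None):
    """Return a list of problems; an empty list means the requirements are met."""
    if python_version is None:
        python_version = sys.version_info[:2]
    problems = []
    # Assumption: 3.10 is treated as a lower bound, not an exact requirement.
    if tuple(python_version) < (3, 10):
        problems.append("Python 3.10 or later is required")
    if version_tuple(torch_version) < (2, 0):
        problems.append("PyTorch 2.0 or later is required")
    return problems
```

With PyTorch installed, the check can be run as `check_environment(torch.__version__)` after `import torch`.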
Where it fits
- Train the REALM model from scratch on the ViCo Challenge dataset to generate listener head motions from speaker audio.
- Use REALM's output coefficients to drive a physical Ameca robot's head and face during a live conversation.
- Study how the two-stage coarse-to-fine motion generation avoids the averaged-out appearance common in generative listener models.
- Run inference with REALM once weights are released to generate a listener video response given only an audio clip of a speaker.