
REALM

Python · ★ 23 · updated 24d ago

REALM: A Coarse-to-Fine Generative Framework for Embodied Reactive Listening (Under Review)

A research system that generates realistic head movements and facial expressions for a listener in response to a speaker's audio, including experiments on a physical humanoid robot.

Python · PyTorch · setup: hard · complexity 5/5

REALM, short for Reactive Embodied Audio-driven Listening Model, is a research project that generates realistic listener behavior in response to what a speaker is saying. Given only the speaker's audio, the system produces head movements and facial expressions that a listener would naturally show, timed to match the rhythm and content of the speech.

The approach handles two aspects of motion separately. A coarse stage predicts smooth, slow head motion. A finer stage then adds quick facial micro-expressions on top of that, using a small amount of controlled randomness to avoid the flat, averaged-out appearance that arises when generative models predict movement without any stochastic component. A gating mechanism models the natural delay a listener has before visibly reacting to what they hear, preventing the output from jumping in response to sounds before a human listener plausibly would.
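The split described above can be illustrated with a toy sketch. This is not the authors' model: the moving average, the threshold gate, and every name here (`coarse_head_motion`, `reaction_gate`, `fine_expression`, `speaker_energy`) are assumptions chosen only to show the idea of a smooth base track, gated stochastic detail, and a reaction delay.

```python
import random


def coarse_head_motion(energy, window=5):
    """Coarse-stage stand-in: a causal moving average yields smooth, slow motion."""
    smoothed = []
    for t in range(len(energy)):
        start = max(0, t - window + 1)
        frames = energy[start:t + 1]
        smoothed.append(sum(frames) / len(frames))
    return smoothed


def reaction_gate(energy, threshold=0.5, delay=3):
    """Return 1.0 only once the speaker has been active for more than `delay` frames."""
    gate, active_run = [], 0
    for value in energy:
        active_run = active_run + 1 if value > threshold else 0
        gate.append(1.0 if active_run > delay else 0.0)
    return gate


def fine_expression(coarse, gate, noise_scale=0.05, seed=0):
    """Fine-stage stand-in: add gated Gaussian noise on top of the coarse track."""
    rng = random.Random(seed)
    return [c + g * rng.gauss(0.0, noise_scale) for c, g in zip(coarse, gate)]


# Hypothetical per-frame speaker loudness (0 = silence, 1 = speech).
speaker_energy = [0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0]
coarse = coarse_head_motion(speaker_energy)
gate = reaction_gate(speaker_energy)
fine = fine_expression(coarse, gate)
```

While the gate is closed, the fine track equals the coarse track, so the toy listener does not react to a sound the moment it starts. Once the gate opens, the seeded noise adds small frame-to-frame variation instead of a flat, averaged-out curve.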

The project includes experiments deploying the generated motions onto a physical humanoid robot called Ameca, translating the output coefficients into hardware control values through an inverse kinematics mapping.
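As a rough illustration of turning output coefficients into hardware control values, the sketch below uses a clamped linear rescaling per channel. It is a deliberately simpler stand-in for the inverse kinematics mapping mentioned above. The channel names, value ranges, and the functions `to_actuator` and `map_frame` are hypothetical and do not come from the repository or from any Ameca API.

```python
def to_actuator(value, src_range, dst_range):
    """Linearly rescale `value` from src_range to dst_range, clamped to dst_range."""
    (src_lo, src_hi), (dst_lo, dst_hi) = src_range, dst_range
    fraction = (value - src_lo) / (src_hi - src_lo)
    fraction = min(1.0, max(0.0, fraction))  # keep commands within hardware limits
    return dst_lo + fraction * (dst_hi - dst_lo)


# Hypothetical channel table: name -> (model output range, actuator range).
CHANNELS = {
    "head_yaw": ((-30.0, 30.0), (0.0, 1.0)),
    "jaw_open": ((0.0, 1.0), (0.0, 255.0)),
}


def map_frame(frame):
    """Convert one frame of model coefficients into actuator commands."""
    return {name: to_actuator(frame[name], *CHANNELS[name]) for name in CHANNELS}


commands = map_frame({"head_yaw": 15.0, "jaw_open": 0.5})
```

Clamping matters on physical hardware: a generated coefficient outside the expected range is held at the actuator limit rather than passed through.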

The paper behind the repository is currently under review for academic publication. To comply with double-blind review requirements, the authors have withheld pre-trained model weights and disabled git clone access; the source can still be downloaded as a ZIP from the GitHub page. Training from scratch requires Python 3.10 and PyTorch 2.0 or later, and data preparation follows the process described in the ViCo Challenge Baseline repository. Training and inference scripts are provided, but unless you train your own model, inference cannot be run usefully until the weights are released after the review process concludes.
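Before attempting to train, it may help to confirm the environment meets the stated requirements. The check below is a minimal sketch that is not part of the repository. It assumes Python 3.10 is a minimum rather than an exact pin, and the helper names `meets_minimum` and `check_environment` are made up for illustration.

```python
import re
import sys


def meets_minimum(version, minimum):
    """True if a dotted version string such as '2.1.0+cu118' is >= `minimum`."""
    parts = []
    for piece in version.split("+")[0].split(".")[:len(minimum)]:
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    parts += [0] * (len(minimum) - len(parts))  # treat "2" as "2.0"
    return tuple(parts) >= tuple(minimum)


def check_environment():
    """Return a list of human-readable problems; an empty list means all good."""
    problems = []
    if sys.version_info[:2] < (3, 10):
        problems.append("Python 3.10 or later is assumed to be required")
    try:
        import torch
    except ImportError:
        problems.append("PyTorch is not installed")
    else:
        if not meets_minimum(torch.__version__, (2, 0)):
            problems.append(f"PyTorch {torch.__version__} is older than 2.0")
    return problems


for problem in check_environment():
    print("Environment issue:", problem)
```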

Where it fits