kimodo
Official implementation of Kimodo, a kinematic motion diffusion model for high-quality human(oid) motion generation.
Kimodo is an AI model from NVIDIA that generates realistic 3D human and robot movement from text descriptions. Given a text prompt like "a person walks forward then sits down," Kimodo produces a corresponding sequence of poses (a motion clip) that can be used in animation, game development, robotics simulation, or research.
The model is trained on 700 hours of optical motion capture data. Optical motion capture records actual human movement using reflective markers and cameras. Beyond text prompts, Kimodo can also be controlled by more precise constraints: full-body pose keyframes (exact positions at specific moments in time), end-effector positions (where hands or feet should be at given points), 2D paths, and 2D waypoints. This gives animators and robotics engineers fine-grained control over the output.
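To make the four kinds of control concrete, here is a minimal sketch of how a generation request combining a text prompt with constraints could be represented. Every class and field name below is hypothetical and chosen for illustration; none of it is Kimodo's actual API. The split between a dense 2D path and sparse, timed 2D waypoints is also an assumption about how the two differ.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Vec2 = Tuple[float, float]          # point on the ground plane
Vec3 = Tuple[float, float, float]   # point in 3D space


@dataclass
class PoseKeyframe:
    """Hypothetical: a full-body pose pinned at one frame."""
    frame: int
    joint_positions: Dict[str, Vec3]


@dataclass
class EndEffectorTarget:
    """Hypothetical: where one hand or foot should be at one frame."""
    frame: int
    joint: str
    position: Vec3


@dataclass
class Waypoint2D:
    """Hypothetical: a ground-plane location to reach at one frame."""
    frame: int
    position: Vec2


@dataclass
class MotionRequest:
    """Hypothetical container: a text prompt plus optional constraints."""
    prompt: str
    num_frames: int
    keyframes: List[PoseKeyframe] = field(default_factory=list)
    end_effectors: List[EndEffectorTarget] = field(default_factory=list)
    path_2d: List[Vec2] = field(default_factory=list)  # assumed dense, untimed
    waypoints_2d: List[Waypoint2D] = field(default_factory=list)

    def validate(self) -> None:
        """Raise ValueError if a timed constraint falls outside the clip."""
        timed = [*self.keyframes, *self.end_effectors, *self.waypoints_2d]
        for constraint in timed:
            if not 0 <= constraint.frame < self.num_frames:
                raise ValueError(
                    f"frame {constraint.frame} is outside 0..{self.num_frames - 1}"
                )


request = MotionRequest(
    prompt="a person walks forward then sits down",
    num_frames=120,
    end_effectors=[EndEffectorTarget(frame=60, joint="left_hand",
                                     position=(0.3, 1.1, 0.4))],
    waypoints_2d=[Waypoint2D(frame=90, position=(2.0, 0.0))],
)
request.validate()  # passes: both constraints lie inside the 120-frame clip
```

Grouping the constraints by type keeps each one optional, so a request can range from text-only to fully keyframed. The repository documents the real input formats.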
Several model variants are available, covering different skeleton formats. Some support human motion using the SOMA and SMPL-X body models (standard body representations used in research), and some target humanoid robots using the Unitree G1 robot skeleton. Models are downloaded automatically on first use from Hugging Face (a popular AI model hosting platform).
The repository includes a command-line tool for generating motions, an interactive timeline-based demo for authoring animations, a benchmarking suite for comparing motion generation models, and training data annotations. Running the model locally requires approximately 17GB of GPU video memory (VRAM). The code is written in Python and released by NVIDIA Research.
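Before installing, it can help to check whether a local GPU is likely to meet the roughly 17GB VRAM requirement. The sketch below is not part of the Kimodo repository; its function names are made up for illustration. It relies on the standard `nvidia-smi` query for total memory, which reports values in MiB, and it treats "17GB" as 17 * 1024 MiB, which is an approximation. Total memory is only an upper bound, since other processes may already occupy part of it.

```python
import shutil
import subprocess
from typing import List

# Assumption: "approximately 17GB" is interpreted as 17 * 1024 MiB.
REQUIRED_MIB = 17 * 1024


def parse_total_memory_mib(smi_output: str) -> List[int]:
    """Parse one integer MiB value per line, skipping blank or non-numeric lines."""
    values = []
    for line in smi_output.splitlines():
        text = line.strip()
        if text.isdigit():
            values.append(int(text))
    return values


def gpus_with_enough_memory(totals_mib: List[int],
                            required_mib: int = REQUIRED_MIB) -> List[int]:
    """Return the indices of GPUs whose total memory meets the threshold."""
    return [i for i, total in enumerate(totals_mib) if total >= required_mib]


def query_total_memory_mib() -> List[int]:
    """Ask nvidia-smi for per-GPU total memory; return [] if it is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=False,
    )
    if result.returncode != 0:
        return []
    return parse_total_memory_mib(result.stdout)
```

Calling `gpus_with_enough_memory(query_total_memory_mib())` returns the indices of candidate GPUs. An empty list means either that no GPU meets the threshold or that `nvidia-smi` is not available.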