DynaFLIP
A pretrained AI model that helps robots connect what they see, what they are told, and how objects move in 3D space, enabling robot learning systems to understand and follow real-world instructions.
DynaFLIP is a research project that produces a pretrained AI model designed to help robots better understand their surroundings. It was published alongside a research paper and is aimed at researchers and engineers working on robot learning, not general-purpose developers.
The core idea is that robots need to connect three different kinds of information at once: what they see in images, what they are told in natural language instructions, and how objects are moving in 3D space. DynaFLIP trains a model that learns to align these three types of information in a shared mathematical space, so that an image of a scene, the phrase "close the fridge," and a recording of how a hand moved all end up represented in a way that the model can compare and relate to each other. The model is built from three separate encoders, one for images, one for language, and one for 3D motion trajectories, that are trained together.
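As a rough illustration of what a shared space makes possible, the sketch below compares embeddings from different modalities using cosine similarity. The vectors are invented four-dimensional toy values, not DynaFLIP outputs, and cosine similarity is an assumed comparison measure rather than one confirmed by the project.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must share one dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings, invented for illustration only.
image_emb = [0.9, 0.1, 0.0, 0.2]     # image of a fridge with its door open
text_close = [0.8, 0.2, 0.1, 0.1]    # "close the fridge"
text_drawer = [0.0, 0.9, 0.3, 0.0]   # "open the drawer"
motion_close = [0.7, 0.0, 0.1, 0.3]  # hand trajectory pushing a door shut

# Because all three modalities live in one space, any pair can be compared.
sim_text_close = cosine_similarity(image_emb, text_close)
sim_text_drawer = cosine_similarity(image_emb, text_drawer)
sim_motion_close = cosine_similarity(image_emb, motion_close)
```

In a well-aligned space, the matching instruction and the matching motion would both score higher against the image than an unrelated instruction does, as they do for these toy values.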
To use the model without training it, load the pretrained version from Hugging Face; this requires two standard Python packages. Once loaded, the model takes an image and returns an embedding (a numerical vector describing the scene), or takes text and returns an embedding of the instruction. These embeddings can then feed into downstream robot control systems.
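One plausible way such representations could feed a downstream controller is sketched below: the image and instruction vectors are concatenated into a single observation and passed to a small policy head. The function names, the linear policy, and all numbers are hypothetical and not part of DynaFLIP; the vectors stand in for whatever the loaded model would return.

```python
def build_policy_input(image_emb, text_emb):
    """Concatenate image and instruction embeddings into one observation."""
    return list(image_emb) + list(text_emb)

def linear_policy(observation, weights, bias):
    """Hypothetical linear head: one action value per row of weights."""
    if any(len(row) != len(observation) for row in weights):
        raise ValueError("each weight row must match the observation length")
    return [
        sum(w * x for w, x in zip(row, observation)) + b
        for row, b in zip(weights, bias)
    ]

# Stand-in embeddings (toy values, not real model outputs).
image_emb = [1.0, 0.0]
text_emb = [0.0, 2.0]

observation = build_policy_input(image_emb, text_emb)

# Toy parameters for a two-dimensional action.
weights = [[0.5, 0.0, 0.0, 0.25],
           [0.0, 1.0, 1.0, 0.0]]
bias = [0.0, 0.1]

action = linear_policy(observation, weights, bias)
```

A real controller would likely use a learned, nonlinear policy; the point here is only that the embeddings are ordinary vectors that can be combined and consumed like any other feature.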
For researchers who want to train their own version, the repository includes the full training code, a configuration file, and a script for multi-GPU training. Training requires a dataset of robot demonstrations with matching images, language annotations, and 3D motion data, with the dataset paths configured before the training run starts. A separate script converts a trained checkpoint into the Hugging Face format for distribution.
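The requirement that each demonstration carry matching images, language, and 3D motion can be pictured as a simple record check. The sketch below is an assumption about how such a record might look; the `Demonstration` class, its field names, and the example paths are hypothetical and do not reflect the repository's actual dataset format.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Demonstration:
    image_path: str                                # frame from the demonstration
    instruction: str                               # language annotation
    trajectory: List[Tuple[float, float, float]]   # 3D points over time

def validation_errors(demo):
    """Return a list of problems that would make the record unusable."""
    errors = []
    if not demo.image_path:
        errors.append("missing image path")
    if not demo.instruction.strip():
        errors.append("missing language annotation")
    if len(demo.trajectory) < 2:
        errors.append("trajectory needs at least two 3D points")
    if any(len(point) != 3 for point in demo.trajectory):
        errors.append("trajectory points must be (x, y, z)")
    return errors

# Hypothetical records: one complete, one missing its annotation and motion.
good = Demonstration("frames/0001.png", "close the fridge",
                     [(0.0, 0.0, 0.0), (0.1, 0.0, 0.05)])
bad = Demonstration("frames/0002.png", "  ", [(0.0, 0.0, 0.0)])

good_errors = validation_errors(good)
bad_errors = validation_errors(bad)
```

Running a check like this before training starts catches records where one of the three modalities is missing, which would otherwise leave the model with nothing to align.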
The project builds on several established research tools, including DINOv2 for image encoding, T5 for language encoding, and PyTorch Lightning as the training framework. It is released under the Apache 2.0 license.
Where it fits
- Load the pretrained model from Hugging Face to get numerical embeddings of images or instructions for a robot control system.
- Train a custom version of DynaFLIP on your own robot demonstration dataset with matching images, language, and 3D motion recordings.
- Use DynaFLIP embeddings to build a robot that can follow natural language instructions like "open the drawer" based on what it sees.