VLM-R1
Solve Visual Understanding with Reinforced VLMs
Research framework that trains vision-language AI models using reinforcement learning instead of supervised fine-tuning, achieving better out-of-domain generalization on visual object detection and understanding tasks.
VLM-R1 is a research project that applies reinforcement learning training techniques to vision-language models. Vision-language models are AI systems that can analyze images and respond to natural language instructions about them. The "R1" name refers to a training style inspired by DeepSeek-R1, a language model that showed strong improvements from reinforcement learning over standard supervised training.
The central finding of this project is that reinforcement learning generalizes better than supervised fine-tuning (SFT). When models are tested on out-of-domain data, meaning data that differs from the training distribution, SFT models begin to perform worse as training continues, while the RL-trained models keep improving. This difference in out-of-domain performance is the core motivation for the work.
The project applies this training approach to two visual tasks. The first is Referring Expression Comprehension, where the model locates a specific object in an image based on a natural language description. The second is Open-Vocabulary Detection, where the model detects objects from categories not seen during training. Beyond these two tasks, a VLM-R1 math reasoning model reached first place on a public multimodal math leaderboard for models under 4 billion parameters.
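Because Referring Expression Comprehension asks for a bounding box, a prediction can be checked automatically against the ground-truth box, which is what makes the task a natural fit for reinforcement learning. The following is a minimal sketch of one common way to do that, using intersection over union (IoU). The function names `box_iou` and `rec_reward`, the `(x1, y1, x2, y2)` box format, and the 0.5 threshold are assumptions for illustration; the summary above does not state which reward the project actually uses.

```python
from typing import Sequence


def box_iou(a: Sequence[float], b: Sequence[float]) -> float:
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b

    # Width and height of the overlap region, clamped at zero when disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    area_a = max(0.0, ax2 - ax1) * max(0.0, ay2 - ay1)
    area_b = max(0.0, bx2 - bx1) * max(0.0, by2 - by1)
    union = area_a + area_b - inter

    # Degenerate boxes have zero union; treat that as no overlap.
    return inter / union if union > 0 else 0.0


def rec_reward(pred: Sequence[float], gt: Sequence[float],
               threshold: float = 0.5) -> float:
    """Hypothetical binary reward: 1.0 if the predicted box overlaps enough."""
    return 1.0 if box_iou(pred, gt) >= threshold else 0.0
```

For example, `rec_reward((0, 0, 10, 10), (0, 0, 10, 8))` would count as a hit, since the two boxes overlap with an IoU of 0.8. A continuous variant could return the IoU itself instead of thresholding it.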
Training uses the GRPO algorithm (Group Relative Policy Optimization), which scores each sampled response relative to the other responses sampled for the same prompt. The codebase supports full fine-tuning, LoRA fine-tuning (a parameter-efficient approach that trains small low-rank adapter matrices instead of all model weights), multi-node training, and inputs containing multiple images. It works with Qwen2.5-VL and InternVL base models, with documentation for adding new architectures.
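The "group relative" part of GRPO can be illustrated in a few lines. For each prompt, several responses are sampled and scored, and each response's advantage is its reward normalized by the mean and standard deviation of its group. The sketch below shows only that normalization step, not the policy update; the function name `group_relative_advantages` and the `eps` value are assumptions for illustration and are not taken from the VLM-R1 codebase.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float],
                              eps: float = 1e-6) -> List[float]:
    """Normalize the rewards of one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so a
    response is judged against its siblings rather than a learned critic.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    # eps avoids division by zero when every response earns the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]
```

With rewards `[1.0, 0.0, 0.0, 1.0]`, the two correct responses receive advantages close to +1 and the two incorrect ones close to -1. When every response in a group earns the same reward, all advantages are zero and that group contributes no learning signal.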
Pre-trained model checkpoints and datasets are available on HuggingFace, along with interactive demos. A technical report is published on arXiv. The project also includes inference support for Huawei Ascend hardware.
Where it fits
- Fine-tune a vision-language model to locate objects in images from text descriptions using GRPO reinforcement learning.
- Reproduce DeepSeek-R1-style RL training on a multimodal model using Qwen2.5-VL or InternVL as the base.
- Train a model to detect objects from categories never seen during training using the open-vocabulary detection setup.
- Run distributed training of a vision-language model across multiple machines using the provided multi-node configuration.