Qwen-VLA


The official repository of Qwen-VLA

Qwen-VLA is an AI model from Alibaba's Qwen research team designed to control physical robots. The name stands for Vision-Language-Action: the model takes in visual input (camera images), understands natural language instructions, and outputs actions that a robot can execute. It is built on top of Qwen3.5-4B, Alibaba's 4-billion-parameter language and vision model, combined with a separate 1.15-billion-parameter module specifically for generating continuous robot movement commands.

What makes this model notable in robotics AI is that it is designed to work across different robots and different tasks using a single set of weights. Most competing systems train a separate model for each robot platform or each task. Qwen-VLA instead uses a text prompt to tell the model which robot it is controlling, so the same weights can drive a robot arm doing pick-and-place, a mobile robot navigating a building, and other tasks without retraining. The README calls this embodiment-aware prompt conditioning.
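The idea of telling the model which robot it is controlling through text can be sketched as follows. This is a minimal illustration, not the Qwen-VLA prompt format: the `Embodiment` class, the `EMBODIMENTS` table, the `build_prompt` function, the prompt wording, and the action dimensions are all hypothetical, since the README does not publish code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Embodiment:
    """Hypothetical description of a robot platform (not from the Qwen-VLA repo)."""
    name: str
    action_dim: int
    description: str


# Illustrative platforms only; names and action dimensions are assumptions.
EMBODIMENTS = {
    "arm": Embodiment("single_arm", 7, "a 7-DoF robot arm with a parallel gripper"),
    "mobile": Embodiment("mobile_base", 3, "a wheeled mobile robot with a planar base"),
}


def build_prompt(embodiment: Embodiment, instruction: str) -> str:
    """Prefix the task instruction with a text description of the robot.

    The same model weights would receive a different prefix per robot,
    which is the conditioning signal in this sketch.
    """
    return (
        f"You are controlling {embodiment.description}. "
        f"Output {embodiment.action_dim}-dimensional actions.\n"
        f"Task: {instruction.strip()}"
    )


if __name__ == "__main__":
    for key in ("arm", "mobile"):
        print(build_prompt(EMBODIMENTS[key], "go to the red block"))
        print("---")
```

In this sketch only the text prefix changes between robots; nothing about the model is swapped out, which is the property the paragraph above describes.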

The technical report describes a four-stage training process: large-scale pretraining on action data, continued training on combined language and action data, supervised fine-tuning, and reinforcement learning. Benchmark results in the README show the model matching or outperforming specialist models that were each trained for a single benchmark, across both simulated environments and real-world robot evaluations on platforms including the ALOHA bimanual robot.
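The ordering of the four stages can be written down as a small data structure. This is only a sketch of the sequence named in the summary; the `Stage` enum, its member names, and `next_stage` are hypothetical and do not come from the repository or the technical report.

```python
from enum import IntEnum
from typing import Optional


class Stage(IntEnum):
    """The four training stages, in the order the summary lists them.

    Member names are illustrative labels, not identifiers from Qwen-VLA.
    """
    ACTION_PRETRAINING = 1
    LANGUAGE_ACTION_CONTINUED_TRAINING = 2
    SUPERVISED_FINE_TUNING = 3
    REINFORCEMENT_LEARNING = 4


def next_stage(stage: Stage) -> Optional[Stage]:
    """Return the stage that follows `stage`, or None after the last one."""
    if stage is Stage.REINFORCEMENT_LEARNING:
        return None
    return Stage(stage + 1)


if __name__ == "__main__":
    current: Optional[Stage] = Stage.ACTION_PRETRAINING
    while current is not None:
        print(f"{current.value}. {current.name.replace('_', ' ').lower()}")
        current = next_stage(current)
```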

The repository is the official release from the Qwen team and links to a technical report on arXiv, a blog post, and a video demo. The README does not include installation instructions or code for running the model; it is primarily a research announcement and benchmark summary. The language field is listed as unknown, which suggests the repository may not yet contain significant source code beyond documentation.
