ACoT-VLA-WM

Python ★ 105 updated 14d ago

A research system that trains robots to perform multi-step physical tasks by generating predicted images of future workspace states as visual subgoals, achieving 100 percent success on five industrial manipulation tasks versus 80 percent for the baseline.

Python · uv · setup: hard · complexity 5/5

ACoT-VLA-WM is a research project focused on improving how robots handle complex, multi-step physical tasks. It extends an earlier system called ACoT-VLA by adding a predictive world model, which generates images of what the robot's workspace should look like at future moments. These predicted images are used during training as visual subgoals, giving the robot concrete guidance about intermediate steps rather than only the final target state.

The central problem this addresses is that high-level instructions alone are not enough for reliable manipulation. When a robot is told to pick up a scanner and scan several QR codes, it needs to understand what each phase of that task looks like in practice. By training with future-frame predictions from multiple camera angles simultaneously, the system learns to anticipate and execute each stage with more physical precision.

During training, the pipeline mixes three categories of subgoal images. The majority come from randomly sampling a real future frame between zero and four seconds ahead, which builds tolerance to timing variation. A smaller portion comes from the terminal frame of each recorded sub-step. The remaining portion comes from the world model itself, which generates predicted frames it has not seen before. This combination is designed to handle differences in execution speed and reduce failure from small physical disturbances.
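The three-way mix described above can be sketched as a weighted choice per training sample. This is a minimal illustration, not the repository's code: the function name `pick_subgoal`, the weights in `SOURCE_WEIGHTS`, and the source labels are all hypothetical. The summary only says "majority", "smaller portion", and "remaining", so the 0.6 / 0.2 / 0.2 split is an assumption chosen to respect that ordering.

```python
import random

# Hypothetical mixing weights; the source gives no exact proportions.
SOURCE_WEIGHTS = {
    "random_future": 0.6,     # real frame sampled 0 to 4 s ahead (the majority)
    "substep_terminal": 0.2,  # last frame of the current recorded sub-step
    "world_model": 0.2,       # frame generated by the world model
}
MAX_HORIZON_S = 4.0  # upper bound of the random look-ahead window, per the text


def pick_subgoal(t_now, substep_end, episode_end, rng, weights=SOURCE_WEIGHTS):
    """Choose a subgoal source and the timestamp of the frame to load.

    Returns (source, target_time). target_time is None for "world_model",
    because that frame is generated rather than looked up in the recording.
    """
    source = rng.choices(list(weights), weights=list(weights.values()), k=1)[0]
    if source == "random_future":
        offset = rng.uniform(0.0, MAX_HORIZON_S)
        # Clamp so the target never runs past the end of the recording.
        return source, min(t_now + offset, episode_end)
    if source == "substep_terminal":
        return source, substep_end
    return source, None


rng = random.Random(0)
for _ in range(3):
    print(pick_subgoal(t_now=1.0, substep_end=3.0, episode_end=10.0, rng=rng))
```

Sampling the look-ahead offset uniformly, rather than fixing it, is what exposes the policy to subgoals at varying temporal distances, which matches the stated aim of tolerating differences in execution speed.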

On five industrial manipulation tasks, each tested ten times, the baseline ACoT-VLA system achieved an 80 percent overall success rate. The version described here reached 100 percent across the same tasks, including one involving scanning five codes on a reflective marble surface.

The code is Python, uses a tool called uv for dependency management, and expects multiple GPUs for training. Separate scripts cover dataset preprocessing, normalization statistics, model training, and deployment. The world model itself is trained in a companion repository.

Where it fits

Train a robot manipulation policy using predicted future images as intermediate visual targets for multi-step physical tasks.
Extend the ACoT-VLA baseline with world model subgoals to handle tasks that require precise positioning at each stage.
Benchmark industrial robot manipulation performance on tasks like QR code scanning with fine-grained step-level supervision.
Use the companion world model repository to generate predicted future frames for augmenting a robot training dataset.
