PF-OPSD
World Models Meet Language Models: On the Complementarity of Concrete and Abstract Reasoning
An AI research project that teaches language models to reason about spatial puzzles and physical video events by combining world model predictions with language reasoning, plus two new benchmark datasets.
PF-OPSD is a research project exploring how to combine two types of AI systems: world models, which generate visual predictions of what will happen next in a scene, and multimodal language models, which can reason abstractly about goals, rules, and questions. The authors identify a problem that arises when you simply plug these together: a world model can generate visually plausible future frames that are still wrong for the specific task at hand, and the language model does not automatically know when to trust a simulation or how to weigh it against its own text-based reasoning.
The paper, available on arXiv, calls this challenge controlled concrete reasoning and makes three contributions.
The first is VRQABench, a benchmark of 4,636 questions built from maze navigation and Sokoban puzzle images. Because the correct answer to a spatial puzzle can be computed with a search algorithm, answers are verified programmatically against ground truth rather than hand-labeled.
The second is OpenWorldQA, a benchmark of 4,404 questions about predicting physical outcomes from real-world video footage. Its questions come from a five-stage pipeline of AI agents that extracts a pre-event frame from a video, designs plausible question-answer sets, generates misleading but plausible wrong answers, filters out questions that are too easy using a smaller model, and accepts only items that pass a quality review.
The third is the PF-OPSD training method itself. During training, a teacher model is given ground-truth future video as privileged context, and the student model learns to reason as if it had seen those futures, even though it will not have access to them at test time.
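To make the idea of programmatic verification concrete, here is a minimal sketch of how a maze answer could be checked with breadth-first search. The grid encoding (`S` for start, `G` for goal, `#` for walls) and the function names `shortest_path_length` and `check_answer` are assumptions made for illustration; the repository's own maze format and search code may differ.

```python
from collections import deque


def shortest_path_length(grid):
    """Return the fewest moves from 'S' to 'G', or None if 'G' is unreachable.

    The grid is a list of equal-length strings (an assumed encoding):
    'S' start, 'G' goal, '#' wall, any other character is open floor.
    """
    rows, cols = len(grid), len(grid[0])
    start = next(
        (r, c) for r in range(rows) for c in range(cols) if grid[r][c] == "S"
    )
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        (r, c), dist = queue.popleft()
        if grid[r][c] == "G":
            return dist
        # Explore the four orthogonal neighbours.
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] != "#" and (nr, nc) not in seen:
                seen.add((nr, nc))
                queue.append(((nr, nc), dist + 1))
    return None


def check_answer(grid, claimed_length):
    """Accept a candidate answer only if it matches the searched ground truth."""
    return shortest_path_length(grid) == claimed_length


# Hypothetical example maze, not taken from the benchmark.
MAZE = [
    "S.#",
    ".##",
    "..G",
]
```

With this maze, `check_answer(MAZE, 4)` is true and `check_answer(MAZE, 3)` is false, because the only route runs down the left column and along the bottom row. Sokoban needs a richer state (player plus box positions), but the same search-then-compare pattern applies.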
Running the code requires Python, specific video datasets, and an API key for an external language model that drives the dataset construction pipelines. Prebuilt versions of both benchmark datasets are available on Hugging Face for researchers who want to evaluate models without rebuilding the datasets from scratch.
This repository is aimed at AI researchers working on vision and reasoning. The code is structured into three independent parts: dataset construction for VRQABench, dataset construction for OpenWorldQA, and the training pipeline for the PF-OPSD method.
Where it fits
- Evaluate a vision-language model on spatial reasoning using VRQABench, 4,636 maze and Sokoban puzzle questions with verifiable ground-truth answers.
- Benchmark a model on predicting physical video outcomes using the 4,404-question OpenWorldQA dataset.
- Train a model with the PF-OPSD method so it learns to reason about future events even without access to future frames at test time.
- Use the five-stage AI pipeline to generate your own question-answer dataset from action videos.
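The five-stage pipeline in the last item can be pictured as a chain of stages in which any stage may reject an item. The sketch below is only a skeleton under stated assumptions: the `Item` fields, the stage function names, and the stubbed return values are all hypothetical, and in the real project each stage would call an external language model rather than return fixed strings.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Item:
    """One candidate question under construction (hypothetical schema)."""
    video_id: str
    frame: Optional[str] = None
    question: Optional[str] = None
    answer: Optional[str] = None
    distractors: List[str] = field(default_factory=list)


# A stage returns the (possibly updated) item, or None to reject it.
Stage = Callable[[Item], Optional[Item]]


def extract_pre_event_frame(item: Item) -> Optional[Item]:
    item.frame = f"{item.video_id}:pre_event"  # stub for frame extraction
    return item


def design_question(item: Item) -> Optional[Item]:
    item.question = "What happens next?"  # stub for an LLM call
    item.answer = "the cup falls"
    return item


def add_distractors(item: Item) -> Optional[Item]:
    item.distractors = ["the cup stays put", "the cup slides left"]  # stub
    return item


def make_difficulty_filter(small_model_is_correct: Callable[[Item], bool]) -> Stage:
    """Reject items that a smaller model already answers correctly."""
    def stage(item: Item) -> Optional[Item]:
        return None if small_model_is_correct(item) else item
    return stage


def quality_review(item: Item) -> Optional[Item]:
    ok = (
        bool(item.question)
        and bool(item.answer)
        and item.answer not in item.distractors
        and len(item.distractors) >= 2
    )
    return item if ok else None


def build_stages(small_model_is_correct: Callable[[Item], bool]) -> List[Stage]:
    return [
        extract_pre_event_frame,
        design_question,
        add_distractors,
        make_difficulty_filter(small_model_is_correct),
        quality_review,
    ]


def run_pipeline(item: Item, stages: List[Stage]) -> Optional[Item]:
    """Pass the item through each stage, stopping at the first rejection."""
    for stage in stages:
        result = stage(item)
        if result is None:
            return None
        item = result
    return item
```

For example, `run_pipeline(Item("clip_001"), build_stages(lambda it: False))` returns a fully populated item, while passing `lambda it: True` as the small-model check causes the item to be dropped at the difficulty filter. Keeping each stage as a plain function makes it straightforward to swap a stub for a real model call.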