SpatialClaw


SpatialClaw: Rethinking Action Interface for Agentic Spatial Reasoning

SpatialClaw is a research framework from NVIDIA and KAIST that helps AI systems answer questions about where objects are, how they relate to each other, and how they move in three-dimensional space. It was designed to improve the performance of large vision-language models (AI systems that can both see images and understand text) on tasks that require thinking about depth, distance, and physical arrangement.

The core idea is to let the AI write Python code one step at a time instead of trying to figure out everything at once. The framework keeps a running Python session loaded with image analysis tools, including a segmentation tool (SAM3, which can identify and outline objects in images), a depth estimation tool (Depth-Anything-3, which estimates how far away things are), and standard math and visualization libraries. The AI agent writes one short code block, sees the results, then writes the next block based on what it learned, repeating until it is confident enough to give a final answer.
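The write-run-observe loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not SpatialClaw's actual code: `run_block`, `agent_loop`, `propose_block`, and the `FINAL_ANSWER` convention are hypothetical names, the model is replaced by a fixed script, and the depth values are made up rather than produced by SAM3 or Depth-Anything-3.

```python
import contextlib
import io


def run_block(code, namespace):
    """Execute one code block in the shared namespace and return what it printed."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        try:
            exec(code, namespace)
        except Exception as exc:
            # Report the error as output so the agent can react to it.
            print(f"Error: {type(exc).__name__}: {exc}")
    return buffer.getvalue()


def agent_loop(propose_block, max_steps=10):
    """Alternate between requesting a code block and running it.

    propose_block(history) is a hypothetical stand-in for the model call. It
    receives the (code, output) pairs so far and returns the next block.
    The loop stops once a block assigns a variable named FINAL_ANSWER.
    """
    namespace = {}  # persists across steps, like a running Python session
    history = []
    for _ in range(max_steps):
        code = propose_block(history)
        output = run_block(code, namespace)
        history.append((code, output))
        if "FINAL_ANSWER" in namespace:
            return namespace["FINAL_ANSWER"], history
    return None, history


# Scripted stand-in for the model: three fixed steps over made-up depth values.
SCRIPT = [
    "depths = {'chair': 2.5, 'table': 4.0}  # metres, made-up values\nprint(depths)",
    "gap = depths['table'] - depths['chair']\nprint(gap)",
    "FINAL_ANSWER = 'chair' if gap > 0 else 'table'",
]


def scripted_model(history):
    """Return the next scripted block based on how many steps have run."""
    return SCRIPT[len(history)]


answer, history = agent_loop(scripted_model)
print(answer)
```

Because the namespace dictionary survives between steps, the second block can read `depths` defined by the first, which is the property that lets each step build on what the previous one found.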

Running it requires at least one machine with a GPU and two or three separate services: a model-serving process, a perception-tool server for the heavier image analysis tasks, and the agent itself, which manages the code-execution sessions. The README includes setup scripts and instructions for both single-machine and high-performance computing cluster (SLURM) environments, though it notes that downloading model weights and configuring the cluster take additional steps beyond the basic install.

The project was evaluated across 20 spatial reasoning benchmarks, covering tasks like estimating distances in photos, understanding object arrangements, and reasoning about motion over time. It reports 59.9% average accuracy across those benchmarks, outperforming the previous best comparable agent by 11.2 percentage points. It also works with six different AI backbone models ranging from 26 billion to 397 billion parameters, without any benchmark-specific tuning.
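The margin is stated in percentage points, not as a relative improvement, and the two differ. The short calculation below works out what the two reported numbers imply. The previous-best score and the relative gain are derived here from 59.9 and 11.2, not quoted from the project, and the variable names are illustrative.

```python
spatialclaw_avg = 59.9  # reported average accuracy, in percent
margin_pp = 11.2        # reported lead, in percentage points

# A percentage-point lead is a plain difference between two accuracies.
previous_best = spatialclaw_avg - margin_pp

# The relative improvement divides that difference by the earlier score.
relative_gain = margin_pp / previous_best * 100

print(f"implied previous best: {previous_best:.1f}%")
print(f"relative improvement:  {relative_gain:.1f}%")
```

Under these figures the earlier agent would sit near 48.7% average accuracy, so an 11.2-point lead corresponds to a relative improvement of roughly 23%.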

This is a research codebase, not a finished product. It is aimed at researchers and engineers studying how AI agents handle spatial perception, not end users looking for a ready-made application.
