InstructSAM

Python ★ 84 updated 24d ago

The code for "InstructSAM: Segment Any Instance with Any Instructions"

This is a research project that lets you point an AI model at an image and describe, in plain language, what you want highlighted. You can name a category such as "cat", write a phrase referring to a specific thing such as "the woman in the red jacket", or pose a reasoning-style question. The model returns a separate pixel-level mask for each matching object in the image, rather than one merged region covering everything that fits the description.

The system is built on a vision-and-language model with two billion parameters and is trained in two stages. The first stage teaches the model to follow instructions and produce segmentation masks. The second stage, described as reasoning fine-tuning, improves its ability to handle more complex or indirect descriptions. Both training scripts and evaluation scripts are included in the repository. The training data, called Inst2Seg, and a benchmark dataset are published on Hugging Face.

To run the model on a single image, you provide an image path and a text query. The script prints the generated text and mask confidence scores, then writes image files showing the masks overlaid on the original photo. Pre-trained model weights are available on Hugging Face, so you do not need to run training yourself to try the model.
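The overlay step described above can be illustrated with a small, self-contained sketch. This is not the repository's inference script: the function name `overlay_masks`, the `min_score` threshold, and the `alpha` blend factor are all hypothetical, and the toy arrays stand in for a real photo and real model output. It only shows one common way to blend per-instance boolean masks onto an image once masks and confidence scores are available.

```python
import numpy as np

def overlay_masks(image, masks, scores, min_score=0.5, alpha=0.5):
    """Blend each sufficiently confident instance mask onto a copy of the image.

    image:  (H, W, 3) uint8 array
    masks:  iterable of (H, W) boolean arrays, one per instance
    scores: iterable of floats, one confidence per mask
    Returns the blended uint8 image and the number of masks drawn.
    """
    out = image.astype(np.float32)      # astype returns a new array
    rng = np.random.default_rng(0)      # fixed seed so colors are reproducible
    kept = 0
    for mask, score in zip(masks, scores):
        if score < min_score:           # hypothetical threshold, not from the repo
            continue
        color = rng.integers(0, 256, size=3).astype(np.float32)
        # Alpha-blend the instance color into the pixels covered by this mask.
        out[mask] = (1.0 - alpha) * out[mask] + alpha * color
        kept += 1
    return out.round().astype(np.uint8), kept

# Toy stand-ins for a photo and two predicted instance masks.
image = np.zeros((4, 4, 3), dtype=np.uint8)
mask_a = np.zeros((4, 4), dtype=bool)
mask_a[:2, :2] = True
mask_b = np.zeros((4, 4), dtype=bool)
mask_b[2:, 2:] = True

blended, kept = overlay_masks(image, [mask_a, mask_b], [0.9, 0.2])
print(kept, blended.shape)
```

Because each instance has its own mask and score, low-confidence instances can be dropped individually before drawing, which would not be possible with a single merged region.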

Setup requires creating a Python 3.10 environment and installing several packages. One of them, flash-attn, requires compilation from source. A GPU is necessary for running the model. The project was released in May 2026 and links to an accompanying research paper on arXiv.
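Before attempting the setup, it can help to check the requirements listed above. The following is a minimal pre-flight sketch, not part of the repository: the function name `check_environment` is hypothetical, and it assumes the project runs on PyTorch (flash-attn is a PyTorch extension, but the exact package list is not given here). It uses only the standard library unless `torch` is already installed.

```python
import importlib.util
import sys

def check_environment():
    """Report whether the interpreter roughly matches the stated requirements."""
    report = {
        # The setup notes call for a Python 3.10 environment.
        "python_3_10": sys.version_info[:2] == (3, 10),
        # flash-attn is imported as `flash_attn` once it has been built.
        "flash_attn_installed": importlib.util.find_spec("flash_attn") is not None,
        # Assumption: the model runs on PyTorch.
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "cuda_available": False,
    }
    if report["torch_installed"]:
        import torch  # imported lazily so the check works without torch
        report["cuda_available"] = bool(torch.cuda.is_available())
    return report

if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```

A `missing` entry for `flash_attn_installed` usually just means the source build has not been run yet in the active environment.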
