Grounded-Segment-Anything

Jupyter Notebook ★ 18k updated 1y ago

Grounded SAM: Marrying Grounding DINO with Segment Anything & Stable Diffusion & Recognize Anything - Automatically Detect, Segment and Generate Anything

Grounded-Segment-Anything combines text-prompt object detection and pixel-level masking to find and outline any object in an image just by typing its name, with optional editing via Stable Diffusion.

Python, PyTorch, Jupyter Notebook, Stable Diffusion, Hugging Face; setup: hard; complexity 4/5

Grounded-Segment-Anything is a project from IDEA Research that wires several open-source AI vision models together into one pipeline. The idea is to pick out any object in an image by typing a short phrase, for example "the red bag", and have the system find that object, draw a tight outline around it, and optionally edit or describe it.

It does this by chaining specialist models. Grounding DINO is an open-vocabulary object detector, meaning you give it a text prompt and it locates whatever you described in the image with a box, without needing to be retrained for each new category. Segment Anything (SAM) takes those boxes and produces a precise pixel-level mask, the actual outline of the object. The pipeline can then hand that mask to Stable Diffusion to edit the region, or to Recognize Anything (RAM) to automatically generate descriptive tags. The README is explicit that all parts are independent: any piece can be used on its own or replaced with a similar model, like swapping in a different detector or a different image generator.
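The chaining described above can be sketched in a few lines. This is a minimal illustration only, not the project's code: `toy_detector` and `toy_segmenter` are hypothetical stand-ins for Grounding DINO and SAM, the "image" is a small grid of numbers, and the function names and signatures are assumptions made for this example. What it shows is the shape of the hand-off (text prompt to boxes, boxes to masks) and why each stage can be swapped independently.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]   # (x0, y0, x1, y1), upper bounds exclusive
Image = List[List[int]]           # toy grayscale image: rows of pixel values
Mask = List[List[bool]]           # True where a pixel belongs to the object


@dataclass
class Detection:
    label: str
    box: Box


# Any callable with these shapes can be plugged into the pipeline.
Detector = Callable[[Image, str], List[Detection]]
Segmenter = Callable[[Image, Box], Mask]


def toy_detector(image: Image, prompt: str) -> List[Detection]:
    """Hypothetical stand-in for a text-prompted detector.

    It ignores the meaning of the prompt and simply returns the bounding
    box of all non-zero pixels, labeled with the prompt text.
    """
    coords = [(x, y) for y, row in enumerate(image)
              for x, value in enumerate(row) if value > 0]
    if not coords:
        return []
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return [Detection(prompt, (min(xs), min(ys), max(xs) + 1, max(ys) + 1))]


def toy_segmenter(image: Image, box: Box) -> Mask:
    """Hypothetical stand-in for a box-prompted segmenter.

    It marks every non-zero pixel that falls inside the box.
    """
    x0, y0, x1, y1 = box
    return [[x0 <= x < x1 and y0 <= y < y1 and value > 0
             for x, value in enumerate(row)]
            for y, row in enumerate(image)]


def grounded_segment(image: Image, prompt: str,
                     detector: Detector,
                     segmenter: Segmenter) -> List[Tuple[Detection, Mask]]:
    """Run detection first, then segment each detected box."""
    return [(det, segmenter(image, det.box)) for det in detector(image, prompt)]


image = [
    [0, 0, 0, 0],
    [0, 9, 9, 0],
    [0, 9, 0, 0],
    [0, 0, 0, 0],
]
results = grounded_segment(image, "the red bag", toy_detector, toy_segmenter)
for det, mask in results:
    print(det.label, det.box, sum(cell for row in mask for cell in row))
```

Because the pipeline only depends on the two callable shapes, replacing the detector or the segmenter means passing a different function, which mirrors the independence the README describes.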

You would reach for it when you need to find and outline objects in images from a text description rather than from labeled training data. The README highlights uses like automatic data labeling, open-vocabulary detection and segmentation, image editing, and data generation. There is also a follow-on project, Grounded SAM 2, for tracking objects across video.

The repository is primarily Jupyter Notebooks, so you can run and inspect each stage interactively, and the project lists hosted demos on Hugging Face Spaces, Colab, Replicate, and ModelScope.

Where it fits

Automatically label objects in images for a training dataset by describing what to find in plain text instead of drawing boxes manually.
Edit a specific region of a photo, like replacing a background, using a text description to select it and Stable Diffusion to change it.
Generate descriptive tags for a library of images automatically using the RAM tagging model in the pipeline.
Track specific objects across video frames using the follow-on Grounded SAM 2 project.
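For the data-labeling use case, detections usually need to be written out in a format that training tools understand. The sketch below is one possible approach, not something taken from the repository: it assumes detections arrive as hypothetical `(label, (x0, y0, x1, y1))` pairs and converts them to COCO-style annotation records, where `bbox` is stored as `[x, y, width, height]`. The function name and input shape are assumptions for illustration.

```python
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels


def to_coco_annotations(image_id: int,
                        detections: List[Tuple[str, Box]],
                        categories: Dict[str, int]) -> List[dict]:
    """Convert corner-format boxes into COCO-style annotation dicts."""
    annotations = []
    for ann_id, (label, (x0, y0, x1, y1)) in enumerate(detections, start=1):
        width, height = x1 - x0, y1 - y0
        annotations.append({
            "id": ann_id,
            "image_id": image_id,
            "category_id": categories[label],
            "bbox": [x0, y0, width, height],
            # Box area is used here because this sketch carries no mask.
            "area": width * height,
            "iscrowd": 0,
        })
    return annotations


anns = to_coco_annotations(
    image_id=7,
    detections=[("bag", (10, 20, 50, 80))],
    categories={"bag": 1},
)
print(anns)
```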
