GroundingDINO

Python ★ 10k updated 1y ago

[ECCV 2024] Official implementation of the paper "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection"

An AI model, published at ECCV 2024, that finds and locates objects in images from text descriptions you write, rather than from a fixed list of pre-trained categories.

Python · PyTorch · Hugging Face Transformers · setup: moderate · complexity: 3/5

Grounding DINO is an AI model that finds and locates objects in images based on text descriptions you provide. Traditional object detection models can only identify things from a fixed list of categories they were trained on, such as car, person, or chair. Grounding DINO takes a different approach: you describe what you want to find in plain language, and the model searches the image for it. This is called open-set detection, because the set of things it can detect is open-ended rather than fixed at training time.

The model combines two existing ideas. DINO (DETR with Improved deNoising anchOr boxes) is a Transformer-based object detector. The project pairs it with grounded pre-training, adding a language understanding component so that text descriptions and image regions can be matched against each other. The result is a model that scores highly on standard object detection benchmarks even without being trained on those datasets: the README reports 52.5 AP on the COCO benchmark with zero COCO training data, and 63.0 AP when fine-tuned.

The research was published at ECCV 2024 and is implemented in Python using PyTorch. Pretrained model weights are available for download, and the model can be loaded through Hugging Face's transformers library. A live demo runs on Hugging Face Spaces.
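As a rough illustration of the transformers route, the sketch below loads a checkpoint and runs one text-prompted detection. It is an assumption-laden example rather than the repository's own code: the checkpoint name IDEA-Research/grounding-dino-tiny, the helper names build_prompt and detect, and the min_score cutoff are all chosen here for illustration. The prompt format (lowercase phrases, each ending in a period) follows the convention described in the Hugging Face model documentation, and the exact keys returned by post-processing can differ between transformers versions.

```python
def build_prompt(labels):
    """Join category phrases into one lowercase, period-terminated prompt."""
    return " ".join(f"{label.strip().lower().rstrip('.')}." for label in labels)


def detect(image, labels, model_id="IDEA-Research/grounding-dino-tiny", min_score=0.35):
    """Return (label, score, [x_min, y_min, x_max, y_max]) tuples for a PIL image.

    model_id and min_score are illustrative assumptions, not project defaults.
    """
    # Heavy imports live here so build_prompt stays usable without PyTorch.
    import torch
    from transformers import AutoModelForZeroShotObjectDetection, AutoProcessor

    processor = AutoProcessor.from_pretrained(model_id)
    model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id)

    inputs = processor(images=image, text=build_prompt(labels), return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # target_sizes expects (height, width); PIL's image.size is (width, height).
    result = processor.post_process_grounded_object_detection(
        outputs, inputs.input_ids, target_sizes=[image.size[::-1]]
    )[0]

    # Newer transformers releases expose matched phrases under "text_labels".
    label_key = "text_labels" if "text_labels" in result else "labels"
    return [
        (name, float(score), box.tolist())
        for name, score, box in zip(result[label_key], result["scores"], result["boxes"])
        if float(score) >= min_score
    ]


# Example call (downloads weights on first use):
#   from PIL import Image
#   detections = detect(Image.open("photo.jpg"), ["cat", "remote control"])
```

Post-processing is left at the library's default thresholds and filtered afterwards, which keeps the sketch independent of how a given transformers version names its threshold arguments.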

The repository also documents how Grounding DINO can be combined with other models. Pairing it with Segment Anything Model (SAM) lets you not only find objects but also draw precise outlines around them, and Grounded SAM 2 extends this to tracking objects across video frames. Pairing it with Stable Diffusion supports image editing, where you first locate a region with text and then modify it. A newer, more capable version, Grounding DINO 1.5, is available separately through an API. The project is widely used in computer vision research pipelines, automated dataset annotation, and building custom detection systems without training data.

Where it fits

Automatically detect and locate any object in an image by typing a text description, without training a custom model.
Combine with Segment Anything Model to draw precise outlines around objects you describe in plain language.
Build automated dataset annotation pipelines that label images using text prompts instead of manual bounding boxes.
Create image editing workflows that find specific regions by text description and apply modifications to just those areas.
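
For the dataset annotation use case, one possible final step is turning text-prompted detections into COCO-style annotation records. The sketch below is a minimal example under stated assumptions: detections are taken to be (label, score, [x_min, y_min, x_max, y_max]) tuples in pixel coordinates, and the function name to_coco_annotations and the category_ids mapping are hypothetical. COCO stores boxes as [x, y, width, height], so the corner format is converted.

```python
def to_coco_annotations(detections, image_id, category_ids, first_annotation_id=1):
    """Convert (label, score, [x_min, y_min, x_max, y_max]) tuples to COCO-style dicts.

    category_ids maps a detected phrase to the dataset's integer category id;
    detections whose phrase is not in the mapping are skipped.
    """
    annotations = []
    next_id = first_annotation_id
    for label, score, (x_min, y_min, x_max, y_max) in detections:
        if label not in category_ids:
            continue  # phrase not mapped to a dataset category
        width, height = x_max - x_min, y_max - y_min
        annotations.append({
            "id": next_id,
            "image_id": image_id,
            "category_id": category_ids[label],
            "bbox": [x_min, y_min, width, height],  # COCO uses x, y, width, height
            "area": width * height,
            "iscrowd": 0,
            "score": score,  # not part of COCO ground truth; kept for human review
        })
        next_id += 1
    return annotations


sample = [("cat", 0.9, [10.0, 20.0, 110.0, 220.0]), ("sofa", 0.5, [0.0, 0.0, 5.0, 5.0])]
annotations = to_coco_annotations(sample, image_id=7, category_ids={"cat": 1})
```

Keeping the score alongside each record makes it easier to review low-confidence labels by hand before they are treated as ground truth.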
