ml-ferret

Python ★ 8.7k updated 1y ago

Apple research AI model that can accept a drawn region in an image as input and return answers that refer back to specific image locations, designed for spatial visual reasoning and UI understanding.

Python · PyTorch · setup: hard · complexity 5/5

Ferret is a research project from Apple that explores a specific capability in AI vision models: pointing at a region of an image and asking questions about it, then receiving answers that point back to specific locations in the image. Most AI image models can describe a whole image or answer general questions about it. Ferret instead works with fine-grained references, such as a drawn box, a point, or a freehand scribble, and responds by identifying where specific things are in the image.
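To make the idea of a region reference concrete, here is a minimal sketch of how a drawn box or scribble could be turned into a short coordinate string to place in a text prompt. It is an illustration only and is not taken from the Ferret code: the names `Box`, `quantize`, `box_to_text`, and `scribble_to_box`, the 1000-bin grid, and the bracketed output format are all assumptions. Reducing a scribble to its bounding box also discards the shape, which is a simplification made here for brevity.

```python
from dataclasses import dataclass
from typing import List, Tuple

NUM_BINS = 1000  # assumed grid resolution, chosen for illustration


@dataclass
class Box:
    """Axis-aligned region in pixel coordinates (hypothetical helper type)."""
    x1: float
    y1: float
    x2: float
    y2: float


def quantize(value: float, size: int, bins: int = NUM_BINS) -> int:
    """Map a pixel coordinate in [0, size] to an integer bin in [0, bins - 1]."""
    scaled = int(value * bins / size)
    return max(0, min(bins - 1, scaled))


def box_to_text(box: Box, width: int, height: int) -> str:
    """Render a box as resolution-independent text, e.g. '[100, 100, 500, 500]'."""
    coords = [
        quantize(box.x1, width),
        quantize(box.y1, height),
        quantize(box.x2, width),
        quantize(box.y2, height),
    ]
    return "[" + ", ".join(str(c) for c in coords) + "]"


def scribble_to_box(points: List[Tuple[float, float]]) -> Box:
    """Reduce a freehand scribble to its bounding box (loses the drawn shape)."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return Box(min(xs), min(ys), max(xs), max(ys))


if __name__ == "__main__":
    region = Box(64, 48, 320, 240)
    print("What is in region", box_to_text(region, width=640, height=480), "?")
```

Quantizing to a fixed grid keeps the text form independent of image resolution, which is one plausible reason to prefer it over raw pixel values.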

The project includes three components. The Ferret model is the core research model that accepts image regions as input and produces responses that refer to image locations. GRIT is a large dataset of about 1.1 million examples used to train the model on this type of grounding and referring task. Ferret-Bench is an evaluation dataset for testing how well models handle this combination of visual reasoning, knowledge, and spatial grounding.
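Grounding output is often checked by comparing a predicted box against a reference box. The sketch below shows intersection over union (IoU), a standard overlap measure in grounding work. It is included only to illustrate what scoring a spatial answer can look like, and it is not claimed to be how Ferret-Bench scores responses. The function name `iou` and the `(x1, y1, x2, y2)` tuple layout are assumptions.

```python
from typing import Tuple

BoxTuple = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed layout


def iou(a: BoxTuple, b: BoxTuple) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle, if there is one.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])

    # Clamp to zero so disjoint boxes give no intersection area.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter

    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    predicted = (0, 0, 10, 10)
    reference = (5, 5, 15, 15)
    print(f"IoU = {iou(predicted, reference):.3f}")
```

A threshold such as IoU of at least 0.5 is a common convention for counting a predicted box as correct, though the right cutoff depends on the task.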

A follow-on version called Ferret-UI applies the same ideas specifically to user interface screenshots, enabling the model to understand and reason about buttons, menus, and other UI elements in a screen image.

Running Ferret requires significant GPU resources. Training was done on 8 GPUs with 80 GB of memory each. To run the interactive demo, you need to download the model weights, start several server processes locally, and have a compatible GPU available.

The code and data are released for research use only under non-commercial licenses. The model was published at ICLR 2024 as a spotlight paper. This is an academic research release, not a finished product.

Where it fits

Research fine-grained visual grounding where a model answers questions about a specific drawn region
Evaluate how well a vision model handles spatial references using Ferret-Bench
Train or fine-tune a grounding model using the GRIT dataset of 1.1 million region-text examples
Study AI understanding of UI screenshots and interface elements with Ferret-UI
