ml-fastvlm

Python ★ 7.4k updated 1y ago

This repository contains the official implementation of "FastVLM: Efficient Vision Encoding for Vision Language Models" - CVPR 2025

Apple research project that cuts the time a vision-language model takes to begin responding by up to 85x, by having its image encoder produce fewer intermediate tokens. Pretrained models run on iPhones, iPads, and Macs.

Python · PyTorch · Swift · iOS · Apple Silicon · LLaVA · setup: hard · complexity 4/5

FastVLM is a research project from Apple that makes AI models faster at understanding images. Specifically, it addresses the bottleneck that occurs when an AI model has to process a high-resolution photo before it can say anything about it. The project introduces a new image-processing component called FastViTHD that produces fewer intermediate tokens, which means the model can start generating a response much sooner.
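To make the link between encoder design and token count concrete, here is a minimal sketch. It assumes a square image and an encoder that emits one token per spatial cell after shrinking each side by a fixed factor. The function name and the two downsampling factors are hypothetical values chosen for illustration; they are not taken from the repository.

```python
def visual_token_count(image_size: int, downsample: int) -> int:
    """Number of tokens for a square image when each side shrinks by `downsample`.

    Illustrative model only: one token per cell of the final feature map.
    """
    if image_size <= 0 or downsample <= 0:
        raise ValueError("image_size and downsample must be positive")
    if image_size % downsample != 0:
        raise ValueError("image_size must be divisible by downsample")
    side = image_size // downsample
    return side * side


if __name__ == "__main__":
    # Hypothetical downsampling factors, used only to show the scaling.
    for label, factor in [("16x downsampling", 16), ("64x downsampling", 64)]:
        tokens = visual_token_count(1024, factor)
        print(f"{label}: {tokens} tokens for a 1024x1024 image")
```

Under these assumptions, the token count falls with the square of the downsampling factor, so the language model has far fewer image tokens to process before it can emit its first output token.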

The practical result is a much shorter delay before the model produces its first output token. The smallest FastVLM variant begins responding up to 85 times sooner than a comparable model, and the variant built on a 7-billion-parameter language model is nearly 8 times faster than competing approaches on the same measure, while matching or exceeding their accuracy scores. These results were published at CVPR 2025, a major computer vision conference.

The models come in three sizes: 0.5B, 1.5B, and 7B, where the number refers to the parameter count of the language part of the model. Pretrained weights are available for download, and running inference on a standard computer requires only a few setup commands and a Python script. The repository also includes a dedicated export path for running the models on Apple Silicon devices, and a demo iOS app that shows the model working on a mobile device.

This is primarily a research release aimed at developers and researchers who want to experiment with fast vision-language models, fine-tune their own variants, or understand the technical approach described in the paper. The training pipeline builds on the existing LLaVA codebase, so anyone already familiar with that project will find the workflow recognizable.

Where it fits

Run a fast vision-language model on an iPhone or iPad using the included demo iOS app
Fine-tune a FastVLM variant on your own image dataset using the LLaVA-based training pipeline
Export a FastVLM model for Apple Silicon to power an on-device image question-answering feature in a macOS app
Benchmark FastVLM against other vision-language models to validate speed gains for a production use case
