LAVIS
LAVIS - A One-stop Library for Language-Vision Intelligence
A Python research library from Salesforce that bundles multiple vision-and-language AI models, including BLIP-2 and InstructBLIP, for image captioning, visual question answering, and text-to-image generation.
LAVIS is a Python library from Salesforce AI Research that brings together a collection of AI models that process images and text jointly. These models can describe what is in a photo, answer questions about an image, match images to relevant text descriptions, and follow natural-language instructions paired with visual input.
The library is meant to make it easier for researchers and developers to try out these vision-and-language models without rebuilding everything from scratch. It provides a consistent interface so you can load different models, run them on images or videos, and evaluate them on standard benchmarks with the same code patterns. It also includes tools for loading common research datasets used in this field.
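As a rough illustration of that shared interface, the sketch below captions a single image. It assumes the `salesforce-lavis` package, PyTorch, and Pillow are installed, and that the `blip_caption` / `base_coco` names match the model zoo of the installed version. The function name `caption_image` and any image path passed to it are hypothetical and used only for this example.

```python
def caption_image(image_path, name="blip_caption", model_type="base_coco"):
    """Return the captions generated for one image file.

    Assumption: `name` and `model_type` exist in the installed LAVIS model zoo.
    """
    # Heavy imports stay inside the function so that defining it does not
    # require the dependencies to be loaded.
    import torch
    from PIL import Image
    from lavis.models import load_model_and_preprocess

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # One call returns the model plus its matching image and text preprocessors.
    model, vis_processors, _ = load_model_and_preprocess(
        name=name, model_type=model_type, is_eval=True, device=device
    )

    raw_image = Image.open(image_path).convert("RGB")
    # Preprocess, add a batch dimension, and move the tensor to the device.
    image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)

    return model.generate({"image": image})
```

Calling `caption_image("photo.jpg")` with a different `name` and `model_type` is intended to swap the model without changing the rest of the code, which is the point of the common interface.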
Several notable models are included. BLIP-2 is a general image-language model that can be paired with a large language model to answer questions or generate descriptions. InstructBLIP extends that with instruction-following capabilities, meaning you can give it a task in plain English alongside an image. BLIP-Diffusion is a text-to-image generation model. X-InstructBLIP adds support for video, audio, and 3D input in addition to images.
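Instruction following can be pictured as passing a plain-English prompt next to the image. The sketch below is an assumption-laden example rather than a reference implementation: it assumes the `blip2_vicuna_instruct` / `vicuna7b` names are available in the installed version and that any language-model weights the checkpoint depends on have been set up as the project documentation describes. `build_instruction_sample` and `run_instruction` are hypothetical helper names.

```python
def build_instruction_sample(image_tensor, instruction):
    """Pair a preprocessed image batch with a plain-English instruction."""
    return {"image": image_tensor, "prompt": instruction}


def run_instruction(image_path, instruction,
                    name="blip2_vicuna_instruct", model_type="vicuna7b"):
    """Generate a response to `instruction` about the image at `image_path`.

    Assumption: the named checkpoint and its language-model weights are
    available locally or downloadable.
    """
    import torch
    from PIL import Image
    from lavis.models import load_model_and_preprocess

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model, vis_processors, _ = load_model_and_preprocess(
        name=name, model_type=model_type, is_eval=True, device=device
    )

    raw_image = Image.open(image_path).convert("RGB")
    image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)

    return model.generate(build_instruction_sample(image, instruction))
```

A call such as `run_instruction("photo.jpg", "What is unusual about this image?")` shows the pattern: the task is stated in the prompt rather than selected through a task-specific model.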
The library is installable from PyPI, and the README includes working Jupyter notebook examples for captioning images, answering visual questions, and extracting features. Full documentation and a benchmark comparison table are hosted separately.
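For a sense of what the visual question answering examples look like, here is a minimal sketch. It assumes the PyPI package name `salesforce-lavis` and the `blip_vqa` / `vqav2` checkpoint names; `answer_question` is a hypothetical wrapper written for this example, not a function provided by the library.

```python
# Install first from a shell (assumed package name): pip install salesforce-lavis


def answer_question(image_path, question):
    """Return the model's answers to `question` about the image at `image_path`."""
    import torch
    from PIL import Image
    from lavis.models import load_model_and_preprocess

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model, vis_processors, txt_processors = load_model_and_preprocess(
        name="blip_vqa", model_type="vqav2", is_eval=True, device=device
    )

    raw_image = Image.open(image_path).convert("RGB")
    image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
    # The text processor normalizes the question before it reaches the model.
    text = txt_processors["eval"](question)

    return model.predict_answers(
        samples={"image": image, "text_input": text},
        inference_method="generate",
    )
```

The first call downloads model weights, so it needs network access and some disk space; later calls reuse the cached files.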
LAVIS is released under a BSD 3-Clause license. It is primarily a research tool rather than a consumer product, so using it assumes familiarity with Python and machine learning workflows.
Where it fits
- Run BLIP-2 or InstructBLIP on your own images to generate captions or answer natural-language questions about them.
- Benchmark multiple vision-language models against standard research datasets using consistent evaluation code.
- Experiment with image-language models for a custom application without rebuilding training infrastructure from scratch.