TaskMatrix

Python ★ 34k updated 2y ago

A research system that lets you edit and understand images through conversation by connecting a language model to specialized visual AI tools.

Python, CUDA, OpenAI API, Hugging Face, LangChain, Stable Diffusion, Segment Anything
setup: hard, complexity: 4/5

TaskMatrix is a research project from Microsoft that connects a large language model (such as ChatGPT) to a collection of specialized visual AI tools, so users can work with images through natural conversation. The core idea is that a general-purpose language model is very good at understanding instructions and planning but cannot manipulate images directly, while dedicated "visual foundation models" like Stable Diffusion, ControlNet, and BLIP are extremely good at image tasks but require specific prompts and programmatic calls. TaskMatrix acts as the bridge between these two worlds.

When you type a request such as "turn this photo into a sketch and then colorize it" or "find all the cats in this image and segment them," the language model interprets what you want, decides which sequence of visual tools to use, calls them in order, and returns the result as part of the conversation. Each visual capability (generating images from text, editing by instruction, extracting depth maps, answering questions about an image, detecting objects by description, and more) is wrapped as a pluggable module that you load onto whatever GPU memory is available.
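The plan-then-call flow described above can be illustrated with a minimal sketch. It is not TaskMatrix's actual code: `Tool`, `make_registry`, and `run_plan` are hypothetical names, the tools are string-returning stand-ins rather than real models, and the plan is hard-coded where the real system would ask the language model to choose it.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Tool:
    """Hypothetical wrapper around one visual capability."""
    name: str
    description: str            # text a language model could read when choosing tools
    device: str                 # assumed placement label, e.g. "cuda:0" or "cpu"
    run: Callable[[str], str]   # takes an image reference, returns a new one


def make_registry(tools: List[Tool]) -> Dict[str, Tool]:
    """Index the loaded tools by name so a plan can refer to them."""
    return {tool.name: tool for tool in tools}


def run_plan(registry: Dict[str, Tool], plan: List[str], image: str) -> str:
    """Call each named tool in order, feeding every output into the next tool."""
    for name in plan:
        image = registry[name].run(image)
    return image


# Stand-in tools: they only record what would have been done to the image.
registry = make_registry([
    Tool("sketch", "Convert a photo into a line sketch", "cuda:0",
         lambda img: f"sketch({img})"),
    Tool("colorize", "Add color to a sketch", "cuda:0",
         lambda img: f"colorize({img})"),
])

# In the real system the language model would produce this sequence
# from the user's request; here it is fixed for illustration.
plan = ["sketch", "colorize"]
result = run_plan(registry, plan, "photo.png")
print(result)  # colorize(sketch(photo.png))
```

Keeping a description on each tool is the part that matters for the design: the language model never touches pixels, it only reads descriptions and emits tool names.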

The project introduces a "template" concept, where complex multi-step workflows can be pre-defined and reused. For example, extending an image infinitely outward in any direction is handled by a template that chains together image captioning, inpainting, and visual question-answering models without any additional training.
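A template of this kind might look roughly like the sketch below, under stated assumptions. The function `extend_image`, the three callables passed to it, and the question put to the question-answering model are all hypothetical; the stand-in functions only record the order of calls so the chaining is visible without any real model.

```python
from typing import Callable, List


def extend_image(
    image: str,
    directions: List[str],
    caption_fn: Callable[[str], str],
    vqa_fn: Callable[[str, str], str],
    inpaint_fn: Callable[[str, str, str], str],
) -> str:
    """Hypothetical template: grow the image one direction at a time."""
    for direction in directions:
        # Describe the current image so the new region matches its content.
        description = caption_fn(image)
        # Ask a follow-up question to pin down a detail (assumed question).
        style = vqa_fn(image, "What is the style of this image?")
        # Fill the newly exposed border using the combined prompt.
        image = inpaint_fn(image, f"{description}, {style}", direction)
    return image


calls: List[str] = []  # records which stand-in ran, in order


def fake_caption(image: str) -> str:
    calls.append("caption")
    return "a beach at sunset"


def fake_vqa(image: str, question: str) -> str:
    calls.append("vqa")
    return "watercolor"


def fake_inpaint(image: str, prompt: str, direction: str) -> str:
    calls.append("inpaint")
    return f"{image}+{direction}"


extended = extend_image(
    "beach.png", ["left", "right"], fake_caption, fake_vqa, fake_inpaint
)
print(extended)  # beach.png+left+right
```

Because the template only composes existing models, swapping any of the three callables for a different model changes the result without retraining anything.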

You would use TaskMatrix if you are a researcher exploring how AI agents can coordinate multiple specialized models, or if you want to experiment with a conversational interface for sophisticated image editing and understanding tasks. It is a Python project that requires a CUDA GPU for most visual models, uses OpenAI's API for the language model backbone, and integrates tools from Hugging Face, LangChain, and Meta's Segment Anything Model.

Where it fits

Turn a photo into a sketch and then colorize it through natural conversation.
Find and segment all objects of a specific type (like cats) in an image by describing what you want.
Extend an image infinitely outward in any direction using chained inpainting and captioning.
Answer questions about image content and perform multi-step visual editing without writing code.
