mmf
A modular framework for vision & language multimodal research from Facebook AI Research (FAIR)
MMF is a research framework from Facebook AI Research for building and experimenting with AI models that work with both images and text at the same time. This area of research is called multimodal AI, because it combines multiple types of input (visual and language) rather than just one. For example, a multimodal model might answer questions about a photo, generate captions for images, or detect hateful content that pairs an image with text.
The framework is built on top of PyTorch, a widely used deep learning library. It is designed to be modular, meaning researchers can swap out individual components like the dataset loader, the model architecture, or the training loop without rewriting everything else. It also supports distributed training, which means it can spread work across multiple GPUs or machines to handle large experiments faster.
MMF includes reference implementations of several published research models, and it has been used internally at Facebook for a number of AI research projects. It also served as the official starter codebase for several public AI challenges, including the Hateful Memes challenge, where teams compete to detect hateful content in memes that combine an image with text, and the TextVQA challenge, where teams compete to build better models for answering questions that require reading text inside images.
The project was called Pythia before being renamed to MMF. Installation instructions and full documentation live at mmf.sh rather than in the repository itself. The README is brief and points to the external documentation site for most setup and usage details.