mae

Python ★ 8.3k updated 1y ago ▣ archived

PyTorch implementation of MAE: https://arxiv.org/abs/2111.06377

This repository contains a PyTorch implementation of Masked Autoencoders (MAE), a self-supervised technique for training image recognition models developed by researchers at Facebook AI Research (FAIR). The core idea is to hide a large random subset of an image's patches and train the model to reconstruct the missing parts. Because the training signal comes from the image itself, the model learns rich visual features without needing labeled data.
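The masking step can be illustrated with a minimal sketch in plain Python. This is not the repository's code: the function names `random_masking` and `masked_mse` are hypothetical, the 75% mask ratio is the default reported in the MAE paper rather than something stated in this summary, and patches are represented as simple lists of numbers instead of tensors.

```python
import random

def random_masking(num_patches, mask_ratio=0.75, seed=None):
    """Randomly split patch indices into a visible set and a masked set.

    mask_ratio=0.75 is an assumption taken from the MAE paper's default.
    """
    rng = random.Random(seed)
    num_keep = int(num_patches * (1 - mask_ratio))
    shuffled = list(range(num_patches))
    rng.shuffle(shuffled)
    keep = sorted(shuffled[:num_keep])     # patches the encoder would see
    masked = sorted(shuffled[num_keep:])   # patches the model must reconstruct
    return keep, masked

def masked_mse(pred, target, masked):
    """Mean squared error computed over the masked patches only."""
    total, count = 0.0, 0
    for i in masked:
        for p, t in zip(pred[i], target[i]):
            total += (p - t) ** 2
            count += 1
    return total / count

# A 224x224 image cut into 16x16 patches gives a 14x14 grid, i.e. 196 patches.
keep, masked = random_masking(196, mask_ratio=0.75, seed=0)
print(len(keep), len(masked))  # 49 147
```

Scoring the reconstruction only on the hidden patches, as `masked_mse` does, means the model gets no credit for copying the pixels it was already shown.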

The training happens in two phases. First, the model is pre-trained on a large collection of unlabeled images using the masking approach. Then the pre-trained model is fine-tuned on a labeled dataset for a specific task, such as classifying what object is in a photo. The researchers found this approach produces models that generalize well: the same pre-trained weights perform strongly across a variety of image recognition benchmarks, including tests that involve sketches, corrupted images, and adversarial examples designed to fool classifiers.

Pre-trained model weights are available for three model sizes: ViT-Base, ViT-Large, and ViT-Huge. These names refer to the Vision Transformer architecture, a type of neural network that processes images by dividing them into patches and treating those patches much as language models treat words. The largest model (ViT-Huge with 448-pixel input) achieved 87.8% top-1 accuracy on ImageNet, which the paper reported as the best result among methods trained only on ImageNet-1K data at the time of publication.
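To make the "patches as words" idea concrete, here is a small sketch of how an image might be cut into a sequence of flattened patches. It is illustrative only: `patchify` here is a hypothetical helper written with plain lists, the image is a single-channel 4x4 grid, and a real Vision Transformer would additionally project each patch to an embedding and add position information.

```python
def patchify(image, patch_size):
    """Split an H x W image (a list of rows) into flattened patches.

    Patches are returned in row-major order: left to right, top to bottom.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [image[r][c]
                     for r in range(top, top + patch_size)
                     for c in range(left, left + patch_size)]
            patches.append(patch)
    return patches

# Toy 4x4 single-channel image whose pixel values are 0..15.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
tokens = patchify(image, 2)
print(len(tokens))  # 4
print(tokens[0])    # [0, 1, 4, 5]
```

Each entry of `tokens` plays the role that a word token plays in a language model: one element of the sequence the transformer attends over.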

The repository includes code for the visualization demo, fine-tuning on new datasets, and running the pre-training process from scratch. A Colab notebook lets anyone try the visualization without a local GPU. The project is released under the CC-BY-NC 4.0 license, which allows non-commercial use.
