vision_transformer
Google Research's code and pre-trained models for Vision Transformer (ViT) and MLP-Mixer, image recognition models that treat image patches like words in a sentence, achieving top results on large datasets and available for fine-tuning on custom images.
This repository, published by Google Research, contains the code and pre-trained models from several research papers on image recognition. The central idea behind the Vision Transformer (ViT) is to treat an image the way a language model treats a sequence of words: the image is sliced into small fixed-size patches, and the resulting sequence of patches is fed through a Transformer, the architecture widely used in natural language processing. This was a notable departure from the convolutional networks that had traditionally dominated image recognition, and the papers show that the approach can match or outperform those older methods when trained on large datasets.
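The patch-slicing step can be sketched in a few lines. This is only an illustration written in NumPy, not the repository's JAX code; the function name `image_to_patches`, the 224x224 input size, and the 16-pixel patch size are assumptions chosen for the example.

```python
import numpy as np

def image_to_patches(image, patch_size):
    """Split an (H, W, C) image into a sequence of flattened patches."""
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    rows, cols = h // patch_size, w // patch_size
    # Split both spatial axes: (rows, patch, cols, patch, C).
    patches = image.reshape(rows, patch_size, cols, patch_size, c)
    # Group the two grid axes first: (rows, cols, patch, patch, C).
    patches = patches.transpose(0, 2, 1, 3, 4)
    # One row per patch, each flattened to patch * patch * C values.
    return patches.reshape(rows * cols, patch_size * patch_size * c)

# Hypothetical input: a 224x224 RGB image filled with placeholder values.
image = np.arange(224 * 224 * 3, dtype=np.float32).reshape(224, 224, 3)
tokens = image_to_patches(image, patch_size=16)
print(tokens.shape)  # (196, 768)
```

Each of the 196 rows plays the role of one "word". In a real model, every row would then pass through a learned linear projection and receive a position embedding before entering the Transformer.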
Alongside the Vision Transformer models, the repository also includes MLP-Mixer, a related architecture that uses neither attention nor convolutions. Instead, it relies only on multi-layer perceptrons (MLPs), applied alternately across patches and across feature channels. The repository additionally covers follow-up research on how to train these models more effectively, including what data volumes, augmentation techniques, and regularization strategies produce the best results.
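To make the MLP-Mixer idea concrete, here is a rough NumPy sketch of a single mixing block, assuming the usual two-step layout: one MLP mixes information across patches, a second mixes across channels, and each has a residual connection. All names (`mixer_block`, `init_mlp`) and sizes are hypothetical, and the layer normalization omits its learned scale and bias for brevity. It is not the repository's Flax implementation.

```python
import numpy as np

def gelu(x):
    # tanh approximation of the GELU activation
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def layer_norm(x, eps=1e-6):
    # Normalize over the last (channel) axis; no learned scale/bias in this sketch.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def mlp(x, w1, b1, w2, b2):
    # Two dense layers with a GELU in between, acting on the last axis.
    return gelu(x @ w1 + b1) @ w2 + b2

def init_mlp(rng, dim, hidden):
    # Small random weights and zero biases (illustrative initialization).
    return (rng.normal(scale=0.02, size=(dim, hidden)), np.zeros(hidden),
            rng.normal(scale=0.02, size=(hidden, dim)), np.zeros(dim))

def mixer_block(x, token_mlp, channel_mlp):
    """x has shape (num_patches, channels)."""
    # Token mixing: transpose so the MLP acts along the patch axis.
    y = layer_norm(x).T                       # (channels, num_patches)
    x = x + mlp(y, *token_mlp).T              # residual connection
    # Channel mixing: the MLP acts along the channel axis of each patch.
    x = x + mlp(layer_norm(x), *channel_mlp)  # residual connection
    return x

rng = np.random.default_rng(0)
num_patches, channels = 196, 64               # hypothetical sizes
x = rng.normal(size=(num_patches, channels))
token_mlp = init_mlp(rng, num_patches, 128)
channel_mlp = init_mlp(rng, channels, 256)
out = mixer_block(x, token_mlp, channel_mlp)
print(out.shape)  # (196, 64)
```

Everything here is a matrix multiplication, an element-wise nonlinearity, or a transpose, which is where the architecture's simplicity comes from.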
All the models were pre-trained on large image datasets and are made available for fine-tuning. Fine-tuning means taking one of these pre-trained models and continuing to train it on a smaller, task-specific dataset. The code is written with JAX, a Python library for numerical computing, and Flax, a neural network library built on top of JAX. Both were developed at Google.
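As a toy illustration of what fine-tuning involves, the NumPy sketch below starts from a stand-in "pre-trained" weight matrix, attaches a fresh classification head, and continues gradient descent on a small labeled dataset. Everything in it is an assumption made for the example: the one-layer backbone, the random data, the zero-initialized head, and the function name `fine_tune`. The repository's actual training code uses JAX and Flax and is considerably more involved.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract the row max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fine_tune(backbone_w, x, labels, num_classes, steps=200, lr=0.1):
    """Continue training a 'pre-trained' backbone with a new task-specific head."""
    w = backbone_w.copy()                        # start from the pre-trained weights
    head = np.zeros((w.shape[1], num_classes))   # new head, zero-initialized here
    onehot = np.eye(num_classes)[labels]
    losses = []
    for _ in range(steps):
        feats = np.tanh(x @ w)                   # backbone forward pass
        probs = softmax(feats @ head)            # class probabilities
        losses.append(-np.mean(np.log(probs[np.arange(len(labels)), labels])))
        # Backpropagate the cross-entropy loss through head and backbone.
        d_logits = (probs - onehot) / len(labels)
        d_head = feats.T @ d_logits
        d_feats = d_logits @ head.T
        d_w = x.T @ (d_feats * (1.0 - feats**2))
        head -= lr * d_head
        w -= lr * d_w
    return w, head, losses

rng = np.random.default_rng(0)
backbone_w = rng.normal(scale=0.5, size=(8, 16))  # stand-in for pre-trained weights
x = rng.normal(size=(64, 8))                      # small task-specific dataset
labels = (x[:, 0] > 0).astype(int)                # two hypothetical classes
w, head, losses = fine_tune(backbone_w, x, labels, num_classes=2)
print(losses[0], losses[-1])
```

The point of the sketch is the starting position: the backbone begins from weights that already encode something useful, so only a short training run on the small dataset is needed.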
The repository includes interactive Jupyter notebooks hosted on Google Colab, which let people experiment with the models through a browser without setting up a local environment. For larger training runs, the README walks through setting up a cloud-based virtual machine.
Where it fits
- Fine-tune a pre-trained ViT model on your own image dataset to build a custom image classifier without training from scratch.
- Run the included Colab notebooks to experiment with Vision Transformer inference directly in a browser.
- Compare MLP-Mixer against ViT on your image task to choose the architecture that fits your compute budget.