pytorch-image-models
The largest collection of PyTorch image encoders / backbones, including train, eval, inference, and export scripts plus pretrained weights: ResNet, ResNeXt, EfficientNet, NFNet, Vision Transformer (ViT), MobileNetV4, MobileNet-V3 & V2, RegNet, DPN, CSPNet, Swin Transformer, MaxViT, CoAtNet, ConvNeXt, and more.
The largest collection of image recognition AI models in one library: load any of 1,000+ pretrained models with one line of Python code. The go-to toolkit for anyone building computer vision projects with PyTorch.
PyTorch Image Models, known as timm, is the largest open-source collection of image recognition model architectures and pretrained weights for the PyTorch deep learning framework. It solves a practical problem in computer vision research and production: researchers and engineers frequently need to swap between dozens of different neural network architectures for image tasks (classification, feature extraction, object detection backbones), and building each from scratch or hunting across separate repositories is time-consuming and error-prone.
The library provides a unified API for loading any supported model, with pretrained weights downloaded automatically from the Hugging Face Hub. It covers over 1,000 pretrained models across architecture families including ResNet, EfficientNet, Vision Transformer (ViT), Swin Transformer, ConvNeXt, MobileNet, and many others. Calling timm.create_model("resnet50", pretrained=True) returns a weight-loaded model ready for training or inference. The key abstraction is that all models share the same interface for feature extraction, so you can use any architecture as a backbone for downstream tasks like object detection or segmentation without rewriting glue code. The library also ships production-quality training scripts, augmentation pipelines, and a suite of optimizers, making it usable as an end-to-end training toolkit rather than just a model zoo.
You would use timm when benchmarking different architectures, fine-tuning a pretrained model on your own dataset, or building a computer vision system that needs a strong image encoder. It is a common starting point in the computer vision research community for reproducing published results. The tech stack is Python with PyTorch as the core dependency, alongside a few supporting packages such as torchvision and huggingface_hub; pretrained weights live on the Hugging Face Hub.
Where it fits
- Load a pretrained image recognition model in one line and start making predictions immediately
- Fine-tune a powerful pretrained model on your own image dataset without building it from scratch
- Benchmark 10+ different neural network architectures on your data to find the best one
- Use any model as an image feature extractor backbone inside a larger object detection pipeline