dinov3

Jupyter Notebook ★ 11k updated 4d ago

Reference PyTorch implementation and models for DINOv3

DINOv3 is a family of vision models from Meta's research team (FAIR) that are trained to understand images without being told what is in them. Instead of learning from labeled data, where a human says "this is a cat," DINOv3 uses a self-supervised approach: it trains by comparing different views of the same image and learning to produce consistent, detailed features for them. The result is a model that generates rich representations of image content that can be applied to many different vision tasks.
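To make the "comparing different views" idea concrete, here is a minimal sketch of a DINO-style consistency loss, where a "teacher" output for one view serves as the target for a "student" output on another view. This is a simplification under stated assumptions, not DINOv3's actual training objective: the function names, the temperature values, and the toy numbers are illustrative, and the real recipe includes further components not shown here.

```python
import math

def log_softmax(logits, temperature):
    """Temperature-scaled log-softmax over a list of floats."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    log_total = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_total for x in scaled]

def cross_view_loss(student_logits, teacher_logits,
                    student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between teacher and student output distributions.

    The teacher sees one view of an image and the student sees another.
    The loss is small when both put their probability on the same outputs.
    Temperature defaults are illustrative assumptions.
    """
    teacher_probs = [math.exp(lp) for lp in log_softmax(teacher_logits, teacher_temp)]
    student_log_probs = log_softmax(student_logits, student_temp)
    return -sum(t * ls for t, ls in zip(teacher_probs, student_log_probs))

# Toy outputs: two views that agree, and two views that disagree.
agree = cross_view_loss([2.0, 0.5, 0.1], [2.0, 0.5, 0.1])
disagree = cross_view_loss([0.1, 0.5, 2.0], [2.0, 0.5, 0.1])
print(f"agreeing views: {agree:.4f}, disagreeing views: {disagree:.4f}")
```

Minimizing a loss of this shape pushes the model toward giving the same output for different crops or augmentations of one image, which is what lets it learn without labels.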

The models are described as vision foundation models, meaning they are general-purpose backbones you can use as a starting point for more specific tasks. The README demonstrates their use for image classification, object segmentation (identifying which pixels belong to which object), monocular depth estimation (estimating how far away things are from a single photo), and mapping tree canopy height from satellite imagery. The key claim is that these representations are good enough to be useful across all of these tasks without heavy fine-tuning of the model for each one.
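One way to picture "useful without heavy fine-tuning" is a classifier that never touches the backbone and only compares its output features. The sketch below is a nearest-class-mean classifier over fixed feature vectors. The 3-dimensional vectors and labels are made up for illustration and stand in for real backbone outputs, and the helper names are hypothetical rather than part of the repository.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def class_means(features, labels):
    """Average the feature vectors of each class."""
    sums, counts = {}, {}
    for vec, label in zip(features, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total]
            for label, total in sums.items()}

def predict(feature, means):
    """Return the label whose class mean is most similar to the feature."""
    return max(means, key=lambda label: cosine(feature, means[label]))

# Made-up vectors standing in for frozen backbone features.
train_features = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1],
                  [0.1, 0.9, 0.2], [0.0, 0.8, 0.3]]
train_labels = ["cat", "cat", "dog", "dog"]

means = class_means(train_features, train_labels)
prediction = predict([0.85, 0.15, 0.05], means)
print(prediction)
```

Nothing in this procedure updates the feature extractor, so the quality of the result depends entirely on how well the features already separate the classes.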

The repository provides pretrained model weights in several sizes, ranging from 21 million parameters up to about 7 billion parameters. Smaller models are faster and cheaper to run; larger ones tend to produce better results. Models were trained on two datasets: a large web-scale image collection and a satellite imagery dataset. Both ViT (Vision Transformer) and ConvNeXt architectures are available.
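A rough back-of-the-envelope calculation shows why size matters for cost. The sketch below estimates the memory needed for the weights alone, assuming 4 bytes per parameter in float32 and 2 bytes in float16. It ignores activations and any framework overhead, and the 7 billion figure is the approximate count quoted above.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

sizes = {"smallest (21M)": 21_000_000, "largest (~7B)": 7_000_000_000}

for name, num_params in sizes.items():
    fp32 = weight_memory_gb(num_params, 4)  # float32: 4 bytes per parameter
    fp16 = weight_memory_gb(num_params, 2)  # float16: 2 bytes per parameter
    print(f"{name}: {fp32:.3f} GB in float32, {fp16:.3f} GB in float16")
```

Under these assumptions the smallest model needs well under a tenth of a gigabyte for its weights, while the largest needs about 28 GB in float32 or 14 GB in float16.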

Model weights are available through Meta's own download portal (which requires accepting terms) and through Hugging Face Hub. The models are also integrated into the Hugging Face Transformers library and the timm library, which makes them accessible through standard tooling in the research community.
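As an illustration of the Transformers route, the sketch below loads a checkpoint with the generic AutoImageProcessor and AutoModel classes and returns one feature vector per image. Several things here are assumptions rather than facts from the README: the default checkpoint name, that a ViT checkpoint is used, that the first output token is the global (CLS) token, and that your installed Transformers version is recent enough to include DINOv3. Access to the weights may also require accepting the license terms on the Hub first.

```python
def embed_image(image_path, model_id="facebook/dinov3-vits16-pretrain-lvd1689m"):
    """Return a global feature vector for one image.

    model_id is an ASSUMED checkpoint name; check the Hub for the exact one.
    """
    # Imported lazily so the function can be defined without the heavy dependencies.
    import torch
    from PIL import Image
    from transformers import AutoImageProcessor, AutoModel

    processor = AutoImageProcessor.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    model.eval()  # inference only, the backbone stays frozen

    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")

    with torch.no_grad():
        outputs = model(**inputs)

    # Assumption: for ViT checkpoints, token 0 of last_hidden_state is the CLS token.
    return outputs.last_hidden_state[:, 0]
```

A call such as embed_image("photo.jpg") would download the weights on first use. The path is a placeholder for any local image file.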
