dvc
🦉 Data Versioning and ML Experiments
DVC adds Git-style version control for large data files, ML models, and experiments to machine learning projects, syncing big files to cloud storage while keeping lightweight pointers in your Git repo.
DVC, short for Data Version Control, is a command-line tool and a VS Code extension that helps people working on machine learning projects keep track of their data, their models, and their experiments. The problem it tackles is that Git is great for code but awkward for large data files; DVC fills that gap by letting you version data the same way you version code.
DVC works roughly as a "Git for data": you keep your code in a normal Git repository and tell DVC to track the larger files, such as images, datasets, and trained models. DVC stores those files in a cache outside of Git and uploads them to a remote of your choice, which can be a major cloud storage service such as S3, Azure, or Google Cloud, or on-premise storage over SSH. In the Git repo it leaves small placeholder files that point at the cached versions. DVC also works like a Makefile for machine learning: you describe pipeline stages that declare which inputs produce which outputs, and when something changes, only the affected stages rerun. Experiment tracking is built in as well; it lives in your local Git repo with no separate server and lets you run many experiments and compare their parameters, metrics, and plots.
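The "Makefile for machine learning" idea can be illustrated with a small sketch. This is not DVC's code or file format; it is a simplified Python illustration, and `run_stage`, the JSON lock file, and the use of SHA-256 are all assumptions made for the example. A stage reruns only when the recorded hash of one of its inputs no longer matches, or when one of its outputs is missing.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Callable, Sequence


def digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def run_stage(
    name: str,
    deps: Sequence[Path],
    outs: Sequence[Path],
    action: Callable[[], object],
    lock_path: Path,
) -> bool:
    """Run `action` only if a dependency changed or an output is missing.

    Returns True if the stage ran, False if it was skipped as up to date.
    The lock file is a hypothetical JSON mapping of stage name to the
    dependency hashes seen on the last successful run.
    """
    lock = json.loads(lock_path.read_text()) if lock_path.exists() else {}
    current = {str(dep): digest(dep) for dep in deps}
    up_to_date = lock.get(name) == current and all(out.exists() for out in outs)
    if up_to_date:
        return False
    action()
    lock[name] = current  # record the inputs this run was built from
    lock_path.write_text(json.dumps(lock, indent=2))
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        raw, clean = root / "raw.txt", root / "clean.txt"
        model, lock = root / "model.txt", root / "pipeline.lock"
        raw.write_text("some raw data")

        def prepare() -> None:
            clean.write_text(raw.read_text().upper())

        def train() -> None:
            model.write_text(str(len(clean.read_text())))

        for attempt in ("first run", "second run"):
            ran_prepare = run_stage("prepare", [raw], [clean], prepare, lock)
            ran_train = run_stage("train", [clean], [model], train, lock)
            print(attempt, "prepare ran:", ran_prepare, "train ran:", ran_train)
```

Because the second stage depends on the first stage's output file, a change to the raw input propagates downstream on its own: the first stage rewrites its output, the output's hash changes, and the second stage is then seen as out of date.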
You would use DVC when your ML project has grown beyond fitting comfortably in Git and you want reproducibility: the ability to share a project so that someone else can recreate any given experiment by pulling code from Git and data from the configured remote. DVC is a Python tool that can be installed through pip, conda, Homebrew, Chocolatey, or snap, with the extension available from the VS Code marketplace and optional extras like dvc-s3 or dvc-azure for specific remotes.
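The reproducibility described above rests on the small placeholder files: a pointer committed to Git is enough to fetch the exact data it refers to. The sketch below illustrates that idea under stated assumptions. It is not DVC's implementation; the `push` and `pull` functions, the `.pointer.json` suffix, the SHA-256 hash, and the use of a local directory as a stand-in for a remote are all hypothetical choices made for the example.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path


def content_hash(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def push(data_file: Path, remote: Path) -> Path:
    """Store `data_file` in `remote` under its hash and write a pointer file.

    The pointer is the only thing that would be committed to Git.
    """
    digest = content_hash(data_file)
    stored = remote / digest[:2] / digest[2:]
    stored.parent.mkdir(parents=True, exist_ok=True)
    if not stored.exists():  # identical content is stored only once
        shutil.copyfile(data_file, stored)
    pointer = data_file.with_name(data_file.name + ".pointer.json")
    pointer.write_text(json.dumps({"path": data_file.name, "sha256": digest}))
    return pointer


def pull(pointer: Path, remote: Path) -> Path:
    """Recreate the data file next to `pointer` from the remote copy."""
    meta = json.loads(pointer.read_text())
    digest = meta["sha256"]
    stored = remote / digest[:2] / digest[2:]
    target = pointer.with_name(meta["path"])
    shutil.copyfile(stored, target)
    if content_hash(target) != digest:
        raise ValueError(f"content of {target} does not match its pointer")
    return target


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        author, teammate, remote = (Path(tmp) / n for n in ("author", "teammate", "remote"))
        for d in (author, teammate, remote):
            d.mkdir()
        dataset = author / "train.csv"
        dataset.write_text("x,y\n1,2\n3,4\n")
        pointer = push(dataset, remote)
        # Simulate a teammate cloning the Git repo: only the pointer travels.
        cloned_pointer = Path(shutil.copy(pointer, teammate))
        restored = pull(cloned_pointer, remote)
        print(restored.read_text() == dataset.read_text())
```

Keying the stored copy by its content hash means each version of a dataset maps to a distinct object, so checking out an older Git commit (and therefore an older pointer) identifies exactly the data that commit was built with.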
Where it fits
- Version large training datasets and trained models the same way you version code, so any past experiment is fully reproducible.
- Share a machine learning project with teammates so they can pull the exact same data version and recreate any experiment from scratch.
- Track and compare multiple experiment runs with parameters, metrics, and plots stored locally in Git, no separate server needed.