datasets

Python ★ 4.6k updated 15d ago

TFDS is a collection of datasets ready to use with TensorFlow, Jax, ...

TensorFlow Datasets is a Python library that gives machine learning practitioners easy access to hundreds of public datasets in a consistent format. Instead of writing custom code to download, parse, and prepare each dataset, you call a single function with the dataset name and get back a ready-to-iterate data pipeline.

The library is part of the TensorFlow ecosystem, but it also works with JAX and NumPy. A short code example in the README loads the MNIST handwritten digit dataset in a few lines, then applies shuffling, batching, and prefetching before looping through the data. Shuffling randomizes the order of examples, batching groups them into fixed-size chunks, and prefetching prepares upcoming batches while the current one is being processed. The library is designed to follow performance best practices so the data pipeline does not become a bottleneck during model training.
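The pipeline described above can be sketched roughly as follows. This is an illustration in the spirit of the README example, not a copy of it: the function name `mnist_batches` and its default values are hypothetical, while `tfds.load`, `tfds.as_numpy`, and the `tf.data` methods are existing library calls. It assumes `tensorflow` and `tensorflow_datasets` are installed and that the machine can download MNIST on first use.

```python
def mnist_batches(batch_size=128, shuffle_buffer=1000, num_batches=5):
    """Yield a few (images, labels) batches from the MNIST training split.

    Hypothetical helper for illustration; the defaults are arbitrary choices.
    """
    # Imported here so the heavy dependencies load only when data is requested.
    import tensorflow as tf
    import tensorflow_datasets as tfds

    # The first call downloads and prepares MNIST locally; later calls reuse it.
    ds = tfds.load("mnist", split="train", as_supervised=True, shuffle_files=True)

    # Shuffle examples, group them into batches, and prepare upcoming
    # batches in the background while the current one is consumed.
    ds = ds.shuffle(shuffle_buffer).batch(batch_size).prefetch(tf.data.AUTOTUNE)

    # tfds.as_numpy converts tensors to NumPy arrays, which is convenient
    # when the consumer is JAX or plain NumPy code rather than TensorFlow.
    yield from tfds.as_numpy(ds.take(num_batches))
```

Nothing is downloaded until the generator is iterated, for example with `for images, labels in mnist_batches(): ...`. With the defaults above, each `images` array should have shape `(128, 28, 28, 1)`.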

A key design goal is reproducibility: every user who loads the same dataset with the same settings gets the same examples in the same order. This matters for comparing experiments across machines or teams.
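As a minimal sketch of what "the same settings" might look like in practice, the example below fixes the seeds that influence ordering. The text above does not say which settings matter, so this is an assumption: the helper name `load_reproducible` and the seed value are hypothetical, while `tfds.ReadConfig(shuffle_seed=...)` and the `seed` argument of `tf.data.Dataset.shuffle` are existing options.

```python
def load_reproducible(name="mnist", split="train", seed=42, shuffle_buffer=1000):
    """Load a dataset with fixed seeds so the example order can be repeated.

    Hypothetical helper; assumes tensorflow_datasets is installed.
    """
    import tensorflow_datasets as tfds

    # Fix the seed used when the underlying data files are shuffled.
    read_config = tfds.ReadConfig(shuffle_seed=seed)
    ds = tfds.load(
        name,
        split=split,
        shuffle_files=True,
        read_config=read_config,
        as_supervised=True,
    )

    # Fix the seed of the example-level shuffle as well, and keep the same
    # order on every pass instead of reshuffling each epoch.
    return ds.shuffle(shuffle_buffer, seed=seed, reshuffle_each_iteration=False)
```

Two people calling `load_reproducible()` with the same arguments and the same dataset version would be expected to see the same order, although other factors such as library versions can still play a role.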

The library does not host the underlying datasets itself. It downloads them from their original sources and prepares them locally. The README is clear that users are responsible for checking whether they have rights to use a given dataset under its own license.

If a dataset you need is not in the catalog, the project has a guide for adding one, and there is a GitHub issue tracker where you can request datasets and vote on existing requests. Documentation, including a full catalog of available datasets, lives at tensorflow.org/datasets. The library is licensed under Apache 2.0.
