modin

Python ★ 10k updated 4mo ago

Modin: Scale your pandas workflows by changing a single line of code

A drop-in replacement for pandas that speeds up data analysis by using all CPU cores. Change one import line and your existing scripts run faster on large datasets without any other modifications.

Tags: Python, pandas, Ray, Dask. Setup: easy. Complexity: 2/5.

Modin is a Python library that speeds up data analysis code written for pandas, without requiring you to rewrite anything. Pandas is the standard Python tool for working with tables of data (spreadsheets, CSVs, databases), but it only uses a single processor core, which becomes a problem when datasets grow large. Modin fixes this by distributing the work across all available cores on your machine.

The change required to use Modin is one line: replace the pandas import statement (import pandas as pd) with Modin's equivalent (import modin.pandas as pd). Function calls, column names, and results are meant to stay the same. Existing notebooks and scripts continue to work as before, but often run significantly faster, especially on files that are a gigabyte or larger.
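As a rough illustration of the one-line swap, the sketch below imports Modin under the usual pd alias and then runs ordinary pandas-style code. The fallback to plain pandas, the tiny sample table, and the column names city and sales are assumptions made for this example only; they are not part of the project's own quickstart.

```python
# The one-line change: import Modin's pandas module under the usual alias.
# The fallback to plain pandas is an assumption added for this sketch so the
# script still runs on machines where Modin is not installed.
try:
    import modin.pandas as pd
except ImportError:
    import pandas as pd

# Everything below is ordinary pandas-style code (hypothetical sample data).
df = pd.DataFrame(
    {
        "city": ["Oslo", "Lima", "Oslo", "Lima"],
        "sales": [5, 10, 15, 30],
    }
)

# Row filtering: keep orders with sales above 5.
big_orders = df[df["sales"] > 5]

# Column aggregation: total sales per city, converted to a plain dict.
totals = df.groupby("city")["sales"].sum().to_dict()

print(len(big_orders))
print(totals)
```

On a table this small the parallel engine's start-up cost outweighs any gain; the point is only that the code after the import line is unchanged.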

Behind the scenes, Modin can use different computation systems to parallelize the work. The supported options are Ray, Dask, and MPI (via a package called unidist). You can let Modin detect which one is installed automatically, or set the MODIN_ENGINE environment variable to pick one explicitly. Most users start with the Ray backend, which is the most commonly tested option.
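Picking an engine explicitly could look like the sketch below. MODIN_ENGINE and UNIDIST_BACKEND are the environment variables Modin and unidist read; the select_engine helper and the SUPPORTED_ENGINES tuple are hypothetical names introduced here for illustration, and the sketch assumes the variable is set before Modin is first imported.

```python
import os

# Engine names accepted in MODIN_ENGINE; "unidist" is the route to MPI.
SUPPORTED_ENGINES = ("ray", "dask", "unidist")


def select_engine(name):
    """Hypothetical helper: validate an engine name and export it for Modin."""
    engine = name.strip().lower()
    if engine not in SUPPORTED_ENGINES:
        raise ValueError(
            f"unknown engine {name!r}; expected one of {SUPPORTED_ENGINES}"
        )
    os.environ["MODIN_ENGINE"] = engine
    if engine == "unidist":
        # unidist needs its own backend setting to run on MPI.
        os.environ["UNIDIST_BACKEND"] = "mpi"
    return engine


chosen = select_engine("ray")
print(chosen)

# Import Modin only after the variable is set (left commented out so the
# sketch runs without Modin installed):
# import modin.pandas as pd
```

The same effect can be had from the shell by exporting MODIN_ENGINE before launching Python; leaving it unset keeps the automatic detection described above.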

Modin is particularly useful when pandas slows to a crawl or runs out of memory on large files. It includes options for processing data that does not fit entirely in RAM by spilling to disk when needed. The project notes that speedups are most visible on operations like reading files, filtering rows, and aggregating columns across large datasets.

Installation is through pip or conda on Linux, Windows, and macOS. Full documentation and a quickstart guide are available at modin.readthedocs.io. The project has an active Slack community and is available as a package on PyPI.

Where it fits

Speed up an existing pandas data cleaning script on large CSV files without rewriting any code.
Process datasets too large to fit in RAM using Modin's out-of-core mode that spills to disk.
Parallelize row filtering and column aggregation on multi-gigabyte data files across all available CPU cores.
