vaex

Python ★ 8.5k updated 2mo ago

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀

A Python data analysis library that lets you filter, aggregate, and visualize datasets with hundreds of millions or billions of rows on a standard laptop by reading data from disk lazily instead of loading it all into memory.

Python · HDF5 · Apache Arrow · Amazon S3 · Jupyter · setup: moderate · complexity 3/5

Vaex is a Python library for working with very large datasets, in the range of hundreds of millions or billions of rows, without running out of memory. Most Python data tools load the entire dataset into RAM before doing anything with it, which becomes impractical when a file is larger than the memory available. Vaex sidesteps this by combining two techniques: memory mapping, which lets the operating system page data in from disk on demand, and lazy evaluation, which defers reading until a calculation actually requires the data.
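
The idea can be illustrated with Python's standard library alone. The following is a minimal sketch, not Vaex's own implementation: the file name `column.bin`, the helper `lazy_sum`, and the float64 layout are all assumptions made for the example. It maps a file of numbers and reads only the bytes that a requested row range occupies.

```python
import mmap
import os
import struct
import tempfile

# Write 1,000 float64 values to a scratch file (a stand-in for a large column on disk).
path = os.path.join(tempfile.mkdtemp(), "column.bin")
with open(path, "wb") as f:
    for i in range(1000):
        f.write(struct.pack("<d", float(i)))


def lazy_sum(path, start, stop):
    """Sum rows [start, stop) while touching only the bytes those rows occupy."""
    with open(path, "rb") as f:
        # Mapping the file does not read it; pages are loaded when they are accessed.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            total = 0.0
            # Each float64 is 8 bytes, so the slice covers exactly the requested rows.
            for (value,) in struct.iter_unpack("<d", mm[start * 8 : stop * 8]):
                total += value
            return total


result = lazy_sum(path, 10, 20)
print(result)
```

Nothing is read from the file until `lazy_sum` is called, and even then only 80 of the 8,000 bytes are copied out of the mapping.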

The library provides a DataFrame interface similar to Pandas, a widely used Python data tool, but designed from the ground up for scale. You can filter rows, create new calculated columns, and run statistical aggregations across enormous files while the data itself stays on disk. Operations like grouping rows by category or joining two tables are parallelized to run on multiple processor cores at once, which is how the library reaches the billion-rows-per-second figures cited in the README.

Vaex supports reading files in HDF5 and Apache Arrow formats, and can stream data directly from cloud storage on Amazon S3. For visualization, it includes histogram and density plot tools that work interactively inside Jupyter notebooks, letting analysts explore billion-row datasets in a browser without waiting for slow full-data loads. It also integrates with machine learning workflows, allowing feature transformations to be applied lazily so nothing gets materialized into memory until training begins.
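
One way to see why a histogram of a huge column can stay cheap to draw: the plot only needs per-bin counts, and those can be accumulated one chunk at a time. The sketch below is a plain-Python illustration of that idea, not Vaex's implementation; the function name `binned_counts` and the sample chunks are hypothetical.

```python
def binned_counts(chunks, lo, hi, n_bins):
    """Accumulate histogram counts over an iterable of chunks, one chunk at a time."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for chunk in chunks:
        for value in chunk:
            if lo <= value < hi:
                # Clamp to guard against floating-point rounding at the upper edge.
                index = min(int((value - lo) / width), n_bins - 1)
                counts[index] += 1
    return counts


# Three small chunks stand in for successive blocks read from disk.
chunks = ([0.5, 1.5, 2.5], [2.9, 3.1, 9.0], [0.1])
hist = binned_counts(chunks, 0.0, 4.0, 4)
print(hist)
```

However many rows pass through, the result handed to the plotting layer is a fixed-size list of `n_bins` integers.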

Vaex can be installed with pip or conda, the two standard Python package managers. The project positions it as a tool for standard laptops and desktops, not just cloud clusters. The README links to several external articles: benchmarks comparing Vaex against other big-data Python tools, and walkthroughs of specific use cases including flight data analysis and text processing.

Where it fits

Analyze a multi-gigabyte HDF5 or Arrow file with billions of rows on a regular laptop without running out of RAM.
Run fast groupby and statistical aggregations across massive datasets using all available CPU cores in parallel.
Explore huge datasets interactively in a Jupyter notebook using histogram and density plots that render without full data loads.
Apply feature transformations lazily for machine learning pipelines on large datasets before training begins.
