datafusion

Rust ★ 8.9k updated 2h ago

Apache DataFusion SQL Query Engine

Apache DataFusion is a fast, embeddable SQL and DataFrame query engine written in Rust that lets developers build database tools and data pipelines without writing query execution from scratch.

Rust · Python · Apache Arrow · Parquet · SQL · setup: moderate · complexity 4/5

Apache DataFusion is a query engine: software that lets you run SQL queries or DataFrame-style operations against data stored in files such as CSV, Parquet, JSON, and Avro. It is written in Rust and is designed to process data quickly by operating on batches of column-oriented data (vectorized execution) and spreading the work across many CPU threads.

DataFusion is not a finished end-user product. It is aimed at developers who want to build their own database tools, data pipelines, or custom query systems. You bring your data and your application, and DataFusion provides the core machinery for parsing queries, planning how to execute them, and running them efficiently. You can extend it with your own data sources, functions, and operators.

Two related projects make DataFusion more accessible without coding in Rust. DataFusion Python provides a Python interface so you can run SQL or DataFrame queries from Python scripts. DataFusion Comet is a plugin for Apache Spark that uses DataFusion to speed up Spark jobs.

Out of the box, DataFusion includes a full SQL parser, support for common file formats, date and time functions, cryptographic functions, regular expression functions, and Unicode handling. Many of these are optional Cargo features that can be turned on or off at build time, depending on what your project needs.

The project is part of the Apache Software Foundation and follows Apache governance. It has an active community, documentation on its website, and a Discord channel for discussion. The README links to getting-started guides for both Rust developers and Python users.

Where it fits

Build a custom query engine in Rust that runs SQL against CSV, Parquet, or JSON files without a database server.
Run fast SQL queries on local data files from a Python script using the DataFusion Python bindings.
Speed up Apache Spark jobs by using DataFusion Comet as a drop-in execution plugin.
Create a data pipeline that reads large Parquet files, filters and aggregates them with SQL, and writes results efficiently.
