deequ

Scala ★ 3.6k updated 3d ago

Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.

Deequ is a library from AWS Labs that lets data engineers write quality checks for large datasets, in the same spirit that programmers write unit tests for code. If you have a table of customer records or product listings and want to verify that certain columns are never empty, that IDs are always unique, or that numeric fields never go negative, Deequ gives you a structured way to express those rules and check them automatically.

The library runs on top of Apache Spark, a distributed engine for processing very large amounts of data across a cluster of machines. That means Deequ can handle datasets with billions of rows that live in data warehouses or distributed file systems, not just small local files. You feed it tabular data, such as CSV files, database tables, or flattened JSON, loaded as a Spark DataFrame, and it translates your quality rules into Spark jobs that scan the data and report back.

The workflow is straightforward. You define a set of constraints, for example that a column should be at least 95% filled in, that a field should only contain certain allowed values, or that the median of a numeric column (which Deequ can estimate with an approximate quantile) should fall within a certain range. Deequ computes the relevant metric for each constraint, checks it against the actual data, and tells you which rules were violated along with the measured value. If 80% of a column is filled when you expected 100%, you see that number. You can then quarantine or fix bad records before they reach downstream applications or machine learning models.
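The workflow above can be sketched roughly as follows. This is a minimal illustration, not a reference setup: it assumes Deequ and a matching Spark version are on the classpath, it is written as a script in the style of a `spark-shell` session, and the table, column names, and thresholds are all invented for the example. The `priority` column is deliberately only 80% filled so that the completeness rule fails.

```scala
import org.apache.spark.sql.SparkSession
import com.amazon.deequ.VerificationSuite
import com.amazon.deequ.checks.{Check, CheckLevel, CheckStatus}
import com.amazon.deequ.constraints.ConstraintStatus

// Local session for experimentation; in spark-shell this reuses the existing one.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("deequ-constraints-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample data: one of five rows has no priority (80% complete).
val products = Seq(
  (1L, "Thingy A", Option("high"), 0L),
  (2L, "Thingy B", Option.empty[String], 5L),
  (3L, "Thingy C", Option("low"), 3L),
  (4L, "Thingy D", Option("low"), 12L),
  (5L, "Thingy E", Option("high"), 7L)
).toDF("id", "productName", "priority", "numViews")

val result = VerificationSuite()
  .onData(products)
  .addCheck(
    Check(CheckLevel.Error, "product checks")
      .isComplete("id")                                 // never empty
      .isUnique("id")                                   // no duplicate IDs
      .hasCompleteness("priority", _ >= 0.95)           // at least 95% filled in
      .isContainedIn("priority", Array("high", "low"))  // allowed values only
      .isNonNegative("numViews")                        // no negative counts
      .hasApproxQuantile("numViews", 0.5, _ <= 10))     // approximate median at most 10
  .run()

// Collect every constraint that did not pass, across all checks.
val failures = result.checkResults
  .flatMap { case (_, checkResult) => checkResult.constraintResults }
  .filter(_.status != ConstraintStatus.Success)
  .toSeq

if (result.status != CheckStatus.Success) {
  // Each message includes the measured value that missed the requirement.
  failures.foreach { r =>
    println(s"${r.constraint}: ${r.message.getOrElse("no message")}")
  }
}
```

Using `CheckLevel.Error` here makes a failed constraint turn the overall status into an error; `CheckLevel.Warning` is the milder alternative when a rule should be reported without blocking a pipeline.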

Beyond one-off checks, the library includes tools for tracking metrics over time so you can spot when data quality starts to drift, and a data profiling mode that automatically summarizes what a dataset looks like without you needing to specify rules in advance. Python developers can access the same functionality through PyDeequ, a separate package that wraps this library.
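Both features can be sketched briefly. The example below is an illustration under assumptions rather than a recommended configuration: it assumes Deequ and a matching Spark version are on the classpath, uses invented sample data and tag names, and stores metrics in an in-memory repository, which is only suitable for experimentation since nothing is persisted between runs.

```scala
import org.apache.spark.sql.SparkSession
import com.amazon.deequ.VerificationSuite
import com.amazon.deequ.checks.{Check, CheckLevel}
import com.amazon.deequ.profiles.ColumnProfilerRunner
import com.amazon.deequ.repository.ResultKey
import com.amazon.deequ.repository.memory.InMemoryMetricsRepository

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("deequ-profiling-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample data; the priority column is 80% filled.
val products = Seq(
  (1L, Option("high"), 0L),
  (2L, Option.empty[String], 5L),
  (3L, Option("low"), 3L),
  (4L, Option("low"), 12L),
  (5L, Option("high"), 7L)
).toDF("id", "priority", "numViews")

// Profiling: summarize every column without declaring any rules first.
val profiling = ColumnProfilerRunner()
  .onData(products)
  .run()

profiling.profiles.foreach { case (column, profile) =>
  println(s"$column: completeness=${profile.completeness}, " +
    s"approxDistinct=${profile.approximateNumDistinctValues}, " +
    s"type=${profile.dataType}")
}

// Metric tracking: store the metrics of each run under a timestamped key.
val repository = new InMemoryMetricsRepository()
val key = ResultKey(System.currentTimeMillis(), Map("dataset" -> "products"))

VerificationSuite()
  .onData(products)
  .addCheck(
    Check(CheckLevel.Warning, "tracked checks")
      .hasCompleteness("priority", _ >= 0.95)
      .isUnique("id"))
  .useRepository(repository)
  .saveOrAppendResult(key)
  .run()

// Load everything stored so far as a DataFrame, one row per metric per run.
val history = repository.load().getSuccessMetricsAsDataFrame(spark)
history.show(truncate = false)
```

Running the verification step repeatedly with a fresh `ResultKey` each time builds up a history that can be compared from run to run, which is the basis for spotting drift.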

Deequ is aimed at data engineers and analysts who work in Spark environments and want a repeatable, automated way to catch data problems early rather than discovering them after a pipeline has already delivered bad results.
