kedro

Python ★ 11k updated 1d ago

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.

Python framework that turns messy data science notebooks into organized, reusable pipelines with a standard project layout, a config-driven data catalog, and automatic step ordering based on how your functions connect.

Python, Jupyter, Argo, Prefect, Kubeflow, AWS Batch, Databricks | setup: moderate | complexity 3/5

Kedro is an open-source Python framework for building data engineering and data science pipelines in a structured, reusable way. It was created to address the common problem that data science work often starts as messy Jupyter notebooks or one-off scripts that become hard to maintain, share, or move into production. Kedro brings software engineering practices to data work so that pipelines are easier to understand, test, and reuse across a team.

The main building blocks Kedro provides are a project template, a data catalog, and a pipeline abstraction. The project template gives you a standard folder structure so that new projects start consistently. The data catalog is a configuration-driven layer for connecting to data sources and destinations, including local files, cloud storage, and databases, in a variety of formats, without scattering connection details through your code. The pipeline abstraction lets you write your data processing steps as ordinary Python functions and declare the named inputs and outputs of each one. Kedro resolves the execution order automatically from those dependencies.
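The automatic ordering described above can be illustrated without Kedro itself. The following is a minimal sketch in plain Python, not Kedro's implementation: the `Node` dataclass, the `resolve_order` and `run` helpers, and the dataset names are all hypothetical, and each step is limited to a single output for brevity. Every step declares the names of its inputs and its output, and a topological sort over those names yields a valid execution order.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter
from typing import Any, Callable


@dataclass(frozen=True)
class Node:
    """Hypothetical stand-in for a pipeline step (not Kedro's own class)."""
    func: Callable[..., Any]
    inputs: tuple[str, ...]   # names of the datasets the function consumes
    output: str               # name of the dataset the function produces
    name: str


def resolve_order(nodes: list[Node]) -> list[Node]:
    """Sort nodes so each one runs after the nodes that produce its inputs."""
    by_name = {n.name: n for n in nodes}
    producer = {n.output: n.name for n in nodes}
    # Map each node name to the names of the nodes it depends on.
    # Inputs with no producer (e.g. raw data) are supplied from outside.
    graph = {
        n.name: {producer[i] for i in n.inputs if i in producer}
        for n in nodes
    }
    return [by_name[name] for name in TopologicalSorter(graph).static_order()]


def run(nodes: list[Node], initial_data: dict[str, Any]) -> dict[str, Any]:
    """Execute the nodes in dependency order, passing data by dataset name."""
    data = dict(initial_data)
    for n in resolve_order(nodes):
        data[n.output] = n.func(*(data[i] for i in n.inputs))
    return data


# Ordinary Python functions; they know nothing about the pipeline.
def clean(raw):
    return [x for x in raw if x is not None]


def total(cleaned):
    return sum(cleaned)


def report(cleaned, total_value):
    return f"{len(cleaned)} values, total {total_value}"


# Deliberately listed out of order: the dependencies decide the order.
nodes = [
    Node(report, ("cleaned", "total_value"), "summary", "report"),
    Node(total, ("cleaned",), "total_value", "total"),
    Node(clean, ("raw",), "cleaned", "clean"),
]

order = [n.name for n in resolve_order(nodes)]
results = run(nodes, {"raw": [1, None, 2, 3]})
print(order)               # ['clean', 'total', 'report']
print(results["summary"])  # 3 values, total 6
```

Because the order is derived from the declared names, adding a step means declaring what it reads and writes rather than editing a hand-maintained sequence.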

Kedro also includes an optional visualization tool called Kedro-Viz that generates an interactive diagram of your pipeline, showing how data flows between steps. This can be useful for communicating what a pipeline does to teammates who did not write it.

On the deployment side, Kedro supports running pipelines on a single machine or distributed across multiple machines. It supports deployment on platforms including Argo, Prefect, Kubeflow, AWS Batch, and Databricks.

Kedro is hosted by the LF AI and Data Foundation, an organization that provides neutral governance for open-source AI and data projects. The code is released under the Apache 2.0 license. It supports Python 3.10 through 3.14 and can be installed via pip or conda in a few commands.

The README describes the project as growing out of real-world experience building machine-learning applications on large, messy datasets, and out of the problems that surfaced when doing that work in teams.

Where it fits

Refactor a one-off Jupyter notebook data pipeline into a testable, team-shareable Kedro project.
Connect a pipeline to multiple data sources (local CSVs, S3, databases) without hardcoding paths in your code.
Visualize how data flows between pipeline steps using Kedro-Viz to explain the process to non-technical teammates.
Deploy the same pipeline to AWS Batch or Databricks by changing how it is run, without rewriting the pipeline logic.
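
The second use case above, keeping paths out of pipeline code, rests on the catalog idea: code asks for a dataset by name, and configuration decides where and how it is stored. Here is a minimal sketch of that idea in plain Python, assuming a hypothetical `MiniCatalog` class, a hypothetical `HANDLERS` registry, and made-up dataset names. It is not Kedro's API; Kedro's own catalog is typically configured in a YAML file and covers far more dataset types than the two shown here.

```python
import csv
import json
import tempfile
from pathlib import Path
from typing import Any


def _load_csv(path: Path) -> list[dict[str, str]]:
    with path.open(newline="") as f:
        return list(csv.DictReader(f))


def _save_csv(path: Path, rows: list[dict[str, str]]) -> None:
    with path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)


def _load_json(path: Path) -> Any:
    return json.loads(path.read_text())


def _save_json(path: Path, data: Any) -> None:
    path.write_text(json.dumps(data))


# Hypothetical registry: dataset type -> (load function, save function).
HANDLERS = {
    "csv": (_load_csv, _save_csv),
    "json": (_load_json, _save_json),
}


class MiniCatalog:
    """Illustrative catalog: resolves dataset names to storage via config."""

    def __init__(self, config: dict[str, dict[str, str]]) -> None:
        self._config = config

    def load(self, name: str) -> Any:
        entry = self._config[name]
        load, _ = HANDLERS[entry["type"]]
        return load(Path(entry["filepath"]))

    def save(self, name: str, data: Any) -> None:
        entry = self._config[name]
        _, save = HANDLERS[entry["type"]]
        save(Path(entry["filepath"]), data)


# In a real project this mapping would live in a config file, not in code.
workdir = Path(tempfile.mkdtemp())
catalog = MiniCatalog({
    "orders": {"type": "csv", "filepath": str(workdir / "orders.csv")},
    "order_totals": {"type": "json", "filepath": str(workdir / "totals.json")},
})

# Seed some example data so the sketch runs on its own.
catalog.save("orders", [
    {"id": "1", "amount": "9.50"},
    {"id": "2", "amount": "0.50"},
])

# Pipeline code refers only to dataset names, never to paths or formats.
orders = catalog.load("orders")
catalog.save("order_totals", {"total": sum(float(r["amount"]) for r in orders)})
totals = catalog.load("order_totals")
print(totals)  # {'total': 10.0}
```

Pointing `orders` at a different location or format would mean editing only its config entry; the code that loads and saves by name stays the same.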
