airflow
Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
Apache Airflow is a Python platform for scheduling and monitoring automated multi-step workflows in code, ideal for running nightly data pipelines, machine learning jobs, or any batch process with ordered steps that must be reliable.
Apache Airflow is a platform for defining, scheduling, and monitoring automated workflows: sequences of tasks that run in a specific order, on a schedule, and often depend on one another. Think of it as a job scheduler that lets you describe a pipeline of work in code rather than through a graphical tool or a rigid configuration file. The classic use case is data engineering: for example, every night at 2 AM, pull data from a database, clean it up, load it into a warehouse, and send a summary report, all as a chain of steps that Airflow manages automatically.
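To make the "every night at 2 AM" schedule concrete, here is a minimal sketch in plain Python of how the next run time for such a schedule can be worked out. It does not use Airflow itself. The function name next_nightly_run and the constant NIGHTLY_AT_2AM are hypothetical names chosen for illustration, and the cron string is shown only as the conventional way to write "02:00 every day".

```python
from datetime import datetime, timedelta

# Conventional cron notation for "minute 0, hour 2, every day".
# Shown for reference only; this sketch does not parse it.
NIGHTLY_AT_2AM = "0 2 * * *"


def next_nightly_run(now: datetime, hour: int = 2) -> datetime:
    """Return the first moment strictly after `now` that falls on hour:00."""
    candidate = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    if candidate <= now:
        # Today's slot has already passed (or is exactly now), so use tomorrow's.
        candidate += timedelta(days=1)
    return candidate


# Hypothetical usage: at 01:30 the next run is the same day at 02:00.
upcoming = next_nightly_run(datetime(2024, 1, 1, 1, 30))
```

A real scheduler also has to handle time zones and missed runs, which this sketch leaves out.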
The central concept in Airflow is the DAG, which stands for Directed Acyclic Graph: a set of tasks linked by one-way dependencies that never loop back on themselves. You define a DAG in a Python file that declares which tasks exist and in what order they must run. Airflow reads these files, resolves the dependencies between tasks, and runs them on a pool of worker processes or machines. If a task fails, Airflow marks it as failed and, depending on configuration, can retry it, alert you, and hold back downstream steps. A built-in web interface lets you visualize your pipelines as flow diagrams, inspect logs, manually trigger runs, and backfill historical data, that is, re-run a workflow as if it were running on a past date.
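The dependency handling described above can be sketched with nothing but the Python standard library (graphlib, Python 3.9+). This is not Airflow's API or its scheduler; it is a toy runner, under the assumption that each task is a plain function, showing tasks running in dependency order, a failed task being retried, and downstream tasks being held back. The names run_pipeline, extract, clean, load, and report are hypothetical, and the state labels are chosen for illustration.

```python
from graphlib import TopologicalSorter


def run_pipeline(tasks, upstream, max_retries=1):
    """Run callables in dependency order and return {task name: state}.

    tasks:    maps each task name to a zero-argument callable.
    upstream: maps a task name to the set of tasks it depends on.
    """
    states = {}
    # static_order() yields a task only after all of its upstream tasks.
    for name in TopologicalSorter(upstream).static_order():
        if any(states[dep] != "success" for dep in upstream.get(name, ())):
            # Something upstream did not succeed, so do not run this task.
            states[name] = "upstream_failed"
            continue
        for _attempt in range(max_retries + 1):
            try:
                tasks[name]()
                states[name] = "success"
                break
            except Exception:
                # Record the failure; the loop retries if attempts remain.
                states[name] = "failed"
    return states


# Hypothetical four-step pipeline in which the cleaning step always fails.
def extract(): pass
def clean(): raise ValueError("bad row")
def load(): pass
def report(): pass

tasks = {"extract": extract, "clean": clean, "load": load, "report": report}
upstream = {"clean": {"extract"}, "load": {"clean"}, "report": {"load"}}
states = run_pipeline(tasks, upstream)
```

Here extract succeeds, clean fails after its retry, and load and report are never started because a task upstream of them did not succeed.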
You would use Airflow when you have repetitive multi-step processes that need to be reliable, visible, and easy to version-control alongside your code. It fits data teams that need to orchestrate ETL (extract, transform, load) pipelines, machine learning training jobs, or any batch process with dependencies. The tech stack is Python throughout, with a web UI built on Flask (in the Airflow 2.x releases), and the platform runs on any infrastructure from a single server to Kubernetes clusters. It is installed via pip from PyPI as the apache-airflow package.
Where it fits
- Schedule a nightly data pipeline that pulls data from a source, cleans it, and loads it into a data warehouse without any manual steps
- Monitor and automatically retry failed steps in a multi-stage data processing job through a visual web interface
- Orchestrate machine learning training jobs so that model training only starts after all data preparation steps finish successfully
- Re-run a historical pipeline for a specific past date range using Airflow's backfill feature to fill in missing data
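The backfill use case in the last item amounts to running the same pipeline once per date in a past range. The sketch below shows that idea in plain Python, without Airflow; the functions daily_run_dates, backfill, and load_partition, and the warehouse/sales path layout, are hypothetical and exist only for illustration.

```python
from datetime import date, timedelta


def daily_run_dates(start: date, end: date):
    """Yield one date per day from start to end, inclusive."""
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)


def backfill(run_for_date, start: date, end: date):
    """Call run_for_date once per day in the range and collect the results."""
    return {d.isoformat(): run_for_date(d) for d in daily_run_dates(start, end)}


# Hypothetical pipeline step: pretend to load the partition for one day.
def load_partition(day: date) -> str:
    return f"warehouse/sales/{day:%Y/%m/%d}"


results = backfill(load_partition, date(2024, 1, 30), date(2024, 2, 1))
```

Passing the date into the step, rather than reading the current clock inside it, is what lets a run for a past date produce the same output it would have produced on that day.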