luigi

Python ★ 19k updated 2d ago

Luigi is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization etc. It also comes with Hadoop support built in.

Luigi is a Python library for automating multi-step data pipelines. It runs tasks in the right order, skips completed steps, and handles failures, so you don't have to manage complex workflows by hand.

Python · Hadoop · Spark · Hive · Pig · setup: easy · complexity: 3/5

Luigi is a Python library for building and managing automated pipelines — sequences of tasks that need to run in a specific order, where each step depends on the results of previous ones. Think of it like a makefile for long-running data jobs: you describe what each task needs as input and what it produces as output, and Luigi handles running everything in the right order, skipping tasks that are already done, and retrying or reporting failures.
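The makefile analogy can be illustrated without Luigi at all. The sketch below is not Luigi's API or its implementation: `Step`, `build`, `write_text`, `sum_inputs`, and `demo` are hypothetical names, and "done" is assumed to mean "the output file already exists". It only shows the general idea of running dependencies first and skipping work whose output is already present.

```python
import os
import tempfile


class Step:
    """Hypothetical unit of work: an output path, upstream steps, and an action."""

    def __init__(self, output, requires=(), action=None):
        self.output = output
        self.requires = list(requires)
        self.action = action

    def complete(self):
        # Assumption: a step is done when its output file exists.
        return os.path.exists(self.output)


def build(step, log):
    """Run `step` after its dependencies, skipping it if already complete."""
    if step.complete():
        log.append(("skipped", os.path.basename(step.output)))
        return
    for dep in step.requires:
        build(dep, log)
    step.action(step)
    log.append(("ran", os.path.basename(step.output)))


def write_text(text):
    """Return an action that writes fixed text to the step's output."""
    def action(step):
        with open(step.output, "w") as f:
            f.write(text)
    return action


def sum_inputs(step):
    """Action that sums the integers found in every upstream output."""
    total = 0
    for dep in step.requires:
        with open(dep.output) as f:
            total += sum(int(line) for line in f if line.strip())
    with open(step.output, "w") as f:
        f.write(str(total))


def demo(workdir):
    """Build the same two-step pipeline twice and return both run logs."""
    raw = Step(os.path.join(workdir, "raw.txt"), action=write_text("3\n4\n"))
    total = Step(os.path.join(workdir, "total.txt"), requires=[raw], action=sum_inputs)
    first, second = [], []
    build(total, first)   # nothing exists yet, so both steps run
    build(total, second)  # total.txt exists, so the whole build is skipped
    return first, second


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        first, second = demo(workdir)
        print(first)
        print(second)
```

In this toy version the second build does nothing because `total.txt` is already on disk. Retries and failure reporting, which the description above also mentions, are left out of the sketch.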

It was originally developed at Spotify and used internally to run thousands of tasks every day, including machine learning jobs, data exports, and internal dashboards. The library is particularly suited for workflows that take hours or days to complete and involve many interdependent steps, such as processing large datasets or training models.

Luigi comes with support for common data infrastructure including Hadoop, Hive, Pig, and Spark jobs, as well as database operations. Every piece of logic — including the dependency graph — is written in plain Python rather than configuration files or domain-specific languages, which makes it easy to express complex dependencies like date-based calculations. A web interface is included for searching and visualizing the dependency graph and task statuses. Luigi is installed via pip.

Where it fits