dolphinscheduler

Java ★ 14k updated 3d ago

Apache DolphinScheduler is a modern data orchestration platform for quickly building high-performance workflows with little code.

An open-source workflow scheduler for data pipelines. Build, run, and monitor multi-step data tasks visually without much code, at a scale of tens of millions of tasks per day.

Java · Python · Docker · Kubernetes · setup: hard · complexity 4/5

Apache DolphinScheduler is a tool for planning and running data workflows. A workflow is a series of tasks that need to run in a specific order, for example: pull data from a database, transform it, then load it somewhere else. DolphinScheduler handles the scheduling of those tasks, tracks dependencies between them, and keeps everything running reliably.
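To make the idea of dependency ordering concrete, here is a minimal sketch in plain Python. It does not use DolphinScheduler or its Python interface; the task names and the `execution_order` helper are hypothetical, and the standard-library `graphlib` module stands in for whatever the scheduler does internally.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical task names. Each key maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),          # pull data from a database
    "transform": {"extract"},  # runs only after extract finishes
    "load": {"transform"},     # runs only after transform finishes
}


def execution_order(deps):
    """Return one valid run order in which every task follows its dependencies."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)  # ['extract', 'transform', 'load']
```

A real scheduler adds timing, retries, and distribution across servers on top of this ordering step, but the core constraint is the same: a task runs only after everything it depends on has completed.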

You build workflows through a visual drag-and-drop interface in a web browser, without writing much code. There is also a Python programming interface and an API for teams that prefer to manage things programmatically. The tool supports a wide range of task types out of the box, meaning you can connect it to many common data systems without custom plugins.

It is built to handle large volumes of work. The README states it can process tens of millions of tasks per day and performs several times faster than comparable tools. It uses a distributed architecture where multiple servers share the load, so you can add more capacity by adding more machines rather than replacing existing hardware.
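As a rough sense of scale, the daily figure can be converted into a per-second rate. The sketch below is plain arithmetic, not a benchmark: the 10 million figure is the low end of "tens of millions", the worker count of 10 is a made-up example, and the even spread of load across the day and across workers is an assumption.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def tasks_per_second(tasks_per_day):
    """Average task rate, assuming load is spread evenly over the day."""
    return tasks_per_day / SECONDS_PER_DAY


def per_worker_rate(tasks_per_day, workers):
    """Average rate each worker handles, assuming an even split (hypothetical)."""
    return tasks_per_second(tasks_per_day) / workers


print(round(tasks_per_second(10_000_000), 1))     # 115.7
print(round(per_worker_rate(10_000_000, 10), 1))  # 11.6
```

Real workloads are rarely this even, so peak rates will be higher than these averages. The point is only that adding machines lowers the share each one has to carry.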

You can run it in several ways: as a single-server setup for quick evaluation, as a cluster for production use, or inside Docker or Kubernetes container environments. It supports connecting to many external databases including MySQL, PostgreSQL, Hive, and Trino. There is also built-in monitoring so you can see server health and resource usage in a browser without logging into the machines directly.

The project is part of the Apache Software Foundation and is open source under the Apache 2.0 license.

Where it fits

Schedule a daily ETL pipeline that pulls from a database, transforms data, and loads it to a warehouse using a visual drag-and-drop editor.
Monitor the health and resource usage of your data workflows from a browser dashboard without SSHing into servers.
Run high-volume data workflows using distributed cluster mode to process tens of millions of tasks per day.
