awesome-data-engineering

★ 8.8k · updated 5d ago

A curated list of data engineering tools for software developers

A curated reference list of tools and technologies used in data engineering, covering databases, data ingestion, stream and batch processing, orchestration, monitoring, and community resources. No code, just organized links to relevant projects.

Apache Kafka · Apache Spark · PostgreSQL · Redis · MongoDB · Elasticsearch · setup: easy · complexity 1/5

This is a curated list of tools and technologies used in data engineering. Data engineering is the field focused on building and maintaining the systems that collect, store, move, and process large amounts of data so that analysts and data scientists can work with it. This repository does not contain code; it is a reference list of links to relevant projects, grouped by category.

The list covers databases of many types: relational databases like PostgreSQL and MySQL, key-value stores like Redis and DynamoDB, column-oriented databases like Cassandra and ClickHouse, document databases like MongoDB and Elasticsearch, graph databases like Neo4j, time series databases like InfluxDB, and distributed databases. Each entry is a short description with a link to the project.

Beyond storage, the list covers tools for moving and processing data. There are sections on data ingestion (tools for getting data from one place to another, such as Apache Kafka and Logstash), stream processing (handling data as it arrives in real time), and batch processing (working through large stored datasets, with tools like Apache Spark and Hadoop). There are also sections on file systems, which determine where data is stored, and serialization formats, which determine how it is encoded on disk.

The list extends into operational concerns, with sections on workflow orchestration tools (for scheduling and coordinating data pipelines), monitoring, data quality testing, and data profiling (understanding the shape and content of a dataset). There is also coverage of charts and dashboards, ELK stack tooling, and Docker-related resources.

At the end, the list points to community resources including forums, conferences, podcasts, and books related to data engineering. It is formatted as a standard Awesome list, a common GitHub convention for community-maintained reference collections. The full README is longer than this summary covers.

Where it fits

Quickly find the right database type for your data project (relational, document, graph, or time series) from one organized reference list
Discover stream or batch processing tools like Kafka or Spark when planning a new data pipeline
Find orchestration and monitoring tools to schedule and observe your data workflows
