delta

Scala ★ 8.9k updated 1d ago

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive, and APIs for Scala, Java, Python, Rust, and Ruby

Open-source storage layer that adds database-like reliability, concurrent writes, rollback, and time travel queries to large data files stored in cloud object storage such as S3 or Azure Data Lake.

Scala · Java · Python · Rust · Apache Spark · Flink · Trino · Hive · setup: hard · complexity 4/5

Delta Lake is an open-source storage layer designed to sit on top of existing data storage systems (such as cloud object storage like Amazon S3 or Azure Data Lake) and add capabilities that those systems do not provide on their own. The most important of those capabilities is the ability to treat large data files more like a database: you can read and write data reliably even when multiple processes are doing so at the same time, roll back to an earlier version of a dataset if something goes wrong, and get consistent results when querying data that is being updated by another job simultaneously.
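The concurrent-write safety described above can be illustrated with a toy sketch of optimistic concurrency. This is not Delta Lake's code: the helper names `try_commit`, `latest_version`, and `commit_with_retry` are hypothetical, and the sketch assumes a local filesystem where exclusive file creation is atomic. It also skips the step a real system would need, which is checking whether the winning commit actually conflicts with the losing writer's changes.

```python
import json
import os
import tempfile


def try_commit(log_dir, version, actions):
    """Attempt to claim `version` by creating its log file exclusively.

    Returns True on success, False if another writer already claimed it.
    """
    path = os.path.join(log_dir, f"{version:020d}.json")
    try:
        # Mode "x" fails if the file already exists, so only one writer wins.
        with open(path, "x") as f:
            for action in actions:
                f.write(json.dumps(action) + "\n")
        return True
    except FileExistsError:
        return False


def latest_version(log_dir):
    """Return the highest committed version, or -1 for an empty log."""
    versions = [int(n.split(".")[0]) for n in os.listdir(log_dir) if n.endswith(".json")]
    return max(versions, default=-1)


def commit_with_retry(log_dir, actions, max_attempts=10):
    """Optimistically commit at the next free version, retrying on conflict."""
    for _ in range(max_attempts):
        version = latest_version(log_dir) + 1
        if try_commit(log_dir, version, actions):
            return version
    raise RuntimeError("too many conflicting writers")


log_dir = tempfile.mkdtemp()
# Two writers both believe version 0 is free; only the first claim succeeds.
first = try_commit(log_dir, 0, [{"add": {"path": "part-a.parquet"}}])
second = try_commit(log_dir, 0, [{"add": {"path": "part-b.parquet"}}])
# The losing writer re-reads the log and lands on the next version instead.
retried = commit_with_retry(log_dir, [{"add": {"path": "part-b.parquet"}}])
print(first, second, retried)  # True False 1
```

The point of the design is that neither writer takes a lock up front: each one prepares its changes, then races to claim the next version number, and the loser simply retries against the updated log.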

The project is particularly common in data engineering and analytics contexts, where teams store massive amounts of structured data and process it with tools like Apache Spark, Flink, Trino, or Hive. Delta Lake integrates with all of those tools through connectors, so existing data pipelines can adopt it without a complete rewrite. APIs are available for Scala, Java, Python, Rust, and Ruby.

At a technical level, Delta Lake achieves its reliability guarantees by maintaining a transaction log alongside the actual data files. Every write to a Delta table is recorded as a transaction in that log. The log is what enables time travel (querying the state of a table at a past point in time) and concurrent write safety, and its versioned format is what lets newer versions of the software always read tables written by older versions.
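A minimal sketch of how replaying such a log can yield time travel, loosely modeled on the zero-padded JSON commit files with `add` and `remove` actions that Delta Lake keeps in a table's `_delta_log` directory. The helpers `write_commit` and `snapshot` are hypothetical names for illustration, and the sketch leaves out the checkpoints, metadata, and protocol actions a real log contains.

```python
import json
import os
import tempfile


def write_commit(log_dir, version, actions):
    """Write one commit as a zero-padded, newline-delimited JSON file."""
    path = os.path.join(log_dir, f"{version:020d}.json")
    with open(path, "x") as f:
        for action in actions:
            f.write(json.dumps(action) + "\n")


def snapshot(log_dir, as_of=None):
    """Replay commits in order (up to `as_of`) and return the live data files."""
    live = set()
    # Zero-padded names sort lexicographically in version order.
    for name in sorted(os.listdir(log_dir)):
        if not name.endswith(".json"):
            continue
        version = int(name.split(".")[0])
        if as_of is not None and version > as_of:
            break
        with open(os.path.join(log_dir, name)) as f:
            for line in f:
                action = json.loads(line)
                if "add" in action:
                    live.add(action["add"]["path"])
                elif "remove" in action:
                    live.discard(action["remove"]["path"])
    return live


log_dir = tempfile.mkdtemp()
write_commit(log_dir, 0, [{"add": {"path": "part-0.parquet"}}])
write_commit(log_dir, 1, [{"add": {"path": "part-1.parquet"}}])
# Version 2 rewrites part-0: the old file is logically removed, not deleted.
write_commit(log_dir, 2, [{"remove": {"path": "part-0.parquet"}},
                          {"add": {"path": "part-2.parquet"}}])

current = sorted(snapshot(log_dir))
at_v1 = sorted(snapshot(log_dir, as_of=1))
print(current)  # ['part-1.parquet', 'part-2.parquet']
print(at_v1)    # ['part-0.parquet', 'part-1.parquet']
```

Because old data files are only marked as removed in the log rather than deleted on the spot, a reader that stops replaying at an earlier version still sees a complete, consistent table.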

The project originated at Databricks and is now part of the Linux Foundation. It has a companion ecosystem of related repositories covering Rust bindings, data sharing, and Kafka ingestion. The core library here is written in Scala and, for most use cases, relies on Apache Spark as its primary compute engine. It is released under the Apache License 2.0.

Where it fits

Store large datasets in S3 and safely update them from multiple concurrent jobs without corrupting data.
Query the state of a data table as it looked at a past point in time using the time travel feature.
Replace brittle CSV or Parquet pipelines with a transaction-safe format that supports rollback if a bad write occurs.
Integrate Delta Lake into an existing Apache Spark or Flink pipeline without rewriting data processing code.
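The rollback use case above can be pictured with a small in-memory model. This is an illustrative sketch only, under the assumption that a rollback is recorded as one more commit that undoes the difference between the current table and the target version; `snapshot` and `rollback` are hypothetical helpers, not Delta Lake APIs, and the log here is just a Python list of commits.

```python
def snapshot(log, as_of=None):
    """Replay commits 0..as_of (all commits if None) into the set of live files."""
    end = len(log) if as_of is None else as_of + 1
    live = set()
    for actions in log[:end]:
        for kind, path in actions:
            if kind == "add":
                live.add(path)
            else:  # "remove"
                live.discard(path)
    return live


def rollback(log, to_version):
    """Append a commit that makes the table match `to_version` again."""
    target, current = snapshot(log, to_version), snapshot(log)
    actions = [("remove", p) for p in sorted(current - target)]
    actions += [("add", p) for p in sorted(target - current)]
    log.append(actions)
    return len(log) - 1  # version number of the rollback commit


log = [
    [("add", "good-0.parquet")],                               # version 0
    [("remove", "good-0.parquet"), ("add", "bad-1.parquet")],  # version 1: bad write
]
new_version = rollback(log, to_version=0)
restored = sorted(snapshot(log))
print(new_version, restored)  # 2 ['good-0.parquet']
```

Modeling rollback as a new commit keeps the history intact: the bad write at version 1 is still visible to anyone who replays the log up to that version.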
