iceberg

Java · ★ 9.0k · updated 3h ago

Apache Iceberg

A table format standard that lets multiple data tools like Spark and Flink safely read and write the same massive dataset at once, with transactions, row-level updates, and version history.

Java · Spark · Flink · Python · Go · Rust · setup: hard · complexity: 4/5

Apache Iceberg is a table format for storing and querying very large datasets. Think of it as a standardized way to organize enormous collections of data files on disk or in cloud storage so that several analysis tools can safely read from and write to the same data, even at the same time.

The problem it solves is that large-scale data analysis typically involves many separate tools. One tool might be reading sales records while another is writing new ones, or different teams might use different processing engines depending on what they are comfortable with. Without a shared format that understands transactions and versioning, these tools can conflict with each other or produce inconsistent results. Iceberg provides a stable specification that tools like Spark, Flink, Trino, Presto, Hive, and Impala can all integrate with, giving them a consistent view of the data.
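To make the idea of safe concurrent writes concrete, here is a minimal sketch of an optimistic commit: a writer records the version it read, and its commit succeeds only if the table is still at that version. This is a toy in-memory model, not the Iceberg API. The names ToyTable, CommitConflict, and append_with_retry are hypothetical, and a lock stands in for whatever atomic swap a real catalog would provide.

```python
import threading


class CommitConflict(Exception):
    """Raised when the table changed after the writer read it."""


class ToyTable:
    """Toy stand-in for a table whose state is one version pointer."""

    def __init__(self):
        self._lock = threading.Lock()  # assumption: stands in for an atomic swap
        self.version = 0
        self.files = ()  # immutable tuple of data file names

    def read(self):
        # Readers always see one complete, consistent version.
        with self._lock:
            return self.version, self.files

    def commit(self, base_version, new_files):
        # Succeed only if nobody else committed since base_version was read.
        with self._lock:
            if self.version != base_version:
                raise CommitConflict(
                    f"expected v{base_version}, found v{self.version}"
                )
            self.files = self.files + tuple(new_files)
            self.version += 1
            return self.version


def append_with_retry(table, new_files, max_attempts=3):
    # On conflict, re-read the latest state and try the commit again.
    for _ in range(max_attempts):
        base_version, _ = table.read()
        try:
            return table.commit(base_version, new_files)
        except CommitConflict:
            continue
    raise CommitConflict("gave up after repeated conflicts")
```

In this sketch a writer holding a stale version gets a CommitConflict instead of silently overwriting another writer's work, which is the property that keeps tools from corrupting each other's results.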

Iceberg also handles features you would expect from a proper database table: you can update or delete individual rows, roll back to a previous version of the data if something goes wrong, and run queries efficiently without scanning every file. These capabilities are unusual for file-based storage systems, which traditionally treat data as append-only.
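Two of these capabilities, rolling back and skipping files during a query, can be illustrated with a small sketch. It assumes a table is a list of immutable snapshots, each naming a set of data files, and that each file carries the minimum and maximum value of one column. Everything here (ToySnapshotTable, DataFile, the "day" column) is hypothetical and illustrative only; it is not the Iceberg API or its on-disk layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    path: str
    min_day: int  # smallest value of a hypothetical "day" column in the file
    max_day: int  # largest value of that column in the file


class ToySnapshotTable:
    def __init__(self):
        self.snapshots = [()]  # snapshot 0 is the empty table
        self.current = 0

    def append(self, *files):
        # A write never edits an old snapshot; it adds a new one.
        self.snapshots.append(self.snapshots[self.current] + files)
        self.current = len(self.snapshots) - 1
        return self.current

    def rollback_to(self, snapshot_id):
        # Rolling back only moves the pointer; no data files are rewritten.
        if not 0 <= snapshot_id < len(self.snapshots):
            raise ValueError(f"unknown snapshot {snapshot_id}")
        self.current = snapshot_id

    def plan_scan(self, day):
        # Keep only files whose [min_day, max_day] range could contain `day`.
        return [
            f.path
            for f in self.snapshots[self.current]
            if f.min_day <= day <= f.max_day
        ]


table = ToySnapshotTable()
good = table.append(
    DataFile("jan.parquet", 1, 31),
    DataFile("feb.parquet", 32, 59),
)
table.append(DataFile("bad.parquet", 1, 59))  # the bad pipeline run
table.rollback_to(good)
print(table.plan_scan(40))  # ['feb.parquet']
```

Because old snapshots are never modified, undoing the bad write is a pointer change, and the scan for day 40 reads one file instead of every file in the table.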

This repository is the reference Java implementation. There are also separate community implementations in Go, Python, Rust, and C++ for teams using other languages. The Java library is what most processing engines integrate against directly.

This is infrastructure-level software for data engineering teams. It is not an end-user application but a core component in data warehouse and analytics platform stacks.

Where it fits

Build a data lakehouse where Spark writes new records and Trino queries them simultaneously without data corruption.
Roll back a large dataset to a previous version after a bad pipeline run overwrites critical records.
Share a single dataset between teams using different processing engines like Flink and Presto without format conflicts.
Update or delete individual rows in file-based storage without rewriting entire data files.
