
beam

Java ★ 8.6k updated 4h ago

Apache Beam is a unified programming model for Batch and Streaming data processing.

A framework for writing data processing pipelines once in Java, Python, or Go and running them on your laptop, Google Cloud Dataflow, Apache Spark, or Flink without changing the code.

Java · Python · Go · Apache Flink · Apache Spark · Google Cloud Dataflow · setup: moderate · complexity 4/5

Apache Beam is an open-source framework for writing data processing programs that can run on many different computing systems without changing the code. You write the logic once using Java, Python, or Go, and then choose where to run it: on your laptop for testing, or on a large distributed cluster for production workloads.

The core idea is that data processing follows a common pattern regardless of scale. You define a pipeline, which is a graph of steps. Each step takes a collection of data as input, does something to it (filter, transform, aggregate), and produces another collection as output. Beam calls these collections PCollections and the steps PTransforms. The same code works whether the data is a fixed file you process once (batch) or a live stream of events arriving continuously (streaming).

Because the code is separate from where it runs, Beam supports several execution environments called runners. The DirectRunner runs everything on your local machine, which is useful for development and testing. The DataflowRunner submits the job to Google Cloud Dataflow. The FlinkRunner and SparkRunner send it to Apache Flink or Apache Spark clusters respectively. Switching runners is a configuration change, not a code change.

This design grew out of earlier Google internal systems, including MapReduce, and out of the Dataflow Model, a stream processing model that Google researchers published in 2015. Beam brought that model into open source under the Apache Software Foundation.

The repository contains the SDK code for all three languages, the runner implementations, and a large set of example programs including a classic word-count example recommended for first-time users. Documentation and quickstart guides for each language are available on the official Apache Beam website.

Where it fits