storm

Java ★ 8.8k updated 8y ago

Distributed and fault-tolerant realtime computation: stream processing, continuous computation, distributed RPC, and more

Storm is a distributed system for processing continuous streams of data in real time across a cluster of machines, reacting to each event as it arrives rather than processing in batches like Hadoop.

Java · setup: hard · complexity 4/5

Storm is a system for processing continuous streams of data in real time, spread across multiple computers working together. The core idea is that data keeps arriving in a constant flow, and Storm lets you write programs that react to each piece of data as it comes in rather than waiting to collect everything first and then processing it in bulk.

The README draws a comparison to Hadoop, a well-known tool for batch processing large datasets across multiple machines. Storm plays the analogous role for real-time data: it gives developers building blocks for splitting stream-processing work across many computers, keeping that work running even if some machines fail, and doing it all at high speed.

According to the repository description, Storm supports several use cases: stream processing (reacting to each event as it arrives), continuous computation (keeping running tallies or aggregations updated as data flows in), and distributed remote procedure calls (sending a request to be computed across many nodes and getting a result back quickly).
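To make the contrast with batch processing concrete, here is a minimal sketch of continuous computation in plain Java. It does not use Storm's API, and the class and method names (RunningCountSketch, onEvent, countFor) are hypothetical. It only shows the per-event pattern: each arriving event updates the tally immediately, so an up-to-date count is available at any moment.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only: plain Java, not Storm's API.
class RunningCountSketch {
    // Current tally per event type, updated as each event arrives.
    private final Map<String, Long> counts = new HashMap<>();

    // Called once per incoming event; returns the updated count for its key.
    long onEvent(String key) {
        return counts.merge(key, 1L, Long::sum);
    }

    // Reads the current tally without waiting for the stream to end.
    long countFor(String key) {
        return counts.getOrDefault(key, 0L);
    }

    public static void main(String[] args) {
        RunningCountSketch sketch = new RunningCountSketch();
        // A fixed list stands in for an unbounded stream of events.
        for (String event : Arrays.asList("click", "purchase", "click")) {
            long updated = sketch.onEvent(event);
            System.out.println(event + " -> " + updated);
        }
    }
}
```

In a real deployment the stream never ends and the state is partitioned across machines. This sketch keeps everything in one process to isolate the idea of updating results per event.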

Storm was designed to work with any programming language, not just Java. The original project was created by Nathan Marz and later donated to the Apache Software Foundation, which hosts the project's mailing lists and ongoing development. This GitHub repository is the original pre-Apache version.

The README is sparse on installation or configuration details and points readers to a separate wiki for documentation and tutorials. The project has been used by a number of companies, and a link to a list of those is included in the README.

Where it fits

Build a real-time analytics pipeline that reacts to each incoming event (a click, a transaction, a sensor reading) the moment it arrives.
Keep a running count or aggregation updated continuously as a stream of data flows in, without waiting for a nightly batch job.
Distribute a compute-heavy calculation across many machines and collect the combined result quickly using distributed remote procedure calls.
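The third use case follows a scatter-gather shape: split a request into parts, compute the parts in parallel, then combine the partial results. The sketch below illustrates that shape in plain Java, with threads in a local pool standing in for machines in a cluster. It is not Storm's distributed RPC API, and the names (ScatterGatherSketch, sumOfSquares) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration only: local threads stand in for cluster nodes.
class ScatterGatherSketch {
    // Sums i*i for i in [1, n] by splitting the range across `workers` tasks.
    static long sumOfSquares(int n, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            // Scatter: give each worker one contiguous chunk of the range.
            List<Future<Long>> partials = new ArrayList<>();
            int chunk = (n + workers - 1) / workers;
            for (int w = 0; w < workers; w++) {
                final int from = w * chunk + 1;
                final int to = Math.min(n, (w + 1) * chunk);
                Callable<Long> task = () -> {
                    long sum = 0;
                    for (long i = from; i <= to; i++) {
                        sum += i * i;
                    }
                    return sum;
                };
                partials.add(pool.submit(task));
            }
            // Gather: block until each partial result is ready, then combine.
            long total = 0;
            for (Future<Long> partial : partials) {
                total += partial.get();
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("worker failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(sumOfSquares(100, 4));
    }
}
```

The caller sees a single function call and a single answer, which is the appeal of the RPC style. A distributed system additionally has to handle network transport and machine failures, which this local sketch leaves out.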
