spark

Scala ★ 43k updated 2h ago

Apache Spark - A unified analytics engine for large-scale data processing

Apache Spark is a fast, large-scale data processing engine that keeps data in memory to run ETL pipelines, machine learning, SQL queries, and real-time stream processing across clusters.

Scala · Java · Python · R · SQL · setup: hard · complexity 5/5

Apache Spark is a unified analytics engine built for large-scale data processing. It was designed to address the limitations of MapReduce by keeping data in memory across computation stages, where it fits, rather than writing intermediate results to disk. Avoiding that repeated disk I/O makes it much faster for iterative workloads like machine learning and interactive queries, which scan the same data many times.
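The effect of keeping an intermediate dataset in memory can be illustrated with a plain-Python analogy. This is not Spark code: the `Dataset` class, its `cache()` method, and `iterate_mean` are hypothetical names invented for this sketch, and the "expensive load" is only simulated by counting how often the source is re-read.

```python
class Dataset:
    """Toy stand-in for a distributed dataset (hypothetical, not a Spark class)."""

    def __init__(self, loader):
        self._loader = loader       # function that (expensively) produces the records
        self._cached = None         # in-memory copy, filled on first read if caching is on
        self.cache_enabled = False
        self.loads = 0              # how many times the loader actually ran

    def cache(self):
        """Ask for the records to be kept in memory after the first read."""
        self.cache_enabled = True
        return self

    def records(self):
        # Serve from memory when a cached copy exists.
        if self.cache_enabled and self._cached is not None:
            return self._cached
        # Otherwise re-read the source, which is the costly step.
        self.loads += 1
        data = list(self._loader())
        if self.cache_enabled:
            self._cached = data
        return data


def iterate_mean(dataset, passes):
    """Run `passes` iterations, each scanning the full dataset once."""
    estimate = 0.0
    for _ in range(passes):
        values = dataset.records()
        estimate = sum(values) / len(values)
    return estimate


# Same workload, with and without the in-memory copy.
uncached = Dataset(lambda: range(1, 1001))
cached = Dataset(lambda: range(1, 1001)).cache()
iterate_mean(uncached, passes=10)
iterate_mean(cached, passes=10)
print("source reads without cache:", uncached.loads)
print("source reads with cache:", cached.loads)
```

In this toy, the uncached dataset re-reads its source on every pass while the cached one reads it once, which is the pattern that iterative algorithms benefit from.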

Spark provides high-level APIs in Scala, Java, Python, and R, so teams can work in whichever language fits their existing stack. The engine is divided into several integrated modules. Spark SQL lets you query structured data using SQL or a DataFrame API and integrates with Hive, Parquet, JSON, and other formats. MLlib offers scalable implementations of common machine learning algorithms. GraphX is the built-in library for graph computation. Structured Streaming brings the same DataFrame model to real-time data streams, enabling low-latency processing of data from Kafka, file, or socket sources.
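As a rough illustration of the Spark SQL module, the sketch below runs the same aggregation once as a SQL query and once through the DataFrame API, using PySpark in local mode. The sample rows, the column names (`user_id`, `event_type`, `hits`), the view name `events`, and the function `totals_per_user` are all made up for this example; it assumes PySpark is installed and a local Java runtime is available.

```python
# Hypothetical sample rows; the schema is illustrative only.
SAMPLE_EVENTS = [
    ("alice", "click", 3),
    ("bob", "view", 5),
    ("alice", "view", 2),
]

# The aggregation expressed as SQL against a temporary view named "events".
TOTALS_QUERY = """
    SELECT user_id, SUM(hits) AS total_hits
    FROM events
    GROUP BY user_id
    ORDER BY user_id
"""


def totals_per_user(rows=SAMPLE_EVENTS):
    """Return per-user totals twice: via SQL and via the DataFrame API."""
    # Imported lazily so this file can be loaded without PySpark installed.
    from pyspark.sql import SparkSession, functions as F

    spark = (
        SparkSession.builder
        .master("local[*]")          # run in-process, no cluster needed
        .appName("totals-sketch")
        .getOrCreate()
    )
    try:
        df = spark.createDataFrame(rows, ["user_id", "event_type", "hits"])

        # Route 1: register the DataFrame as a view and query it with SQL.
        df.createOrReplaceTempView("events")
        via_sql = [
            (r["user_id"], r["total_hits"])
            for r in spark.sql(TOTALS_QUERY).collect()
        ]

        # Route 2: the equivalent DataFrame API calls.
        via_api = [
            (r["user_id"], r["total_hits"])
            for r in df.groupBy("user_id")
                       .agg(F.sum("hits").alias("total_hits"))
                       .orderBy("user_id")
                       .collect()
        ]
        return via_sql, via_api
    finally:
        spark.stop()
```

Calling `totals_per_user()` in an environment with PySpark should return two lists that agree with each other, since both routes describe the same aggregation.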

You would choose Spark when your data is too large to process on a single machine and you need a framework that scales horizontally across a cluster. Typical use cases include ETL pipelines transforming terabytes of raw logs into clean datasets, training machine learning models on large datasets, running ad-hoc analytical SQL queries over data lakes, and processing event streams in near-real time. Spark runs on Hadoop YARN, Kubernetes, and in standalone mode (Apache Mesos support was deprecated in Spark 3.2 and removed in Spark 4.0), and it reads from and writes to cloud object stores like S3 and Azure Blob Storage through Hadoop-compatible connectors. Spark itself is written primarily in Scala, but Python via PySpark is the most widely used interface in data science teams.

Where it fits

Build ETL pipelines that transform terabytes of raw logs into clean datasets stored in a data lake.
Use Spark MLlib to train machine learning models on datasets too large to fit on a single machine.
Run ad-hoc SQL analytical queries over Parquet or JSON data stored on S3 or HDFS.
Process real-time event streams from Kafka with low latency using Spark Structured Streaming.
