Why Apache Spark Is a Must-Have Skill for Data Engineering

Rajesh Shah
4 min read · May 16, 2021

Overview

I started learning Apache Spark about a year ago. In this story I will try to explain why Apache Spark has become ubiquitous in enterprise big data processing, for both batch and streaming workloads. There are a ton of technologies in the data engineering space that have come and gone, but Apache Spark stands out as one that will last for years to come. Let's get started with some history and background. Apache Spark started as a research project at the UC Berkeley AMPLab in 2009, was open sourced in early 2010, and moved to the Apache Software Foundation in 2013. The fundamental goal behind developing the Apache Spark framework was to overcome the complexity and inefficiency of Hadoop MapReduce processing.

Understanding Spark Architecture

Spark uses distributed processing to parallelize data workloads for faster execution, with built-in fault tolerance and resiliency.

The Apache Spark architecture is based on two main abstractions:

Resilient Distributed Datasets (RDD)

— RDDs are read-only collections of data objects partitioned across different machines. They are fault-tolerant: an exact copy of a lost partition can be recomputed from scratch (from its lineage) in case of a process or node failure.

Directed Acyclic Graph (DAG)

— A DAG is a sequence of computations (tasks) performed on data, where each node is an RDD partition and each edge is a transformation applied on top of the data. The DAG abstraction eliminates the rigid multi-stage execution model of Hadoop MapReduce and provides performance improvements over Hadoop.
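
A minimal sketch of how this plays out in code, assuming a SparkSession named spark already exists (for example in spark-shell): transformations such as filter and map only extend the DAG, and nothing runs until an action is called.

```scala
// Assumes an existing SparkSession `spark` (e.g. launched via spark-shell)
val sc = spark.sparkContext

// An RDD of integers split into 4 partitions
val numbers = sc.parallelize(1 to 100, numSlices = 4)

// Transformations are lazy: they only add nodes/edges to the DAG
val evens   = numbers.filter(_ % 2 == 0)
val squares = evens.map(n => n.toLong * n)

// The action below triggers execution of the whole DAG;
// a lost partition can be recomputed from this lineage
println(squares.reduce(_ + _))
```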

Spark Driver

— The driver program (which runs on the client machine in client deploy mode, or inside the cluster in cluster deploy mode) schedules job execution and negotiates with the cluster manager. The driver contains components such as the DAGScheduler and TaskScheduler, which are responsible for translating Spark user code into the actual Spark jobs executed on the cluster.
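
As a rough sketch, the driver is simply the process running your application's main method. The snippet below is illustrative (the app name and job are made up, and the master URL is normally supplied by spark-submit):

```scala
import org.apache.spark.sql.SparkSession

object MyDriverApp {
  def main(args: Array[String]): Unit = {
    // The driver process starts here; it asks the cluster manager for executors
    val spark = SparkSession
      .builder()
      .appName("my-driver-app") // illustrative name
      .getOrCreate()            // master URL usually comes from spark-submit

    // This code runs on the driver; the distributed work is shipped to executors
    val count = spark.range(1000000).filter("id % 7 = 0").count()
    println(s"Multiples of 7: $count")

    spark.stop()
  }
}
```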

Cluster Manager

— An external service responsible for acquiring resources on the Spark cluster and allocating them to a Spark job. Spark currently supports several cluster managers (see the sketch after this list):

  • Standalone — a simple cluster manager included with Spark that makes it easy to set up a cluster.
  • Apache Mesos — a general cluster manager that can also run Hadoop MapReduce and service applications.
  • Hadoop YARN — the resource manager in Hadoop 2.
  • Kubernetes — an open-source system for automating deployment, scaling, and management of containerized applications.
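
Each cluster manager is identified by a master URL, typically passed via the --master option of spark-submit. The hedged sketch below sets it in code instead; all host names and ports are placeholders.

```scala
import org.apache.spark.sql.SparkSession

// Master URL formats (hosts/ports below are placeholders):
//   Standalone:  spark://master-host:7077
//   Mesos:       mesos://mesos-master:5050
//   YARN:        yarn
//   Kubernetes:  k8s://https://k8s-apiserver:6443
//   Local:       local[*]  (single JVM, handy for development)
val spark = SparkSession
  .builder()
  .appName("cluster-manager-demo")
  .master("spark://master-host:7077") // in practice prefer `spark-submit --master ...`
  .getOrCreate()
```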

Spark Executor

— An executor is a distributed agent responsible for executing tasks. Every Spark application has its own executor processes.
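
To make the driver/executor split concrete, here is a small sketch (assuming an existing SparkSession named spark): the function passed to foreachPartition is serialized by the driver and runs inside the executor JVMs, one invocation per partition.

```scala
// Assumes an existing SparkSession `spark`
val rdd = spark.sparkContext.parallelize(1 to 8, numSlices = 4)

rdd.foreachPartition { partition =>
  // This block runs on an executor; its println output lands in the
  // executor's stdout log, not in the driver's console
  println(s"Processing ${partition.size} records on this executor")
}
```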

Spark Task

— A task is a unit of work that is sent to one executor. A Spark job is divided into smaller tasks that run in parallel across executors to achieve performance efficiency. In the event of a task failure, the task is restarted to provide resiliency.
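
Because Spark creates one task per partition in each stage, the partition count of your data directly controls parallelism. A hedged sketch (the input path is a placeholder, and an existing SparkSession named spark is assumed):

```scala
// Assumes an existing SparkSession `spark`; the path is a placeholder
val df = spark.read.parquet("/data/events")

// One task per partition per stage
println(s"Partitions (and tasks per stage): ${df.rdd.getNumPartitions}")

// Repartitioning changes how many tasks later stages will run
val rebalanced = df.repartition(200)
```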

Spark Modules

SQL Module

Spark SQL is Apache Spark’s module for working with structured data. Spark SQL lets you query structured data inside Spark programs using SQL. This layer provides higher-level abstractions over RDDs through DataFrames. DataFrames and SQL provide a common way to access a variety of data sources, including Hive, Avro, Parquet, ORC, JSON, and JDBC. You can even join data across these sources using the Spark Dataset and DataFrame APIs.
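
For example, a sketch that reads two sources and joins them with plain SQL (file paths, table names, and column names are made up for illustration; an existing SparkSession named spark is assumed):

```scala
// Assumes an existing SparkSession `spark`; paths and columns are placeholders
val orders    = spark.read.json("/data/orders.json")
val customers = spark.read.parquet("/data/customers.parquet")

orders.createOrReplaceTempView("orders")
customers.createOrReplaceTempView("customers")

// The same query could be written with the DataFrame API (join/groupBy/agg)
val topCustomers = spark.sql(
  """SELECT c.name, SUM(o.amount) AS total_spent
    |FROM orders o
    |JOIN customers c ON o.customer_id = c.id
    |GROUP BY c.name
    |ORDER BY total_spent DESC
    |LIMIT 10""".stripMargin)

topCustomers.show()
```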

Streaming

Spark Streaming makes it easy to build scalable, fault-tolerant streaming applications. Because it runs on Spark, Spark Streaming lets you reuse the same code for batch processing, join streams against historical data, or run ad-hoc queries on stream state. The newer Structured Streaming engine, with its continuous processing mode, targets low-latency enterprise use cases down to single-digit millisecond latencies.
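
A minimal Structured Streaming sketch reading from Kafka and writing to the console (broker address and topic are placeholders, the Kafka connector package must be on the classpath, and an existing SparkSession named spark is assumed):

```scala
// Assumes an existing SparkSession `spark`; broker and topic are placeholders
val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  .load()

// Kafka records arrive as binary key/value columns
val events = stream.selectExpr("CAST(value AS STRING) AS payload")

val query = events.writeStream
  .format("console")       // console sink for illustration; real jobs write to Kafka, files, etc.
  .outputMode("append")
  .start()

query.awaitTermination()
```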

MLlib

MLlib is Apache Spark’s scalable machine learning library. MLlib contains many algorithms and utilities. Spark excels at iterative computation, enabling MLlib to run fast. The older RDD-based API (spark.mllib) is in maintenance mode in favor of the DataFrame-based spark.ml package, so we should expect most new features and enhancements to land there.
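
A hedged sketch of a spark.ml Pipeline using a tiny made-up training DataFrame (assuming an existing SparkSession named spark; feature and label columns are illustrative):

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler

// Tiny made-up training set: two numeric features and a binary label
val training = spark.createDataFrame(Seq(
  (1.0, 0.5, 1.0),
  (0.2, 1.5, 0.0),
  (2.0, 0.1, 1.0),
  (0.1, 2.0, 0.0)
)).toDF("f1", "f2", "label")

// Assemble raw columns into the vector column MLlib estimators expect
val assembler = new VectorAssembler()
  .setInputCols(Array("f1", "f2"))
  .setOutputCol("features")

val lr = new LogisticRegression()
  .setMaxIter(10)
  .setFeaturesCol("features")
  .setLabelCol("label")

val pipeline = new Pipeline().setStages(Array(assembler, lr))
val model    = pipeline.fit(training)

// model.transform(testData) would append prediction columns to a test DataFrame
```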

GraphX

GraphX is Apache Spark’s API for graphs and graph-parallel computation. GraphX unifies ETL, exploratory analysis, and iterative graph computation within a single system.
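
A minimal GraphX sketch building a tiny, made-up follower graph and running the built-in PageRank algorithm (assuming an existing SparkSession named spark):

```scala
import org.apache.spark.graphx.{Edge, Graph}

// Assumes an existing SparkSession `spark`; the tiny social graph is illustrative
val sc = spark.sparkContext

val users = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
val follows = sc.parallelize(Seq(
  Edge(1L, 2L, "follows"),
  Edge(3L, 2L, "follows"),
  Edge(2L, 3L, "follows")
))

val graph = Graph(users, follows)

// PageRank is one of GraphX's built-in graph algorithms
graph.pageRank(0.001).vertices.collect().foreach(println)
```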

So Why Apache Spark?

Speed

Apache Spark achieves high performance for both batch and streaming data, using a state-of-the-art DAG scheduler, a query optimizer, and a physical execution engine.

Ease of Use

Write applications quickly in Java, Scala, Python, R, and SQL. Spark offers over 80 high-level operators that make it easy to build parallel apps. And you can use it interactively from the Scala, Python, R, and SQL shells.

Generality

Combine SQL, streaming, and complex analytics. Spark powers a stack of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming. You can combine these libraries seamlessly in the same application.

Runs Everywhere

Spark runs on Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud. It can access diverse data sources.

Final Thoughts

It’s been a decade since the Apache Spark framework was open sourced. In that time, the framework has seen wide enterprise adoption and continues to grow. Very rarely does a framework survive 10 years in the fast-moving technology world. Databricks is the key force behind the success of Apache Spark. The team behind the framework continues to add new features such as low-latency structured streaming and ML/AI capabilities, and it provides unified capabilities for building data pipelines. With no real competition in sight, it looks like Apache Spark will remain the framework of choice for building batch and streaming big data pipelines for the foreseeable future.

Go build data pipelines using Apache Spark!

Disclaimer: This is a personal blog. The opinions expressed here represent my own and not those of my current or any previous employers.

Rajesh Shah

Software Engineer with 15+ years of experience (interested in Cloud Computing, Kubernetes, Docker, Serverless Computing, Blockchain Technologies)