Apache Spark – Quick Revision Notes
1. What is Apache Spark?
- Open-source distributed computing system.
- Processes large datasets quickly using in-memory computing.
- Written in Scala; supports Java, Python (PySpark), R, and SQL.
2. Key Components
- Spark Core: Base engine for distributed task scheduling, memory management, fault recovery.
- Spark SQL: Executes SQL queries, supports DataFrames, and integrates with Hive (see the example after this list).
- Spark Streaming: Handles real-time data streams.
- MLlib: Machine learning library (classification, regression, clustering, etc.).
- GraphX: For graph processing and computations.
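As a quick illustration of the Spark SQL component above, here is a minimal PySpark sketch (the view name, column names, and sample rows are made up for illustration):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-demo").getOrCreate()

# Build a tiny DataFrame and expose it to SQL as a temporary view
people = spark.createDataFrame([("Alice", 34), ("Bob", 45)], ["name", "age"])
people.createOrReplaceTempView("people")

# Spark SQL runs plain SQL over the registered view
spark.sql("SELECT name FROM people WHERE age > 40").show()
```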
3. RDD (Resilient Distributed Dataset)
- Immutable distributed collections of objects.
- Lazily evaluated.
- Two types of operations:
- Transformations (e.g., map, filter): return new RDDs.
- Actions (e.g., count, collect): return results (see the sketch below).
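A minimal PySpark sketch of the two kinds of operations (the numbers are arbitrary sample data):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-demo").getOrCreate()
sc = spark.sparkContext

nums = sc.parallelize([1, 2, 3, 4, 5])        # an immutable, distributed RDD

# Transformations: lazily return new RDDs; nothing executes yet
squares = nums.map(lambda x: x * x)
evens = squares.filter(lambda x: x % 2 == 0)

# Actions: trigger execution and return results to the driver
print(evens.collect())   # [4, 16]
print(nums.count())      # 5
```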
4. DataFrame & Dataset
- DataFrame: Table-like data structure with named columns.
- Dataset: Strongly typed, combines RDD and DataFrame features (Scala & Java only).
- DataFrames and Datasets are more optimized than RDDs, via the Catalyst query optimizer and the Tungsten execution engine (sketch below).
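In PySpark only the DataFrame API is available (the typed Dataset API exists in Scala and Java). A minimal sketch with made-up rows:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("df-demo").getOrCreate()

# A DataFrame: table-like, with named columns (schema inferred here)
df = spark.createDataFrame([("Alice", 34), ("Bob", 45), ("Cara", 29)], ["name", "age"])

# The query is planned by Catalyst before it runs
result = df.filter(col("age") > 30).select("name")
result.explain()   # inspect the optimized physical plan
result.show()
```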
5. Cluster Managers
- Spark can run on any of the following (master-URL sketch after this list):
- Standalone mode
- Apache Mesos
- Hadoop YARN
- Kubernetes
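The cluster manager is selected via the master URL when the session is built. A hedged sketch (the hosts and ports below are placeholders, not real endpoints):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("cluster-demo")
    .master("local[*]")                   # local mode, one JVM, all cores
    # .master("spark://host:7077")        # Spark standalone cluster
    # .master("mesos://host:5050")        # Apache Mesos
    # .master("yarn")                     # Hadoop YARN
    # .master("k8s://https://host:443")   # Kubernetes
    .getOrCreate()
)
```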
6. Execution Workflow
- Driver program runs SparkContext.
- Jobs are divided into stages.
- Stages are divided into tasks.
- Tasks are executed on worker nodes (see the sketch below).
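A small sketch of how that workflow maps onto code (the partition count and data are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("workflow-demo").getOrCreate()
sc = spark.sparkContext          # entry point created by the driver

rdd = sc.parallelize(range(100), numSlices=4)   # 4 partitions -> 4 tasks per stage

pairs = rdd.map(lambda x: (x % 3, x))           # narrow transformation: stays in one stage
sums = pairs.reduceByKey(lambda a, b: a + b)    # shuffle -> stage boundary

# The action below submits one job; Spark splits it into two stages here,
# and each stage into one task per partition, executed on the workers.
print(sums.collect())
```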
7. Common Transformations & Actions
- Transformations: map, flatMap, filter, groupByKey, reduceByKey
- Actions: collect(), count(), first(), take(n), reduce() (word-count sketch below)
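A compact word-count sketch exercising several of these operations (the input strings are made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ops-demo").getOrCreate()
sc = spark.sparkContext

lines = sc.parallelize(["spark is fast", "spark is scalable"])

counts = (
    lines.flatMap(lambda line: line.split(" "))   # one word per element
         .map(lambda word: (word, 1))             # pair each word with 1
         .reduceByKey(lambda a, b: a + b)         # sum counts per word
)

print(counts.collect())   # e.g. [('spark', 2), ('is', 2), ('fast', 1), ('scalable', 1)]
print(counts.count())     # 4 distinct words
print(counts.take(2))     # any two (word, count) pairs
```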
8. Lazy Evaluation
- Spark delays execution until an action is called.
- The whole job is optimized as a Directed Acyclic Graph (DAG) of transformations before anything runs (sketch below).
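One way to see laziness is to time the two phases (the timings are machine-dependent; the point is the contrast):

```python
import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-demo").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(1_000_000))

t0 = time.time()
pipeline = rdd.map(lambda x: x * 2).filter(lambda x: x > 10)
print(f"transformations: {time.time() - t0:.4f}s")   # near-instant: only the DAG was built

t0 = time.time()
pipeline.count()                                     # the action runs the whole DAG
print(f"action: {time.time() - t0:.4f}s")            # this is where the work happens
```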
9. Fault Tolerance
- Achieved using RDD lineage.
- Lost partitions can be recomputed by replaying the recorded transformations (sketch below).
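The lineage an RDD carries can be inspected directly; a minimal sketch:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lineage-demo").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(10)).map(lambda x: x + 1).filter(lambda x: x % 2 == 0)

# toDebugString() shows the chain of transformations Spark recorded;
# if a partition is lost, Spark replays this chain to rebuild it.
debug = rdd.toDebugString()
print(debug.decode() if isinstance(debug, bytes) else debug)
```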
10. Benefits of Spark
- In-memory computation (typically much faster than disk-based Hadoop MapReduce).
- Supports batch and real-time processing.
- Rich APIs for multiple languages.
- Scalable and fault-tolerant.