Apache Spark – Quick Revision Notes

1. What is Apache Spark?

  • Open-source distributed computing system.
  • Processes large datasets quickly using in-memory computing.
  • Written in Scala; supports Java, Python (PySpark), R, and SQL.

2. Key Components

  • Spark Core: Base engine for distributed task scheduling, memory management, fault recovery.
  • Spark SQL: Executes SQL queries, supports DataFrames and integration with Hive.
  • Spark Streaming: Handles near-real-time stream processing in micro-batches; the newer Structured Streaming API builds on Spark SQL DataFrames.
  • MLlib: Machine learning library (classification, regression, clustering, etc.).
  • GraphX: For graph processing and computations.
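
A minimal PySpark sketch of two of these components working together, Spark SQL querying a DataFrame (the view name and sample rows are illustrative):

```python
from pyspark.sql import SparkSession

# Entry point for DataFrame/SQL programs; "local[*]" runs Spark on all local cores.
spark = SparkSession.builder.master("local[*]").appName("components-demo").getOrCreate()

# A tiny DataFrame with named columns.
df = spark.createDataFrame([("alice", 34), ("bob", 29)], ["name", "age"])

# Register it as a temporary view and query it with plain SQL via Spark SQL.
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()
```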

3. RDD (Resilient Distributed Dataset)

  • Immutable distributed collections of objects.
  • Lazily evaluated.
  • Two types of operations:
    • Transformations (e.g., map, filter): return new RDDs.
    • Actions (e.g., count, collect): return results.
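
A minimal PySpark sketch of the transformation/action split (values are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("rdd-demo").getOrCreate()
sc = spark.sparkContext

nums = sc.parallelize([1, 2, 3, 4, 5])        # an RDD: immutable and distributed
squares = nums.map(lambda x: x * x)           # transformation: returns a new RDD, runs nothing
evens = squares.filter(lambda x: x % 2 == 0)  # another transformation, still nothing runs
print(evens.collect())                        # action: triggers execution -> [4, 16]
```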

4. DataFrame & Dataset

  • DataFrame: Table-like data structure with named columns.
  • Dataset: Strongly typed, combines RDD and DataFrame features (Scala & Java only).
  • Both are optimized by the Catalyst query planner and the Tungsten execution engine, so they typically outperform hand-written RDD code (see the sketch below).
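
A short DataFrame sketch; explain() prints the physical plan chosen by Catalyst (column names and rows are illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.master("local[*]").appName("df-demo").getOrCreate()

df = spark.createDataFrame([("alice", 34), ("bob", 29)], ["name", "age"])
adults = df.filter(col("age") > 30).select("name")

adults.show()     # executes the optimized plan
adults.explain()  # shows the physical plan produced by the Catalyst optimizer
```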

5. Cluster Managers

  • Spark can run on:
    • Standalone mode
    • Apache Mesos (deprecated in newer Spark releases)
    • Hadoop YARN
    • Kubernetes
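
The cluster manager is selected through the master URL passed to the session builder. A sketch with the standard URL forms (host names and ports are placeholders):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("cluster-demo")
         # .master("spark://master-host:7077")         # standalone cluster
         # .master("mesos://mesos-host:5050")          # Apache Mesos
         # .master("yarn")                             # Hadoop YARN
         # .master("k8s://https://k8s-api-host:6443")  # Kubernetes
         .master("local[*]")                           # local mode for testing
         .getOrCreate())
```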

6. Execution Workflow

  1. The driver program creates a SparkSession (which wraps a SparkContext).
  2. Each action submits a job, which the DAG scheduler divides into stages at shuffle boundaries.
  3. Each stage is divided into tasks, one per partition.
  4. Tasks are executed by executors on worker nodes (see the sketch below).
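
A sketch of how a single action maps onto this workflow; reduceByKey forces a shuffle, so the job below splits into two stages:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("workflow-demo").getOrCreate()
sc = spark.sparkContext

pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
counts = pairs.reduceByKey(lambda x, y: x + y)  # shuffle here -> stage boundary

# collect() submits one job; Spark splits it into two stages (before and
# after the shuffle), each consisting of one task per partition.
print(counts.collect())  # e.g. [('a', 2), ('b', 1)]
```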

7. Common Transformations & Actions

  • Transformations: map, flatMap, filter, groupByKey, reduceByKey (prefer reduceByKey over groupByKey: it combines values on the map side before shuffling)
  • Actions: collect(), count(), first(), take(n), reduce()
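
The classic word count combines several of these (input lines are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("wordcount-demo").getOrCreate()
sc = spark.sparkContext

lines = sc.parallelize(["spark is fast", "spark is scalable"])
counts = (lines.flatMap(lambda line: line.split())  # transformation: one record per word
               .map(lambda word: (word, 1))         # transformation: (word, 1) pairs
               .reduceByKey(lambda a, b: a + b))    # transformation: sum counts per word

print(counts.count())  # action -> 4 distinct words
print(counts.take(2))  # action -> first two (word, count) pairs
```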

8. Lazy Evaluation

  • Spark records transformations without running them; execution is deferred until an action is called.
  • The recorded operations form a Directed Acyclic Graph (DAG), which Spark optimizes before execution (see the sketch below).
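
A small sketch making the laziness visible:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("lazy-demo").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(10)).map(lambda x: x + 1)  # nothing executes yet
big = rdd.filter(lambda x: x > 5)                     # still nothing: Spark only records the DAG

print(big.count())  # the action triggers DAG optimization and execution -> 5
```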

9. Fault Tolerance

  • Achieved using RDD lineage.
  • Lost partitions are rebuilt by replaying the recorded transformations on the original data (see the sketch below).
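
Each RDD carries its lineage, which can be inspected with toDebugString (the chain below is illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("lineage-demo").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize([1, 2, 3]).map(lambda x: x * 2).filter(lambda x: x > 2)

# If a partition of `rdd` is lost, Spark replays this recorded chain of
# transformations on the source data to rebuild it.
print(rdd.toDebugString().decode())  # prints the lineage graph
```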

10. Benefits of Spark

  • In-memory computation, typically far faster than disk-based Hadoop MapReduce, especially for iterative workloads.
  • Supports batch and real-time processing.
  • Rich APIs for multiple languages.
  • Scalable and fault-tolerant.