Apache Spark – Quick Revision Notes

1. What is Apache Spark?

  • Open-source distributed computing system.
  • Processes large datasets quickly using in-memory computing.
  • Written in Scala; supports Java, Python (PySpark), R, and SQL.

2. Key Components

  • Spark Core: Base engine for distributed task scheduling, memory management, fault recovery.
  • Spark SQL: Executes SQL queries, supports DataFrames and integration with Hive.
  • Spark Streaming: Handles near-real-time stream processing in micro-batches; the newer Structured Streaming API builds on Spark SQL DataFrames.
  • MLlib: Machine learning library (classification, regression, clustering, etc.).
  • GraphX: For graph processing and computations.
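
A minimal PySpark sketch of two of these components working together, Spark SQL querying a DataFrame (the view name and sample rows are illustrative):

```python
from pyspark.sql import SparkSession

# Entry point for DataFrame/SQL programs; "local[*]" runs Spark on all local cores.
spark = SparkSession.builder.master("local[*]").appName("components-demo").getOrCreate()

# A tiny DataFrame with named columns.
df = spark.createDataFrame([("alice", 34), ("bob", 29)], ["name", "age"])

# Register it as a temporary view and query it with plain SQL via Spark SQL.
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()
```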

3. RDD (Resilient Distributed Dataset)

  • Immutable distributed collections of objects.
  • Lazily evaluated.
  • Two types of operations:
    • Transformations (e.g., map, filter): return new RDDs.
    • Actions (e.g., count, collect): return results.
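
A minimal PySpark sketch of the transformation/action split (values are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("rdd-demo").getOrCreate()
sc = spark.sparkContext

nums = sc.parallelize([1, 2, 3, 4, 5])        # an RDD: immutable and distributed
squares = nums.map(lambda x: x * x)           # transformation: returns a new RDD, runs nothing
evens = squares.filter(lambda x: x % 2 == 0)  # another transformation, still nothing runs
print(evens.collect())                        # action: triggers execution -> [4, 16]
```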

4. DataFrame & Dataset

  • DataFrame: Table-like data structure with named columns.
  • Dataset: Strongly typed, combines RDD and DataFrame features (Scala & Java only).
  • Both are optimized by the Catalyst query planner and the Tungsten execution engine, so they typically outperform hand-written RDD code (see the sketch below).
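
A short DataFrame sketch; explain() prints the physical plan chosen by Catalyst (column names and rows are illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.master("local[*]").appName("df-demo").getOrCreate()

df = spark.createDataFrame([("alice", 34), ("bob", 29)], ["name", "age"])
adults = df.filter(col("age") > 30).select("name")

adults.show()     # executes the optimized plan
adults.explain()  # shows the physical plan produced by the Catalyst optimizer
```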

5. Cluster Managers

  • Spark can run on:
    • Standalone mode
    • Apache Mesos (deprecated in newer Spark releases)
    • Hadoop YARN
    • Kubernetes
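
The cluster manager is selected through the master URL passed to the session builder. A sketch with the standard URL forms (host names and ports are placeholders):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("cluster-demo")
         # .master("spark://master-host:7077")         # standalone cluster
         # .master("mesos://mesos-host:5050")          # Apache Mesos
         # .master("yarn")                             # Hadoop YARN
         # .master("k8s://https://k8s-api-host:6443")  # Kubernetes
         .master("local[*]")                           # local mode for testing
         .getOrCreate())
```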

6. Execution Workflow

  1. The driver program creates a SparkSession (which wraps a SparkContext).
  2. Each action submits a job, which the DAG scheduler divides into stages at shuffle boundaries.
  3. Each stage is divided into tasks, one per partition.
  4. Tasks are executed by executors on worker nodes (see the sketch below).
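
A sketch of how a single action maps onto this workflow; reduceByKey forces a shuffle, so the job below splits into two stages:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("workflow-demo").getOrCreate()
sc = spark.sparkContext

pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
counts = pairs.reduceByKey(lambda x, y: x + y)  # shuffle here -> stage boundary

# collect() submits one job; Spark splits it into two stages (before and
# after the shuffle), each consisting of one task per partition.
print(counts.collect())  # e.g. [('a', 2), ('b', 1)]
```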

7. Common Transformations & Actions

  • Transformations: map, flatMap, filter, groupByKey, reduceByKey (prefer reduceByKey over groupByKey: it combines values on the map side before shuffling)
  • Actions: collect(), count(), first(), take(n), reduce()
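
The classic word count combines several of these (input lines are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("wordcount-demo").getOrCreate()
sc = spark.sparkContext

lines = sc.parallelize(["spark is fast", "spark is scalable"])
counts = (lines.flatMap(lambda line: line.split())  # transformation: one record per word
               .map(lambda word: (word, 1))         # transformation: (word, 1) pairs
               .reduceByKey(lambda a, b: a + b))    # transformation: sum counts per word

print(counts.count())  # action -> 4 distinct words
print(counts.take(2))  # action -> first two (word, count) pairs
```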

8. Lazy Evaluation

  • Spark records transformations without running them; execution is deferred until an action is called.
  • The recorded operations form a Directed Acyclic Graph (DAG), which Spark optimizes before execution (see the sketch below).
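
A small sketch making the laziness visible:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("lazy-demo").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(10)).map(lambda x: x + 1)  # nothing executes yet
big = rdd.filter(lambda x: x > 5)                     # still nothing: Spark only records the DAG

print(big.count())  # the action triggers DAG optimization and execution -> 5
```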

9. Fault Tolerance

  • Achieved using RDD lineage.
  • Lost partitions are rebuilt by replaying the recorded transformations on the original data (see the sketch below).
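
Each RDD carries its lineage, which can be inspected with toDebugString (the chain below is illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("lineage-demo").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize([1, 2, 3]).map(lambda x: x * 2).filter(lambda x: x > 2)

# If a partition of `rdd` is lost, Spark replays this recorded chain of
# transformations on the source data to rebuild it.
print(rdd.toDebugString().decode())  # prints the lineage graph
```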

10. Benefits of Spark

  • In-memory computation, typically far faster than disk-based Hadoop MapReduce, especially for iterative workloads.
  • Supports batch and real-time processing.
  • Rich APIs for multiple languages.
  • Scalable and fault-tolerant.