Section · 5 chapters · 22 lessons

PySpark

Understand the cluster, the DAG, the Catalyst optimizer, and the shuffle that ate your job. Learn PySpark the way platform engineers actually run it.

The PySpark journey

Chapter 01
Big Data Background
4 lessons
01 What 'big data' actually means · Past the marketing. "Big" means data that won't fit in one machine's RAM, or processing that takes hours on one box. Everything else is just data. · Beginner · 5 min
02 Hadoop overview · HDFS, MapReduce, YARN. Why Hadoop won in 2010, why Spark replaced MapReduce, and why the mental model still matters in a cloud-native world. · Beginner · 7 min
03 HDFS architecture · How HDFS stores a petabyte across hundreds of cheap disks without losing data. The mental model still applies to S3, Iceberg, and Delta. · Beginner · 7 min
04 The Spark ecosystem · Spark Core, Spark SQL, Structured Streaming, MLlib, GraphX. Plus Delta, Iceberg, Databricks, dbt, and where each piece fits in a modern stack. · Beginner · 5 min
Chapter 02
Spark Architecture
4 lessons
05 Driver & executors · Three actors run every Spark job: Driver, Executors, Cluster Manager. Knowing who does what is the difference between debugging in minutes vs hours. · Intermediate · 7 min
06 Jobs → stages → tasks · Every PySpark line eventually becomes a Job → Stages → Tasks. Knowing the hierarchy is how you read the Spark UI and find the slow stage. · Intermediate · 7 min
07 DAG & lazy evaluation · PySpark transformations don't run when you write them. They build a DAG. Actions trigger the actual work. Master this and Spark stops surprising you. · Intermediate · 6 min
08 Shuffles · When data moves across executors, you pay in disk, network, and time. Knowing what triggers a shuffle is half of Spark performance tuning. · Advanced · 8 min
Chapter 03
DataFrames
5 lessons
09 DataFrame intro · Spark's DataFrame is a schema-aware Dataset[Row] (the untyped API, in Scala terms) with a query optimizer behind it. It replaced RDDs for almost everything, and it's the one API you shoul… · Beginner · 6 min
10 Schemas · Letting Spark infer a schema is convenient. It's also a slow, sometimes-wrong default. Explicit StructTypes are the production answer. · Intermediate · 6 min
11 Joins in Spark · Inner, left, semi, anti, and the only thing that really matters for performance: broadcast vs sort-merge. Pick the strategy and Spark stops being… · Intermediate · 8 min
12 Window functions · Running totals, ranks, lag/lead: the same window functions you know from SQL, in the DataFrame API. The pattern that solves half of analytics work. · Advanced · 7 min
13 Pandas UDFs · Plain Python UDFs serialize one row at a time over a slow JVM-Python boundary. Pandas UDFs use Apache Arrow and operate on whole batches, and they… · Advanced · 7 min
Chapter 04
Internals & Optimization
5 lessons
14 Catalyst optimizer · Catalyst rewrites your DataFrame pipeline before it runs. Predicate pushdown, column pruning, constant folding: the work senior engineers used to… · Advanced · 8 min
15 Reading execution plans · `.explain()` shows the physical plan Spark will run. Learn to read it top-down and you can debug almost any slow job: find the shuffle, find the b… · Advanced · 7 min
16 Adaptive Query Execution · Catalyst optimizes before execution. AQE optimizes during. Three runtime tricks that make naive code fast, and the reason Spark 3 doesn't need as… · Advanced · 6 min
17 Partitioning · Two meanings of "partition" you need to keep straight. Runtime partitions split tasks across executors; storage partitions split files on disk and… · Advanced · 7 min
18 Skew & salting · One partition has 100x the data. The whole stage waits for that one executor. Skew is the most common production Spark problem, and it has known f… · Advanced · 7 min
Chapter 05
Databricks in Production
4 lessons
19 Databricks platform overview · The managed Spark platform that most data teams now deploy to. Workspaces, clusters, Unity Catalog, Photon, and an honest look at when it's worth… · Intermediate · 9 min
20 Delta Lake & MERGE · The storage layer that turned data lakes into something you can actually trust. Time travel, schema enforcement, and the MERGE that finally made up… · Intermediate · 10 min
21 Jobs, Asset Bundles, scheduling · Notebooks are great for prototyping, terrible for production. Here's how teams actually ship Spark jobs on Databricks: Asset Bundles, task DAGs, a… · Intermediate · 10 min
22 MLflow + Unity Catalog serving · Databricks invented MLflow, then built the rest of an ML platform around it. Tracking, Unity Catalog model registry, and serving endpoints: what's… · Advanced · 9 min

Make it stick — pass every quiz.

Each lesson has a short quiz at the bottom. Passing the quiz is what marks the lesson complete and counts toward your certificate.

Certificates are earned per learning path, not per section. Here's where this section takes you:
- Data Engineer: Python + SQL + Spark, with the warehouse knowledge to glue them together.
- MLOps / Platform: Docker, CI/CD, MLflow, serving, Kubernetes. The glue that runs production ML.

FAQ · Common questions

PySpark — frequently asked questions

Straight answers to the questions people ask most about PySpark.

When do I actually need Spark instead of Pandas?

Reach for Spark when your data is too large for one machine's memory or you need a cluster to process it in parallel — typically tens of gigabytes and up. For data that fits comfortably in RAM, Pandas (or Polars) is simpler and faster; Spark's distributed overhead only pays off at scale.

What does lazy evaluation mean in Spark?

Spark doesn't run transformations as you write them. It builds a plan and executes only when an action (like `count`, `collect`, or `write`) is called. This lets the Catalyst optimizer reorder and combine steps, but it also means some errors surface only at the action.

What's the difference between a transformation and an action in Spark?

Transformations (`select`, `filter`, `join`, `groupBy`) describe what to compute and return a new DataFrame lazily; actions (`count`, `show`, `collect`, `write`) trigger execution and return results. Nothing runs until an action is called.
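
The split between transformations and actions can be sketched with a toy model in plain Python. This is not Spark's API: `LazyFrame` and its methods are hypothetical names that only mimic the behaviour described above, so the example runs without a cluster. Transformations record a step and return a new object; only `collect` and `count` walk the recorded plan.

```python
# Toy model of lazy evaluation in plain Python. This is NOT Spark's API;
# LazyFrame and its methods are hypothetical names for illustration only.

class LazyFrame:
    def __init__(self, rows, plan=()):
        self._rows = rows    # source data, never modified
        self._plan = plan    # recorded transformations, not yet executed

    # Transformations: record a step and return a new LazyFrame.
    def filter(self, predicate):
        return LazyFrame(self._rows, self._plan + (("filter", predicate),))

    def select(self, *columns):
        def project(row):
            return {c: row[c] for c in columns}
        return LazyFrame(self._rows, self._plan + (("map", project),))

    # Internal: run the recorded plan over every row.
    def _execute(self):
        for row in self._rows:
            keep = True
            for kind, fn in self._plan:
                if kind == "filter":
                    if not fn(row):
                        keep = False
                        break
                else:
                    row = fn(row)
            if keep:
                yield row

    # Actions: execute the plan and return results.
    def collect(self):
        return list(self._execute())

    def count(self):
        return sum(1 for _ in self._execute())


rows = [
    {"user": "a", "amount": 10},
    {"user": "b", "amount": 250},
    {"user": "c", "amount": 75},
]

calls = []  # records every time the predicate actually runs


def is_large(row):
    calls.append(row["user"])
    return row["amount"] > 50


pipeline = LazyFrame(rows).filter(is_large).select("user")
calls_before_action = len(calls)   # the predicate has not run yet
result = pipeline.collect()        # the action triggers execution
calls_after_action = len(calls)

# In this toy model a bad column name is noticed only when an action runs.
# (Real Spark usually rejects unknown columns earlier, at analysis time;
# runtime failures such as a failing UDF are what surface at the action.)
broken = LazyFrame(rows).select("no_such_column")   # no error here
try:
    broken.count()
    error_seen = False
except KeyError:
    error_seen = True

print(calls_before_action, calls_after_action, result, error_seen)
```

The toy version has no optimizer; in Spark, the recorded plan is what Catalyst gets to rewrite before any task runs.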

Why is my Spark job slow or running out of memory?

The usual cause is data skew or wide shuffles — joins and groupBy operations that move data across the cluster. Check for skewed keys, avoid `collect()` on large data, cache reused DataFrames, and let Adaptive Query Execution (AQE) tune partitions.
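
Checking for skewed keys comes down to counting rows per key and seeing how they land on partitions. The sketch below does that in plain Python rather than Spark: `partition_of` is a stand-in for hash partitioning, and `NUM_PARTITIONS`, `NUM_SALTS`, and the sample keys are made-up values. It also shows the effect of salting, one commonly used fix, where a suffix is appended to the key so a hot key spreads over several partitions.

```python
# Plain-Python sketch of key skew and salting. Not Spark code: partition_of
# mimics hash partitioning, and NUM_PARTITIONS / NUM_SALTS are made-up values.
import zlib
from collections import Counter

NUM_PARTITIONS = 8
NUM_SALTS = 8


def partition_of(key: str) -> int:
    # Deterministic hash so the example behaves the same on every run.
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


# Skewed input: one hot key holds most of the rows.
keys = ["hot"] * 9_000 + [f"user_{i}" for i in range(1_000)]

# Step 1: check for skewed keys by counting rows per key.
top_key, top_count = Counter(keys).most_common(1)[0]

# Step 2: rows per partition when partitioning on the raw key.
plain = Counter(partition_of(k) for k in keys)

# Step 3: salt the key with a suffix that cycles through NUM_SALTS values,
# so the hot key becomes NUM_SALTS distinct keys.
salted = Counter(
    partition_of(f"{k}#{i % NUM_SALTS}") for i, k in enumerate(keys)
)

largest_plain = max(plain.values())
largest_salted = max(salted.values())
print(f"hottest key: {top_key} ({top_count} rows)")
print(f"largest partition, raw keys:    {largest_plain}")
print(f"largest partition, salted keys: {largest_salted}")
```

In a real job, salting needs a second step that this sketch leaves out: a salted aggregation has to be combined again on the original key, and a salted join needs the other side replicated once per salt value.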

What's the difference between the Spark driver and executors?

The driver runs your program and builds the execution plan; executors are the worker processes across the cluster that run tasks on partitions of the data. Pulling too much data back to the driver (e.g. `collect()`) is a common source of out-of-memory errors.
