When should you use Spark instead of pandas, and what are the key trade-offs?

pandas operates in-memory on a single machine, making it fast and simple for datasets under a few gigabytes. Spark distributes computation across a cluster, handles terabyte-scale data, and integrates with cloud storage — but adds significant overhead for small data. The crossover point is roughly when your data no longer fits in RAM or when processing time on a single machine becomes unacceptable.

What is the difference between an RDD, a DataFrame, and a Dataset in Spark?

RDD is the low-level, type-safe distributed collection with no schema knowledge. DataFrame adds a named-column schema on top, enabling the Catalyst optimizer and codegen — but loses compile-time type safety. Dataset merges both worlds: it carries a schema and passes through Catalyst while remaining statically typed in Scala/Java.

Explain the Spark driver/executor model and what each component does.

The driver is a single JVM process that hosts the SparkContext, builds the DAG, schedules tasks, and coordinates results. Executors are JVM processes on worker nodes that actually run tasks and cache data. The cluster manager (YARN, Kubernetes, standalone) sits between them, allocating resources.

Compare Parquet, CSV, and Avro as big-data file formats — when do you use each?

Parquet is a columnar, compressed format optimized for analytical reads — only the queried columns are scanned. Avro is row-oriented, schema-embedded, and optimized for write-heavy pipelines and Kafka serialization. CSV is human-readable but schema-less, uncompressed, and slow at scale — use it only at system boundaries where a downstream tool requires it.

Databricks Jobs — productionizing your PySpark code

The last lesson left you with a MERGE that has to fire every night, retry if a node dies, and ship from version control instead of being clicked into a UI — and asked how you wrap reliable storage in a schedule that does all that. This lesson is that wrapper. Start from the gap it closes.

A notebook that works once is not a production job. A production job is one that runs every night at 2am, retries on failure, alerts the right people, depends on three upstream tables and feeds two downstream ones, and is reviewable in a pull request before it ships.

Databricks calls this layer Workflows (the runtime) and Jobs (the things being run). As of 2024, the canonical way to manage them is Databricks Asset Bundles: YAML-based infrastructure as code (IaC) that you check into Git alongside your code.

Two patterns: notebook jobs and wheel jobs

Every Databricks job is one of two shapes:

Notebook job — point the job at a notebook in the workspace. The job clones the notebook into a run, executes top-to-bottom, captures output. Fast to set up, terrible for version control: the notebook lives in the workspace UI, not in your repo (unless you use Repos / Git Folders).

Wheel job — package your Python code as a wheel, install it on the cluster with pip install, and call an entry point. Your code lives in src/, has tests, ships through CI. This is the only sustainable pattern for anything you’ll maintain longer than a quarter.

The same dichotomy applies to JVM jobs (JAR) and SQL jobs (a .sql file or query reference). For PySpark, wheels win.

# src/my_pipeline/transform.py
from pyspark.sql import SparkSession, functions as F

def run(date: str, output_table: str):
    spark = SparkSession.builder.getOrCreate()
    df = (spark.read.table("main.raw.events")
            .filter(F.col("event_date") == date)
            .groupBy("country")
            .agg(F.count("*").alias("events"),
                 F.countDistinct("user_id").alias("users")))
    df.write.mode("overwrite").saveAsTable(output_table)

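def main():
    # A wheel task calls its entry point with no arguments;
    # task parameters arrive on the command line via sys.argv.
    import sys
    run(sys.argv[1], sys.argv[2])

if __name__ == "__main__":
    main()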

Notice: a real function with arguments. Unit-testable. Importable from a notebook for debugging. This is what production looks like.
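One way to keep that entry point testable is to parse the command-line parameters in a small pure function, separate from the Spark logic. The sketch below is only an illustration: parse_args and the table name main.gold.daily_events are hypothetical, and it assumes the two positional parameters that run expects.

```python
import argparse

def parse_args(argv):
    """Turn a list of task parameters into named fields."""
    parser = argparse.ArgumentParser(prog="transform")
    parser.add_argument("date", help="partition date, e.g. 2024-06-01")
    parser.add_argument("output_table", help="fully qualified target table")
    return parser.parse_args(argv)

# Hypothetical parameters, as a wheel task would pass them on the command line
args = parse_args(["2024-06-01", "main.gold.daily_events"])
print(args.date, args.output_table)
```

Because parse_args takes a plain list, a unit test can exercise it without a cluster or a SparkSession.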

Asset Bundles — the IaC layer

A bundle is a directory with a databricks.yml file that declares your jobs, clusters, and artifacts. You deploy the bundle with one command; Databricks creates or updates everything to match.

A minimal bundle:

# databricks.yml
bundle:
  name: my-pipeline

artifacts:
  my_wheel:
    type: whl
    path: ./

variables:
  catalog:
    description: Target Unity Catalog catalog
    default: dev

resources:
  jobs:
    daily_pipeline:
      name: daily-pipeline-${var.catalog}
      tasks:
        - task_key: ingest
          python_wheel_task:
            package_name: my_pipeline
            entry_point: ingest
            parameters: ["${var.catalog}"]
          new_cluster:
            spark_version: "15.4.x-photon-scala2.12"
            node_type_id: "i3.xlarge"
            num_workers: 4

        - task_key: transform
          depends_on:
            - task_key: ingest
          python_wheel_task:
            package_name: my_pipeline
            entry_point: transform
            parameters: ["${var.catalog}"]
          new_cluster:
            spark_version: "15.4.x-photon-scala2.12"
            node_type_id: "i3.xlarge"
            num_workers: 4

        - task_key: notify
          depends_on:
            - task_key: transform
          notebook_task:
            notebook_path: ./notebooks/notify.py

      schedule:
        quartz_cron_expression: "0 0 2 * * ?"
        timezone_id: "UTC"

      email_notifications:
        on_failure: ["data-platform@example.com"]

targets:
  dev:
    workspace:
      host: https://dev.cloud.databricks.com
    variables:
      catalog: dev
  prod:
    workspace:
      host: https://prod.cloud.databricks.com
    variables:
      catalog: main

Deploying is one command per target:

databricks bundle deploy --target dev
databricks bundle deploy --target prod
databricks bundle run daily_pipeline --target prod

The bundle builds your wheel, uploads it to the workspace, creates the job, and wires up the schedule. If you change the YAML and re-deploy, it diffs and updates. If you delete a resource from YAML, the next deploy removes it from the workspace. The YAML is the source of truth — clicking around in the UI to “fix” a deployed job is now a code smell.
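The diff-and-update behaviour described above can be pictured as a comparison between the resources declared in YAML and the resources already deployed. This is a toy model under that assumption, not how the Databricks CLI is implemented; plan and the resource names other than daily_pipeline are hypothetical.

```python
def plan(desired, deployed):
    """Compare two dicts of resource name -> config and list the actions needed."""
    actions = []
    for name, config in desired.items():
        if name not in deployed:
            actions.append(("create", name))
        elif deployed[name] != config:
            actions.append(("update", name))
    # Anything deployed but no longer declared gets removed
    for name in deployed:
        if name not in desired:
            actions.append(("delete", name))
    return actions

deployed = {"daily_pipeline": {"num_workers": 2}, "old_job": {"num_workers": 1}}
desired = {"daily_pipeline": {"num_workers": 4}, "backfill": {"num_workers": 8}}
print(plan(desired, deployed))
```

The point of the model: the declared state always wins, which is why a manual UI change is overwritten by the next deploy.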

Task DAGs

A job is a directed acyclic graph (DAG) of tasks. Each task has a task_key and optional depends_on. The example above forms a linear chain: ingest -> transform -> notify.

You can fan out — multiple tasks depending on one parent — or fan in. The scheduler runs tasks in topological order, parallelizing independent branches. If a task fails, downstream tasks are skipped unless they’re marked to run anyway.

A common production pattern is the medallion layout: bronze ingest, silver clean, gold aggregate, then publish. Here a data_quality_check task runs alongside gold_agg, and an alert task runs only if data_quality_check fails (run_if: AT_LEAST_ONE_FAILED).
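To see how a scheduler might group such a DAG into parallel waves, here is a rough sketch using Python's standard graphlib module. The edges are an assumption chosen to fit the task names above (bronze_ingest, silver_clean and publish are hypothetical task keys), and the waves helper is illustrative only.

```python
from graphlib import TopologicalSorter

# Assumed medallion DAG: task -> set of tasks it depends on
deps = {
    "bronze_ingest": set(),
    "silver_clean": {"bronze_ingest"},
    "gold_agg": {"silver_clean"},
    "data_quality_check": {"silver_clean"},
    "publish": {"gold_agg"},
    "alert": {"data_quality_check"},
}

def waves(deps):
    """Group tasks into waves; tasks in one wave do not depend on each other."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    result = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything whose parents are done
        result.append(ready)
        sorter.done(*ready)
    return result

for i, wave in enumerate(waves(deps)):
    print(i, wave)
```

In this assumed graph, gold_agg and data_quality_check land in the same wave, which is the fan-out after silver_clean. The sketch ignores run_if, so alert appears as if it always ran.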

Parameter passing

Tasks need to share context — a partition date, a run ID, a target catalog. Two mechanisms:

Job parameters are set at the job level and are visible to all tasks via {{job.parameters.X}} substitution:

parameters:
  - name: run_date
    default: "{{job.start_time.iso_date}}"

tasks:
  - task_key: ingest
    python_wheel_task:
      parameters: ["{{job.parameters.run_date}}"]
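If the substitution feels opaque, a toy resolver makes the idea concrete. This is a simplified model of the {{job.parameters.X}} reference shown above, not Databricks' implementation; resolve and the --mode=full argument are hypothetical.

```python
import re

def resolve(template, job_parameters):
    """Replace {{job.parameters.NAME}} references with values from job_parameters."""
    pattern = re.compile(r"\{\{job\.parameters\.(\w+)\}\}")

    def lookup(match):
        name = match.group(1)
        if name not in job_parameters:
            raise KeyError(f"unknown job parameter: {name}")
        return str(job_parameters[name])

    return pattern.sub(lookup, template)

# Task arguments as declared, then as the task would receive them
task_args = ["{{job.parameters.run_date}}", "--mode=full"]
resolved = [resolve(a, {"run_date": "2024-06-01"}) for a in task_args]
print(resolved)
```

Every task that references run_date sees the same value, which is what makes a job-level parameter a good home for the partition date.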

Task values let a task emit a value that downstream tasks read:

# In the upstream task
from databricks.sdk.runtime import dbutils
dbutils.jobs.taskValues.set("row_count", str(df.count()))

# In a downstream task
count = dbutils.jobs.taskValues.get(
    taskKey="ingest", key="row_count", default="0"
)

Task values are the Databricks equivalent of Airflow XComs (cross-task communication slots): small bits of metadata that flow along the DAG edges.
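Outside a job run there is no task context, so for local tests you may want a stand-in for the task-value store. The class below is a hypothetical in-memory fake, not part of any Databricks library. Unlike the real set call, which knows the current task, this fake takes the task key explicitly.

```python
class FakeTaskValues:
    """In-memory stand-in keyed by (task_key, key); for local tests only."""

    def __init__(self):
        self._store = {}

    def set(self, task_key, key, value):
        self._store[(task_key, key)] = value

    def get(self, task_key, key, default=None):
        # Fall back to the default when the upstream task never set the key
        return self._store.get((task_key, key), default)

values = FakeTaskValues()
values.set("ingest", "row_count", "1250")
count = int(values.get("ingest", "row_count", default="0"))
missing = values.get("ingest", "bad_rows", default="0")
print(count, missing)
```

Passing a default keeps a downstream task from crashing when the upstream value is absent.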

Cluster strategy

Each task can declare new_cluster (a job cluster — created at task start, killed at end) or existing_cluster_id (use a long-running all-purpose cluster). The trade-off:

Pattern	When to use
Job cluster per task	Long tasks, isolated dependencies, lowest cost
Job cluster shared by tasks	Tasks share library versions; saves startup time
All-purpose cluster	Tasks need to be fast-to-start (small frequent jobs)

For most pipelines, one job cluster shared by all tasks is the sweet spot. Cluster startup adds 2-4 minutes, so you don’t want it once per task; an all-purpose cluster idling burns DBUs even when nothing runs. Shared job cluster: pay once per run.

job_clusters:
  - job_cluster_key: shared
    new_cluster:
      spark_version: "15.4.x-photon-scala2.12"
      node_type_id: "i3.xlarge"
      num_workers: 4

tasks:
  - task_key: ingest
    job_cluster_key: shared
    # ...
  - task_key: transform
    job_cluster_key: shared
    # ...
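The startup argument is simple arithmetic. As a back-of-the-envelope sketch, assume a chain of tasks that run one after another and a 3-minute startup (the midpoint of the 2-4 minute range quoted above); startup_overhead_minutes is a hypothetical helper.

```python
def startup_overhead_minutes(num_tasks, startup_minutes, shared):
    """Total cluster startup wait for tasks that run one after another."""
    clusters_started = 1 if shared else num_tasks
    return clusters_started * startup_minutes

per_task = startup_overhead_minutes(num_tasks=3, startup_minutes=3, shared=False)
shared = startup_overhead_minutes(num_tasks=3, startup_minutes=3, shared=True)
print(per_task, shared)
```

Under these assumptions a three-task chain waits 9 minutes with a cluster per task and 3 minutes with a shared job cluster.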

Retries and failure handling

Production tasks need retries (transient network failures, S3 5xx, cluster scaling issues). Add at the task level:

tasks:
  - task_key: ingest
    max_retries: 3
    min_retry_interval_millis: 120000   # 2 min between retries
    retry_on_timeout: true
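To make the settings concrete, here is a minimal retry loop that mirrors max_retries and a fixed wait between attempts. It is a sketch of the general technique, not the Workflows implementation; run_with_retries and flaky are hypothetical, and the sleep function is injectable so the example runs instantly.

```python
import time

def run_with_retries(fn, max_retries=3, min_retry_interval_s=120, sleep=time.sleep):
    """Call fn; on an exception, retry up to max_retries more times."""
    attempts = 0
    while True:
        attempts += 1
        try:
            return fn(), attempts
        except Exception:
            if attempts > max_retries:
                raise  # retries exhausted: surface the last error
            sleep(min_retry_interval_s)

calls = {"n": 0}

def flaky():
    """Fail twice with a transient error, then succeed."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "ok"

result, attempts = run_with_retries(flaky, sleep=lambda s: None)
print(result, attempts)
```

Note that max_retries counts retries, not attempts: with a value of 3, the task can run up to four times before it is marked failed.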

For domain-specific failures (data quality, schema drift) you want a failure task — a task that only runs if its parent failed:

- task_key: post_to_slack
  depends_on:
    - task_key: ingest
  run_if: AT_LEAST_ONE_FAILED
  notebook_task:
    notebook_path: ./notebooks/slack_alert.py

Combine with on-failure email/PagerDuty hooks at the job level for defense in depth.

A toy task scheduler

The DAG model is just topological sort plus a state machine. Here’s the shape, in Python:

# A toy job scheduler — tasks with dependencies, runs in topological order.

from collections import defaultdict, deque

def schedule(tasks):
    """tasks: dict of name -> {'deps': [...], 'fn': callable}"""
    in_degree = {n: len(t["deps"]) for n, t in tasks.items()}
    children = defaultdict(list)
    for n, t in tasks.items():
        for d in t["deps"]:
            children[d].append(n)

    ready = deque(n for n, deg in in_degree.items() if deg == 0)
    statuses = {}

    while ready:
        n = ready.popleft()
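        # Skip if any parent failed or was itself skipped, so a failure
        # propagates to every descendant, not just to direct children
        if any(statuses.get(d) in ("FAILED", "SKIPPED") for d in tasks[n]["deps"]):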
            statuses[n] = "SKIPPED"
            print(f"  {n}: SKIPPED (parent failed)")
        else:
            try:
                tasks[n]["fn"]()
                statuses[n] = "OK"
                print(f"  {n}: OK")
            except Exception as e:
                statuses[n] = "FAILED"
                print(f"  {n}: FAILED ({e})")

        for c in children[n]:
            in_degree[c] -= 1
            if in_degree[c] == 0:
                ready.append(c)

    return statuses


def fail():
    raise RuntimeError("bad rows")

tasks = {
    "ingest":    {"deps": [],                        "fn": lambda: print("    -> read raw")},
    "transform": {"deps": ["ingest"],                "fn": lambda: print("    -> clean")},
    "qa_check":  {"deps": ["transform"],             "fn": fail},
    "publish":   {"deps": ["transform", "qa_check"], "fn": lambda: print("    -> publish")},
    "alert":     {"deps": ["qa_check"],              "fn": lambda: print("    -> slack alert")},
}

print("Running DAG:")
schedule(tasks)

Running DAG:
    -> read raw
  ingest: OK
    -> clean
  transform: OK
  qa_check: FAILED (bad rows)
  publish: SKIPPED (parent failed)
  alert: SKIPPED (parent failed)

Trace the propagation: qa_check fails, and both its descendants — publish and alert — skip, because this toy uses the simplest rule (skip if any parent failed). Real Workflows is richer: a run_if: AT_LEAST_ONE_FAILED on alert would make it fire because the parent failed, which is how you wire up failure notifications. But the kernel is exactly this — topological execution with status propagation; Databricks adds retries, parameters, cluster management, and a UI on top.
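As a sketch of how a run_if rule could replace the toy's single skip rule, the function below decides whether a task runs from its parents' final statuses. It covers only three conditions and uses the toy's status names (OK, FAILED, SKIPPED); should_run is hypothetical, and the real semantics around skipped parents may differ.

```python
def should_run(run_if, parent_statuses):
    """Decide whether a task runs, given the final statuses of its parents."""
    if run_if == "ALL_SUCCESS":
        return all(s == "OK" for s in parent_statuses)
    if run_if == "AT_LEAST_ONE_FAILED":
        return any(s == "FAILED" for s in parent_statuses)
    if run_if == "ALL_DONE":
        return True  # parents have finished, whatever the outcome
    raise ValueError(f"unsupported run_if: {run_if}")

# publish depends on transform (OK) and qa_check (FAILED)
publish_runs = should_run("ALL_SUCCESS", ["OK", "FAILED"])
# alert depends on qa_check (FAILED)
alert_runs = should_run("AT_LEAST_ONE_FAILED", ["FAILED"])
print(publish_runs, alert_runs)
```

With this rule in place, publish still skips when qa_check fails, but alert fires, which is the failure-notification wiring described above.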

In one breath

A production job is reviewable, schedulable, and self-healing — which means it lives in code, not in a notebook UI. Package the logic as a wheel (a real src/ package with tests) rather than a notebook job, then declare the whole thing in an Asset Bundle (databricks.yml): a DAG of tasks wired by depends_on, a schedule, a shared job cluster to pay startup once, retries for transient failures, and a run_if: AT_LEAST_ONE_FAILED task for alerting. Deploy it with databricks bundle deploy --target prod, and the YAML — not the UI — becomes the source of truth. The mental kernel is a topological scheduler with status propagation; everything Databricks adds is production polish on that core.

Practice

Before the quiz, design the DAG: a pipeline ingests raw events, transforms them, runs a data-quality check, and publishes to BI — but only if the check passes — and Slacks the team if anything fails. Sketch the task_keys, the depends_on edges, and which task needs run_if: AT_LEAST_ONE_FAILED. Where do you put max_retries, and on which task would a retry be dangerous without an idempotent write?

Quick check


Q1. Why prefer wheel jobs over notebook jobs for production?

Q2. What's the role of `databricks.yml` (an Asset Bundle)?

Q3. Three tasks: ingest -> transform -> notify. You set `max_retries: 3` on `transform`. The `transform` task fails on attempt 2 but succeeds on attempt 3. What happens?

A question to carry forward

You can now write PySpark, store it reliably in Delta, and schedule it as a job that retries and alerts. But every pipeline we’ve scheduled so far ends by writing a table — rows in, rows out. What happens when the thing the job produces isn’t a table but a trained model? A model needs more than storage: it needs its metrics tracked across hundreds of experiments, its versions governed like any other UC object, and a way to serve predictions in milliseconds — not as a nightly batch. Does the same platform that ran your ETL also close the loop from training to live serving, and what does it take off your plate? That is MLflow on Databricks, and it is the next lesson — the last in this section.
