datarekha

Databricks Jobs — productionizing your PySpark code

Notebooks are great for prototyping, terrible for production. Here's how teams actually ship Spark jobs on Databricks — Asset Bundles, task DAGs, and CI/CD.

10 min read Intermediate PySpark Lesson 21 of 22

What you'll learn

  • Notebook jobs vs wheel/JAR jobs — the real trade-off
  • Databricks Asset Bundles (databricks.yml) — the current IaC standard
  • Task DAGs, parameter passing, conditional runs
  • Cluster reuse, retries, and on-failure handlers

Before you start

A notebook that works once is not a production job. A production job is one that runs every night at 2am, retries on failure, alerts the right people, depends on three upstream tables and feeds two downstream ones, and is reviewable in a pull request before it ships.

Databricks calls this layer Workflows (the runtime) and Jobs (the things being run). And as of 2024, the canonical way to manage them is Asset Bundles — a YAML-based IaC that you check into Git alongside your code.

Two patterns: notebook jobs and wheel jobs

Every Databricks job is one of two shapes:

Notebook job — point the job at a notebook in the workspace. The job clones the notebook into a run, executes top-to-bottom, captures output. Fast to set up, terrible for version control: the notebook lives in the workspace UI, not in your repo (unless you use Repos / Git Folders).

Wheel job — package your Python code as a wheel, install it on the cluster with pip install, and call an entry point. Your code lives in src/, has tests, ships through CI. This is the only sustainable pattern for anything you’ll maintain longer than a quarter.

The same dichotomy applies to JVM jobs (JAR) and SQL jobs (a .sql file or query reference). For PySpark, wheels win.

# src/my_pipeline/transform.py
from pyspark.sql import SparkSession, functions as F

def run(date: str, output_table: str):
    spark = SparkSession.builder.getOrCreate()
    df = (spark.read.table("main.raw.events")
            .filter(F.col("event_date") == date)
            .groupBy("country")
            .agg(F.count("*").alias("events"),
                 F.countDistinct("user_id").alias("users")))
    df.write.mode("overwrite").saveAsTable(output_table)

if __name__ == "__main__":
    import sys
    run(sys.argv[1], sys.argv[2])

Notice: a real function with arguments. Unit-testable. Importable from a notebook for debugging. This is what production looks like.

Asset Bundles — the IaC layer

A bundle is a directory with a databricks.yml file that declares your jobs, clusters, and artifacts. You deploy the bundle with one command; Databricks creates or updates everything to match.

A minimal bundle:

# databricks.yml
bundle:
  name: my-pipeline

artifacts:
  my_wheel:
    type: whl
    path: ./

variables:
  catalog:
    description: Target Unity Catalog catalog
    default: dev

resources:
  jobs:
    daily_pipeline:
      name: daily-pipeline-${var.catalog}
      tasks:
        - task_key: ingest
          python_wheel_task:
            package_name: my_pipeline
            entry_point: ingest
            parameters: ["${var.catalog}"]
          new_cluster:
            spark_version: "15.4.x-photon-scala2.12"
            node_type_id: "i3.xlarge"
            num_workers: 4

        - task_key: transform
          depends_on:
            - task_key: ingest
          python_wheel_task:
            package_name: my_pipeline
            entry_point: transform
            parameters: ["${var.catalog}"]
          new_cluster:
            spark_version: "15.4.x-photon-scala2.12"
            node_type_id: "i3.xlarge"
            num_workers: 4

        - task_key: notify
          depends_on:
            - task_key: transform
          notebook_task:
            notebook_path: ./notebooks/notify.py

      schedule:
        quartz_cron_expression: "0 0 2 * * ?"
        timezone_id: "UTC"

      email_notifications:
        on_failure: ["data-platform@example.com"]

targets:
  dev:
    workspace:
      host: https://dev.cloud.databricks.com
    variables:
      catalog: dev
  prod:
    workspace:
      host: https://prod.cloud.databricks.com
    variables:
      catalog: main

Deploying is one command per target:

databricks bundle deploy --target dev
databricks bundle deploy --target prod
databricks bundle run daily_pipeline --target prod

The bundle builds your wheel, uploads it to the workspace, creates the job, and wires up the schedule. If you change the YAML and re-deploy, it diffs and updates. If you delete a resource from YAML, the next deploy removes it from the workspace. The YAML is the source of truth — clicking around in the UI to “fix” a deployed job is now a code smell.

Task DAGs

A job is a directed acyclic graph (DAG) of tasks. Each task has a task_key and optional depends_on. The example above forms:

ingesttransformnotify

You can fan out — multiple tasks depending on one parent — or fan in. The scheduler runs tasks in topological order, parallelizing independent branches. If a task fails, downstream tasks are skipped unless they’re marked to run anyway.

A common production pattern is medallion — bronze ingest, silver clean, gold aggregate, then publish:

bronze_ingestsilver_cleangold_aggpublish_to_bidata_quality_checkalerton failure

The data_quality_check runs alongside gold_agg; the alert task runs only if data_quality_check fails (run_if: AT_LEAST_ONE_FAILED).

Parameter passing

Tasks need to share context — a partition date, a run ID, a target catalog. Two mechanisms:

Job parameters are set at job-level and visible to all tasks via ${{job.parameters.X}} substitution:

parameters:
  - name: run_date
    default: "{{job.start_time.iso_date}}"

tasks:
  - task_key: ingest
    python_wheel_task:
      parameters: ["{{job.parameters.run_date}}"]

Task values let a task emit a value that downstream tasks read:

# In the upstream task
from databricks.sdk.runtime import dbutils
dbutils.jobs.taskValues.set("row_count", str(df.count()))

# In a downstream task
count = dbutils.jobs.taskValues.get(
    taskKey="ingest", key="row_count", default="0"
)

The task values are the equivalent of Airflow XComs (cross-task communication slots) — small bits of metadata that flow along the DAG edges.

Cluster strategy

Each task can declare new_cluster (a job cluster — created at task start, killed at end) or existing_cluster_id (use a long-running all-purpose cluster). The trade-off:

PatternWhen to use
Job cluster per taskLong tasks, isolated dependencies, lowest cost
Job cluster shared by tasksTasks share library versions; saves startup time
All-purpose clusterTasks need to be fast-to-start (small frequent jobs)

For most pipelines, one job cluster shared by all tasks is the sweet spot. Cluster startup adds 2-4 minutes, so you don’t want it once per task; an all-purpose cluster idling burns DBUs even when nothing runs. Shared job cluster: pay once per run.

job_clusters:
  - job_cluster_key: shared
    new_cluster:
      spark_version: "15.4.x-photon-scala2.12"
      node_type_id: "i3.xlarge"
      num_workers: 4

tasks:
  - task_key: ingest
    job_cluster_key: shared
    # ...
  - task_key: transform
    job_cluster_key: shared
    # ...

Retries and failure handling

Production tasks need retries (transient network failures, S3 5xx, cluster scaling issues). Add at the task level:

tasks:
  - task_key: ingest
    max_retries: 3
    min_retry_interval_millis: 120000   # 2 min between retries
    retry_on_timeout: true

For domain-specific failures (data quality, schema drift) you want a failure task — a task that only runs if its parent failed:

- task_key: post_to_slack
  depends_on:
    - task_key: ingest
  run_if: AT_LEAST_ONE_FAILED
  notebook_task:
    notebook_path: ./notebooks/slack_alert.py

Combine with on-failure email/PagerDuty hooks at the job level for defense in depth.

A toy task scheduler

The DAG model is just topological sort plus a state machine. Here’s the shape, in Python:

That’s the model. Databricks Workflows adds retries, parameter passing, cluster management, and a UI on top — but the kernel is topological execution with status propagation.

Quick check

Quick check

0/3
Q1Why prefer wheel jobs over notebook jobs for production?
Q2What's the role of `databricks.yml` (an Asset Bundle)?
Q3Three tasks: ingest -> transform -> notify. You set `max_retries: 3` on `transform`. The `transform` task fails on attempt 2 but succeeds on attempt 3 — what happens?

Next

You can write PySpark, store it in Delta, and schedule it as a job. The last piece is the ML side of Databricks — how MLflow integrates into the workspace, how models register into Unity Catalog, and how to deploy a model behind a serving endpoint.

Sign in to track your progress

Completed lessons, your XP, level, and streak save to your account — it's free and takes a few seconds.

Practice this in an interview

All questions
When should you use Spark instead of pandas, and what are the key trade-offs?

pandas operates in-memory on a single machine, making it fast and simple for datasets under a few gigabytes. Spark distributes computation across a cluster, handles terabyte-scale data, and integrates with cloud storage — but adds significant overhead for small data. The crossover point is roughly when your data no longer fits in RAM or when processing time on a single machine becomes unacceptable.

What is the difference between an RDD, a DataFrame, and a Dataset in Spark?

RDD is the low-level, type-safe distributed collection with no schema knowledge. DataFrame adds a named-column schema on top, enabling the Catalyst optimizer and codegen — but loses compile-time type safety. Dataset merges both worlds: it carries a schema and passes through Catalyst while remaining statically typed in Scala/Java.

Explain the Spark driver/executor model and what each component does.

The driver is a single JVM process that hosts the SparkContext, builds the DAG, schedules tasks, and coordinates results. Executors are JVM processes on worker nodes that actually run tasks and cache data. The cluster manager (YARN, Kubernetes, standalone) sits between them, allocating resources.

Compare Parquet, CSV, and Avro as big-data file formats — when do you use each?

Parquet is a columnar, compressed format optimized for analytical reads — only the queried columns are scanned. Avro is row-oriented, schema-embedded, and optimized for write-heavy pipelines and Kafka serialization. CSV is human-readable but schema-less, uncompressed, and slow at scale — use it only at system boundaries where a downstream tool requires it.

Related lessons

Explore further

Skip to content