Databricks Jobs — productionizing your PySpark code
Notebooks are great for prototyping, terrible for production. Here's how teams actually ship Spark jobs on Databricks — Asset Bundles, task DAGs, and CI/CD.
What you'll learn
- Notebook jobs vs wheel/JAR jobs — the real trade-off
- Databricks Asset Bundles (databricks.yml) — the current IaC standard
- Task DAGs, parameter passing, conditional runs
- Cluster reuse, retries, and on-failure handlers
Before you start
A notebook that works once is not a production job. A production job is one that runs every night at 2am, retries on failure, alerts the right people, depends on three upstream tables and feeds two downstream ones, and is reviewable in a pull request before it ships.
Databricks calls this layer Workflows (the runtime) and Jobs (the things being run). And as of 2024, the canonical way to manage them is Asset Bundles — a YAML-based IaC that you check into Git alongside your code.
Two patterns: notebook jobs and wheel jobs
Every Databricks job is one of two shapes:
Notebook job — point the job at a notebook in the workspace. The job clones the notebook into a run, executes top-to-bottom, captures output. Fast to set up, terrible for version control: the notebook lives in the workspace UI, not in your repo (unless you use Repos / Git Folders).
Wheel job — package your Python code as a wheel, install it on
the cluster with pip install, and call an entry point. Your code
lives in src/, has tests, ships through CI. This is the only
sustainable pattern for anything you’ll maintain longer than a quarter.
The same dichotomy applies to JVM jobs (JAR) and SQL jobs (a .sql
file or query reference). For PySpark, wheels win.
# src/my_pipeline/transform.py
from pyspark.sql import SparkSession, functions as F
def run(date: str, output_table: str):
spark = SparkSession.builder.getOrCreate()
df = (spark.read.table("main.raw.events")
.filter(F.col("event_date") == date)
.groupBy("country")
.agg(F.count("*").alias("events"),
F.countDistinct("user_id").alias("users")))
df.write.mode("overwrite").saveAsTable(output_table)
if __name__ == "__main__":
import sys
run(sys.argv[1], sys.argv[2])
Notice: a real function with arguments. Unit-testable. Importable from a notebook for debugging. This is what production looks like.
Asset Bundles — the IaC layer
A bundle is a directory with a databricks.yml file that declares
your jobs, clusters, and artifacts. You deploy the bundle with one
command; Databricks creates or updates everything to match.
A minimal bundle:
# databricks.yml
bundle:
name: my-pipeline
artifacts:
my_wheel:
type: whl
path: ./
variables:
catalog:
description: Target Unity Catalog catalog
default: dev
resources:
jobs:
daily_pipeline:
name: daily-pipeline-${var.catalog}
tasks:
- task_key: ingest
python_wheel_task:
package_name: my_pipeline
entry_point: ingest
parameters: ["${var.catalog}"]
new_cluster:
spark_version: "15.4.x-photon-scala2.12"
node_type_id: "i3.xlarge"
num_workers: 4
- task_key: transform
depends_on:
- task_key: ingest
python_wheel_task:
package_name: my_pipeline
entry_point: transform
parameters: ["${var.catalog}"]
new_cluster:
spark_version: "15.4.x-photon-scala2.12"
node_type_id: "i3.xlarge"
num_workers: 4
- task_key: notify
depends_on:
- task_key: transform
notebook_task:
notebook_path: ./notebooks/notify.py
schedule:
quartz_cron_expression: "0 0 2 * * ?"
timezone_id: "UTC"
email_notifications:
on_failure: ["data-platform@example.com"]
targets:
dev:
workspace:
host: https://dev.cloud.databricks.com
variables:
catalog: dev
prod:
workspace:
host: https://prod.cloud.databricks.com
variables:
catalog: main
Deploying is one command per target:
databricks bundle deploy --target dev
databricks bundle deploy --target prod
databricks bundle run daily_pipeline --target prod
The bundle builds your wheel, uploads it to the workspace, creates the job, and wires up the schedule. If you change the YAML and re-deploy, it diffs and updates. If you delete a resource from YAML, the next deploy removes it from the workspace. The YAML is the source of truth — clicking around in the UI to “fix” a deployed job is now a code smell.
Task DAGs
A job is a directed acyclic graph (DAG) of tasks. Each task has a
task_key and optional depends_on. The example above forms:
You can fan out — multiple tasks depending on one parent — or fan in. The scheduler runs tasks in topological order, parallelizing independent branches. If a task fails, downstream tasks are skipped unless they’re marked to run anyway.
A common production pattern is medallion — bronze ingest, silver clean, gold aggregate, then publish:
The data_quality_check runs alongside gold_agg; the alert task
runs only if data_quality_check fails (run_if: AT_LEAST_ONE_FAILED).
Parameter passing
Tasks need to share context — a partition date, a run ID, a target catalog. Two mechanisms:
Job parameters are set at job-level and visible to all tasks via
${{job.parameters.X}} substitution:
parameters:
- name: run_date
default: "{{job.start_time.iso_date}}"
tasks:
- task_key: ingest
python_wheel_task:
parameters: ["{{job.parameters.run_date}}"]
Task values let a task emit a value that downstream tasks read:
# In the upstream task
from databricks.sdk.runtime import dbutils
dbutils.jobs.taskValues.set("row_count", str(df.count()))
# In a downstream task
count = dbutils.jobs.taskValues.get(
taskKey="ingest", key="row_count", default="0"
)
The task values are the equivalent of Airflow XComs (cross-task communication slots) — small bits of metadata that flow along the DAG edges.
Cluster strategy
Each task can declare new_cluster (a job cluster — created at task
start, killed at end) or existing_cluster_id (use a long-running
all-purpose cluster). The trade-off:
| Pattern | When to use |
|---|---|
| Job cluster per task | Long tasks, isolated dependencies, lowest cost |
| Job cluster shared by tasks | Tasks share library versions; saves startup time |
| All-purpose cluster | Tasks need to be fast-to-start (small frequent jobs) |
For most pipelines, one job cluster shared by all tasks is the sweet spot. Cluster startup adds 2-4 minutes, so you don’t want it once per task; an all-purpose cluster idling burns DBUs even when nothing runs. Shared job cluster: pay once per run.
job_clusters:
- job_cluster_key: shared
new_cluster:
spark_version: "15.4.x-photon-scala2.12"
node_type_id: "i3.xlarge"
num_workers: 4
tasks:
- task_key: ingest
job_cluster_key: shared
# ...
- task_key: transform
job_cluster_key: shared
# ...
Retries and failure handling
Production tasks need retries (transient network failures, S3 5xx, cluster scaling issues). Add at the task level:
tasks:
- task_key: ingest
max_retries: 3
min_retry_interval_millis: 120000 # 2 min between retries
retry_on_timeout: true
For domain-specific failures (data quality, schema drift) you want a failure task — a task that only runs if its parent failed:
- task_key: post_to_slack
depends_on:
- task_key: ingest
run_if: AT_LEAST_ONE_FAILED
notebook_task:
notebook_path: ./notebooks/slack_alert.py
Combine with on-failure email/PagerDuty hooks at the job level for defense in depth.
A toy task scheduler
The DAG model is just topological sort plus a state machine. Here’s the shape, in Python:
That’s the model. Databricks Workflows adds retries, parameter passing, cluster management, and a UI on top — but the kernel is topological execution with status propagation.
Quick check
Quick check
Next
You can write PySpark, store it in Delta, and schedule it as a job. The last piece is the ML side of Databricks — how MLflow integrates into the workspace, how models register into Unity Catalog, and how to deploy a model behind a serving endpoint.
Practice this in an interview
All questionspandas operates in-memory on a single machine, making it fast and simple for datasets under a few gigabytes. Spark distributes computation across a cluster, handles terabyte-scale data, and integrates with cloud storage — but adds significant overhead for small data. The crossover point is roughly when your data no longer fits in RAM or when processing time on a single machine becomes unacceptable.
RDD is the low-level, type-safe distributed collection with no schema knowledge. DataFrame adds a named-column schema on top, enabling the Catalyst optimizer and codegen — but loses compile-time type safety. Dataset merges both worlds: it carries a schema and passes through Catalyst while remaining statically typed in Scala/Java.
The driver is a single JVM process that hosts the SparkContext, builds the DAG, schedules tasks, and coordinates results. Executors are JVM processes on worker nodes that actually run tasks and cache data. The cluster manager (YARN, Kubernetes, standalone) sits between them, allocating resources.
Parquet is a columnar, compressed format optimized for analytical reads — only the queried columns are scanned. Avro is row-oriented, schema-embedded, and optimized for write-heavy pipelines and Kafka serialization. CSV is human-readable but schema-less, uncompressed, and slow at scale — use it only at system boundaries where a downstream tool requires it.