datarekha

Databricks — where Spark actually runs

The managed Spark platform that most data teams now deploy to. Workspaces, clusters, Unity Catalog, Photon — and an honest look at when it's worth the bill.

9 min read Intermediate PySpark Lesson 19 of 22

What you'll learn

  • The five pieces of the Databricks stack and how they connect
  • Why Photon makes Spark SQL 2-3x faster without rewriting code
  • Notebook vs SQL warehouse vs job — pick the right runtime
  • When Databricks is worth the price (and when EMR/Glue is enough)

Before you start

If you join a data team today and they say “we run on Spark,” the next sentence is almost always “…on Databricks.” Self-managed EMR clusters and bare YARN setups still exist, but the center of gravity has moved. The reason is simple: Databricks bundles managed Spark, governance, ML, and BI into one product. Databricks calls this pattern a lakehouse: a data lake with database-quality reliability and performance, rather than separate lake and warehouse systems. On top of that, the Photon engine has made it one of the fastest places to run Spark SQL.

This lesson is the map. What’s in the box, what each piece does, and the honest cost trade-off.

The five pieces

A Databricks deployment has these moving parts. You’ll touch all of them within a week of joining a team:

Piece         | What it is
Workspace     | The web UI: notebooks, files, jobs, dashboards
Compute       | Clusters (general-purpose) and SQL warehouses (BI-optimized)
Workflows     | The job scheduler: DAGs of tasks
Unity Catalog | Governance: tables, permissions, lineage
Delta Lake    | The storage format: Parquet with ACID on top

A Databricks account holds one or more workspaces (usually one per environment: dev / staging / prod). The account is where billing and identity live. The workspace is where you actually work. Unity Catalog spans across workspaces in the same account, which is what makes cross-environment governance possible.

Photon — the part that’s worth paying for

Open-source Spark uses a JVM-based execution engine. Databricks ships Photon, a vectorized query engine written in C++. Photon doesn’t replace Catalyst (the optimizer is the same), but when the physical plan can run on Photon, the per-row work happens in tight SIMD-friendly C++ loops instead of generated Java bytecode.

The payoff for the typical SQL or DataFrame job:

  • Aggregations and joins — 2-3x faster end-to-end
  • Window functions — often 3-5x faster
  • Parquet/Delta scans with predicate pushdown — close to wire speed

You opt in by checking “Use Photon” on the cluster config. Your code doesn’t change. The catch: Photon adds a surcharge (roughly 50-75% more DBUs than the same non-Photon cluster tier), so it pays off only when the speedup compensates. For ad-hoc exploration it usually does. For tiny jobs the overhead isn’t worth it.

# Same PySpark code. Photon makes the physical plan use the C++ engine.
from pyspark.sql import functions as F

orders = spark.read.table("main.sales.orders")

revenue = (orders
    .filter(F.col("status") == "PAID")
    .groupBy("country")
    .agg(F.sum(F.col("amount") - F.col("discount")).alias("revenue"))
    .orderBy(F.desc("revenue")))

revenue.write.mode("overwrite").saveAsTable("main.sales.revenue_by_country")

Nothing in that snippet says “Photon.” The cluster decides; Catalyst emits a Photon physical plan when it can.
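
One way to see whether a query actually ran on Photon is to look at the physical plan that `revenue.explain()` prints. The helper below is a minimal sketch that scans plan text for operator names. It assumes Photon operators show up with a `Photon` prefix; the function name `photon_operators` is hypothetical, and `sample_plan` is made-up illustrative text, not real output.

```python
import re


def photon_operators(plan_text: str) -> list[str]:
    """Return the distinct operator names in a plan that start with 'Photon'."""
    return sorted(set(re.findall(r"\bPhoton[A-Za-z]+", plan_text)))


# Illustrative plan text only (an assumption about the general shape,
# not copied from a real cluster).
sample_plan = """
== Physical Plan ==
ColumnarToRow
+- PhotonResultStage
   +- PhotonSort
      +- PhotonGroupingAgg
         +- PhotonScan parquet main.sales.orders
"""

print(photon_operators(sample_plan))
# A plan with no Photon-prefixed operators yields an empty list.
print(photon_operators("HashAggregate\n+- FileScan parquet"))
```

If the list comes back empty for a plan you expected to be accelerated, some operator in the query probably fell back to the JVM engine.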

Three runtimes, three personalities

Databricks gives you three places to run code. Each has its own sweet spot:

Notebooks run on a general-purpose cluster. They’re great for exploration, ML, and any job that mixes PySpark, SQL, and Python. The cell-based UI is your day-to-day.

SQL warehouses are clusters tuned for BI — fast startup, Photon on by default, integrated with the SQL Editor. You’d point Tableau, Power BI, or Looker at a SQL warehouse, not at a notebook cluster.

Jobs are scheduled or triggered runs of notebooks, Python wheels, or JARs. Production work lives here. You’ll learn the full pattern in the Databricks jobs lesson.
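
The rules of thumb above can be written down as a tiny decision function. This is only a sketch: `pick_runtime` and its two flags are hypothetical names, and real workloads often need more nuance than two booleans.

```python
def pick_runtime(*, bi_dashboard: bool, scheduled: bool) -> str:
    """Map a workload to one of the three runtimes described above."""
    if bi_dashboard:
        # BI tools (Tableau, Power BI, Looker) point at a SQL warehouse.
        return "sql_warehouse"
    if scheduled:
        # Scheduled or triggered production work runs as a job.
        return "job"
    # Everything interactive: exploration, ML, mixed PySpark/SQL/Python.
    return "notebook"


print(pick_runtime(bi_dashboard=True, scheduled=False))
print(pick_runtime(bi_dashboard=False, scheduled=True))
print(pick_runtime(bi_dashboard=False, scheduled=False))
```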

Unity Catalog — the governance layer

Pre-Unity Catalog, every workspace had its own Hive metastore. Tables were scoped to one workspace. Permissions were a mess. Lineage didn’t exist.

Unity Catalog (UC) replaced that. The three-level namespace is catalog.schema.table — and it’s the same across all workspaces in the account:

# A UC-qualified table reference. Works from any workspace in the account.
df = spark.read.table("main.sales.orders")

# Permissions live in UC, not in the cluster
spark.sql("GRANT SELECT ON main.sales.orders TO `analyst-group`")

UC gives you column-level lineage (which downstream tables touched orders.customer_id?), row-level security, audit logs, and one permission model for tables, models, and volumes (managed file storage). If you’re starting a Databricks workspace today, UC is the default — the old Hive metastore is deprecated.
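
To make the three-level namespace concrete, here is a minimal pure-Python sketch that splits a name into its catalog, schema, and table parts. `TableName` and `parse_table_name` are hypothetical helpers, not part of any Databricks API, and the sketch deliberately ignores backtick-quoted identifiers that contain dots.

```python
from typing import NamedTuple


class TableName(NamedTuple):
    catalog: str
    schema: str
    table: str

    def qualified(self) -> str:
        """Rebuild the dotted catalog.schema.table form."""
        return f"{self.catalog}.{self.schema}.{self.table}"


def parse_table_name(name: str) -> TableName:
    """Split 'catalog.schema.table'; reject names that are not three-level."""
    parts = name.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected catalog.schema.table, got {name!r}")
    return TableName(*parts)


orders = parse_table_name("main.sales.orders")
print(orders.catalog, orders.schema, orders.table)
```

Rejecting two-level names up front is a design choice: it forces every reference to spell out the catalog instead of relying on a session default.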

The CLI — IaC for Databricks

Anything you can do in the UI you can do with the databricks CLI. You’ll use it to deploy code, manage clusters from CI, and ship Asset Bundles (the modern way to version-control jobs):

# Configure once
databricks configure --token

# List the objects (notebooks, folders) at the workspace root
databricks workspace list /

# Run a job from CI (the current CLI, which also provides `bundle`, takes the job ID as a positional argument)
databricks jobs run-now 12345

# Deploy a bundle (see databricks-jobs lesson)
databricks bundle deploy --target prod

For anything past the prototype stage, you want jobs and clusters defined in YAML and deployed from CI — not clicked into existence in the UI.

The cost question

Databricks bills in DBUs (Databricks Units), a normalized measure of the compute a cluster consumes per hour, plus the underlying cloud VM cost. A DBU on a Photon job cluster runs roughly $0.30-0.55 depending on tier. A small Photon-enabled i3.xlarge cluster can easily burn $5-10/hour.

That’s expensive compared to running your own EMR or Glue jobs. The math works when:

  • You have heavy production ETL, where Photon and adaptive query execution (AQE) save more hours than you spend on DBUs
  • You need ML + BI + ETL on one platform — paying for three separate stacks (Snowflake + SageMaker + Airflow) usually costs more
  • You value governance — UC is a real moat over rolling your own
  • Your team is small — Databricks operates the platform, you don’t
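
The first bullet is the one you can check with arithmetic. The sketch below models the hourly bill as a DBU charge plus a VM charge, and treats the Photon surcharge as a multiplier on the DBU part only. That split is an assumption, the function names are hypothetical, and the numbers are picked for easy arithmetic rather than taken from a price list.

```python
def hourly_cost(nodes: int, dbus_per_node_hour: float, price_per_dbu: float,
                vm_price_per_hour: float, dbu_multiplier: float = 1.0) -> float:
    """Hourly bill in dollars: DBU charge plus cloud VM charge, over all nodes."""
    dbu_charge = dbus_per_node_hour * dbu_multiplier * price_per_dbu
    return nodes * (dbu_charge + vm_price_per_hour)


def break_even_speedup(plain_rate: float, photon_rate: float) -> float:
    """Speedup Photon must deliver for a run to cost the same as without it."""
    return photon_rate / plain_rate


# Hypothetical inputs, not real prices.
plain = hourly_cost(nodes=4, dbus_per_node_hour=1.0,
                    price_per_dbu=0.5, vm_price_per_hour=0.5)
photon = hourly_cost(nodes=4, dbus_per_node_hour=1.0,
                     price_per_dbu=0.5, vm_price_per_hour=0.5,
                     dbu_multiplier=1.5)  # assumed 50% DBU surcharge

print(plain, photon)                      # 4.0 5.0
print(break_even_speedup(plain, photon))  # 1.25

# A 2-hour job that Photon runs twice as fast:
print(2.0 * plain, 1.0 * photon)          # 8.0 5.0
```

Because the VM charge does not carry the surcharge, the break-even speedup is lower than the DBU multiplier. With these inputs, any job that Photon speeds up by more than 1.25x ends up cheaper.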

Databricks is the wrong answer when:

  • Your workload is small (a few hundred GB), where a serverless warehouse such as Snowflake or BigQuery will be cheaper and simpler
  • You only need batch ETL — Glue or EMR Serverless can do it for less
  • You’re cost-sensitive in early stage — Databricks is hard to make cheap

Simulating the cluster model

The cluster ↔ workspace ↔ catalog model is just a layered name resolver — much like a filesystem. Here’s the shape, in pure Python:
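
A minimal sketch of that layered resolver follows. Every class and method name here is hypothetical and exists only to illustrate the layering; none of it is a Databricks API.

```python
from dataclasses import dataclass, field


@dataclass
class UnityCatalog:
    """Account-level namespace: tables and grants shared by every workspace."""
    tables: dict = field(default_factory=dict)  # "catalog.schema.table" -> rows
    grants: dict = field(default_factory=dict)  # "catalog.schema.table" -> principals

    def create_table(self, name, rows):
        self.tables[name] = rows
        self.grants.setdefault(name, set())

    def grant_select(self, name, principal):
        self.grants[name].add(principal)

    def read(self, name, principal):
        if name not in self.tables:
            raise KeyError(f"table not found: {name}")
        if principal not in self.grants[name]:
            raise PermissionError(f"{principal} lacks SELECT on {name}")
        return self.tables[name]


@dataclass
class Cluster:
    """Compute only: holds no tables and no permissions of its own."""
    name: str


@dataclass
class Workspace:
    """Where you work: owns clusters, delegates name resolution to the catalog."""
    name: str
    catalog: UnityCatalog
    clusters: list = field(default_factory=list)

    def read_table(self, table, principal, cluster):
        if cluster not in self.clusters:
            raise ValueError(f"{cluster.name} is not attached to {self.name}")
        return self.catalog.read(table, principal)


uc = UnityCatalog()
uc.create_table("main.sales.orders", [{"country": "IN", "amount": 120}])
uc.grant_select("main.sales.orders", "analyst-group")

dev = Workspace("dev", uc, [Cluster("dev-small")])
prod = Workspace("prod", uc, [Cluster("prod-etl")])

rows_dev = dev.read_table("main.sales.orders", "analyst-group", dev.clusters[0])
rows_prod = prod.read_table("main.sales.orders", "analyst-group", prod.clusters[0])
print(rows_dev == rows_prod)  # True: both workspaces resolve the same table
```

Swapping the cluster changes nothing about what a principal can read, because the grant lives in the catalog object that both workspaces share.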

That’s the mental model: clusters are the compute, Unity Catalog is the shared namespace, and any workspace pointed at the same UC sees the same tables with the same permissions.

Quick check

Q1. What does enabling Photon on a cluster do to your PySpark code?
Q2. You need a low-latency BI dashboard pointed at a Delta table. Which Databricks runtime is the right pick?
Q3. Why has Unity Catalog largely replaced the per-workspace Hive metastore?

Next

Now you know the platform. The next layer is the storage format that makes the rest of it possible — Delta Lake. ACID transactions on Parquet, time travel, and the MERGE operation that turned upsert from a nightmare into a one-liner.

Practice this in an interview

When should you use Spark instead of pandas, and what are the key trade-offs?

pandas operates in-memory on a single machine, making it fast and simple for datasets under a few gigabytes. Spark distributes computation across a cluster, handles terabyte-scale data, and integrates with cloud storage — but adds significant overhead for small data. The crossover point is roughly when your data no longer fits in RAM or when processing time on a single machine becomes unacceptable.

What is the difference between an RDD, a DataFrame, and a Dataset in Spark?

RDD is the low-level, type-safe distributed collection with no schema knowledge. DataFrame adds a named-column schema on top, enabling the Catalyst optimizer and codegen — but loses compile-time type safety. Dataset merges both worlds: it carries a schema and passes through Catalyst while remaining statically typed in Scala/Java.

Explain the Spark driver/executor model and what each component does.

The driver is a single JVM process that hosts the SparkContext, builds the DAG, schedules tasks, and coordinates results. Executors are JVM processes on worker nodes that actually run tasks and cache data. The cluster manager (YARN, Kubernetes, standalone) sits between them, allocating resources.

Compare Parquet, CSV, and Avro as big-data file formats — when do you use each?

Parquet is a columnar, compressed format optimized for analytical reads — only the queried columns are scanned. Avro is row-oriented, schema-embedded, and optimized for write-heavy pipelines and Kafka serialization. CSV is human-readable but schema-less, uncompressed, and slow at scale — use it only at system boundaries where a downstream tool requires it.
