Question 1

What is the difference between ETL and ELT, and when should you choose each?

Accepted Answer

ETL transforms data before loading it into the destination, which was necessary when warehouses were expensive and compute-constrained. ELT loads raw data first and transforms inside the warehouse, leveraging cheap cloud compute and making raw data available for reprocessing. ELT is the default in modern cloud stacks; ETL still makes sense when you must mask sensitive fields before they ever land in the warehouse.
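To make the ordering difference concrete, here is a minimal sketch using Python's built-in sqlite3 as a stand-in for the warehouse. The table names, the sample rows, and the `mask` helper are assumptions made for illustration; a truncated hash is not a production-grade masking scheme.

```python
import hashlib
import sqlite3

# Hypothetical source rows; the email column stands in for a sensitive field.
source_rows = [(1, "ada@example.com", 120.0), (2, "bob@example.com", 80.0)]

def mask(value: str) -> str:
    """Replace a sensitive value with a short digest (illustration only)."""
    return hashlib.sha256(value.encode()).hexdigest()[:12]

warehouse = sqlite3.connect(":memory:")  # stand-in for the warehouse

# ETL: transform (mask) in the pipeline, then load only the masked result.
warehouse.execute("CREATE TABLE orders_etl (id INTEGER, email_hash TEXT, amount REAL)")
warehouse.executemany(
    "INSERT INTO orders_etl VALUES (?, ?, ?)",
    [(i, mask(email), amount) for i, email, amount in source_rows],
)

# ELT: load the raw rows as-is, then transform with SQL inside the warehouse.
warehouse.execute("CREATE TABLE orders_raw (id INTEGER, email TEXT, amount REAL)")
warehouse.executemany("INSERT INTO orders_raw VALUES (?, ?, ?)", source_rows)
warehouse.execute(
    "CREATE TABLE order_totals AS "
    "SELECT COUNT(*) AS n, SUM(amount) AS total FROM orders_raw"
)

etl_emails = [r[0] for r in warehouse.execute("SELECT email_hash FROM orders_etl")]
elt_totals = warehouse.execute("SELECT n, total FROM order_totals").fetchone()
```

In the ETL path the raw email never reaches the warehouse; in the ELT path it does, which is what keeps the raw table available for later reprocessing.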

Question 2

What is lazy evaluation in Spark, and how do transformations differ from actions?

Accepted Answer

Spark does not execute any computation when you call a transformation — it builds a DAG of logical steps. Only when you call an action does Spark compile that DAG into physical tasks and execute them. This design lets Catalyst optimize the full query plan before touching any data.
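The record-now, run-later behaviour can be imitated in a few lines of plain Python. `LazyDataset` below is a hypothetical toy, not Spark's API: it only shows that transformations append to a plan while the action is what walks the data.

```python
class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset (hypothetical, not Spark's API)."""

    def __init__(self, rows, plan=()):
        self._rows = rows
        self._plan = plan  # recorded steps, not yet executed

    def map(self, fn):
        # Transformation: record the step and return a new dataset.
        return LazyDataset(self._rows, self._plan + (("map", fn),))

    def filter(self, predicate):
        # Transformation: again only recorded.
        return LazyDataset(self._rows, self._plan + (("filter", predicate),))

    def collect(self):
        # Action: only now is the recorded plan executed over the data.
        out = []
        for row in self._rows:
            keep = True
            for kind, fn in self._plan:
                if kind == "map":
                    row = fn(row)
                elif not fn(row):
                    keep = False
                    break
            if keep:
                out.append(row)
        return out

calls = []

def double(x):
    calls.append(x)  # side effect lets us observe when work actually happens
    return x * 2

ds = LazyDataset([1, 2, 3, 4]).map(double).filter(lambda x: x > 4)
calls_before_action = len(calls)  # still zero: nothing has run yet
result = ds.collect()
```

Because the whole plan is known before anything runs, an engine has the chance to reorder or fuse steps; the toy does no such optimization.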

Question 3

Should you normalize or denormalize tables in a data warehouse, and why?

Accepted Answer

Data warehouses generally favor denormalization: wide, flat tables that trade storage for query simplicity and performance. Normalization (splitting tables to eliminate redundancy) reduces storage but multiplies join hops, increasing query complexity and optimizer cost. In columnar warehouses with compression, the storage cost of redundancy is usually small, so denormalized models such as star schemas typically outperform fully normalized models for analytical workloads.

Question 4

What is the difference between OLTP and OLAP systems, and why can't you run analytics on your production database?

Accepted Answer

OLTP (Online Transaction Processing) systems handle high-throughput, low-latency reads and writes for individual records, such as order placement and user authentication. OLAP (Online Analytical Processing) systems handle complex aggregations over millions of rows for business intelligence. Running heavy analytics directly on an OLTP database can hold locks or long-running snapshots, competes for I/O, memory, and CPU, and slows the application queries that customers feel.

Question 5

When should you use Spark instead of pandas, and what are the key trade-offs?

Accepted Answer

pandas operates in-memory on a single machine, making it fast and simple for datasets under a few gigabytes. Spark distributes computation across a cluster, handles terabyte-scale data, and integrates with cloud storage — but adds significant overhead for small data. The crossover point is roughly when your data no longer fits in RAM or when processing time on a single machine becomes unacceptable.

Question 6

What is the difference between a star schema and a snowflake schema in dimensional modeling?

Accepted Answer

A star schema has a central fact table joined directly to denormalized dimension tables — one join hop per dimension, simple queries, better query performance. A snowflake schema normalizes dimension tables into sub-dimensions, reducing storage redundancy but requiring more joins. Star schemas are preferred for analytics workloads; snowflake schemas are sometimes used when a dimension is very large and has many redundant attribute values.
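As an illustration of the one-hop-per-dimension shape, the sketch below builds a tiny star schema in sqlite3. The table and column names (`fact_sales`, `dim_product`, `dim_date`) and all rows are invented for the example.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, day TEXT, month TEXT);
CREATE TABLE fact_sales (product_id INTEGER, date_id INTEGER, amount REAL);
INSERT INTO dim_product VALUES
    (1, 'Widget', 'Hardware'), (2, 'Gizmo', 'Hardware'), (3, 'Manual', 'Books');
INSERT INTO dim_date VALUES (10, '2024-01-01', '2024-01'), (11, '2024-02-01', '2024-02');
INSERT INTO fact_sales VALUES (1, 10, 5.0), (2, 10, 7.0), (3, 11, 3.0), (1, 11, 4.0);
""")

# One join per dimension: the fact table is the hub, dimensions are the points.
sales_by_category = db.execute("""
SELECT p.category, d.month, SUM(f.amount)
FROM fact_sales f
JOIN dim_product p ON p.product_id = f.product_id
JOIN dim_date d ON d.date_id = f.date_id
GROUP BY p.category, d.month
ORDER BY p.category, d.month
""").fetchall()
```

A snowflake variant would move `category` into its own table referenced from `dim_product`, so the same query would need a third join.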

Question 7

What are the differences between a data warehouse, a data lake, and a data lakehouse?

Accepted Answer

A data warehouse stores structured, schema-on-write data optimized for SQL analytics but is expensive for raw or unstructured data. A data lake stores any format cheaply on object storage but, on its own, lacks ACID transactions and warehouse-grade query performance. A lakehouse layers open table formats (Delta Lake, Iceberg, Hudi) on object storage to deliver warehouse-grade performance and ACID semantics at data lake costs; it is arguably the dominant architecture in 2026.

Question 8

What is the difference between batch and streaming data pipelines, and how do you choose between them?

Accepted Answer

Batch pipelines process data in bounded chunks on a schedule — simple to build and test, but latency is measured in hours or days. Streaming pipelines process records continuously as they arrive — latency drops to seconds or milliseconds, but correctness requires handling late arrivals, watermarks, and stateful aggregations. Choose streaming when business decisions need fresh data; choose batch when daily freshness is acceptable and operational simplicity matters.

Question 9

What is a broadcast join in Spark and when should you use it?

Accepted Answer

A broadcast join sends a complete copy of the smaller table to every executor, so the join is done locally without any shuffle. It is the most effective single optimization for joins where one side is small enough to fit in executor memory, eliminating the most expensive network operation in a join.
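The mechanics can be sketched without Spark: build a lookup from the small table once, then probe it independently inside each partition of the large table. The data and the `broadcast_join` helper are hypothetical; in a real cluster the lookup would be shipped to each executor rather than shared in one process.

```python
# Large side, already split into partitions that would live on different executors.
order_partitions = [
    [("o1", "US"), ("o2", "DE")],
    [("o3", "US"), ("o4", "FR")],
]
# Small side: assumed small enough to copy to every executor.
countries = [("US", "United States"), ("DE", "Germany"), ("FR", "France")]

def broadcast_join(partitions, small_table):
    lookup = dict(small_table)  # the "broadcast" copy, built once
    joined = []
    for partition in partitions:
        # Each partition is joined locally; no rows move between partitions.
        joined.append(
            [(oid, code, lookup[code]) for oid, code in partition if code in lookup]
        )
    return joined

joined_partitions = broadcast_join(order_partitions, countries)
```

The partition layout of the large side is unchanged after the join, which is exactly what avoiding the shuffle means here.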

Question 10

How does columnar storage work, and how does partitioning improve query performance in a data warehouse?

Accepted Answer

Columnar storage colocates values from the same column on disk, so aggregation queries read only the columns they need rather than full rows, which dramatically reduces I/O on wide tables. Partitioning physically separates data into subdirectories (e.g., by date), allowing the query engine to skip every partition that cannot satisfy the query's filter predicate, cutting scan volume from the full table to just the relevant slice.
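A rough in-memory analogy, with lists standing in for files on disk: the column layout lets an aggregate touch one list, and the partition map lets a filter touch one bucket. The rows and the `scan` helper are made up for the example.

```python
# Row layout: every read touches whole records.
rows = [
    {"day": "2024-01-01", "user": "a", "amount": 10},
    {"day": "2024-01-01", "user": "b", "amount": 20},
    {"day": "2024-01-02", "user": "a", "amount": 5},
]

# Columnar layout: one contiguous list per column.
columns = {name: [r[name] for r in rows] for name in rows[0]}
total_amount = sum(columns["amount"])  # reads only the "amount" column

# Partitioning: group rows under a partition key, as directories would on disk.
partitions = {}
for r in rows:
    partitions.setdefault(r["day"], []).append(r)

def scan(partitions, day):
    """Return matching rows and how many partitions were actually read."""
    touched = [key for key in partitions if key == day]  # pruning step
    return [r for key in touched for r in partitions[key]], len(touched)

matched, partitions_read = scan(partitions, "2024-01-02")
```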

Question 11

How does Apache Airflow work, and what is a DAG backfill?

Accepted Answer

Airflow models pipelines as Directed Acyclic Graphs (DAGs) of tasks, each with defined dependencies. The scheduler triggers DAG runs based on a cron schedule, passing each run a logical execution date rather than the wall-clock time. A backfill re-runs a DAG over a historical date range, allowing you to populate data for past periods after adding a new pipeline or fixing a bug — as long as tasks are idempotent.
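The idea of a backfill can be shown without Airflow itself. The sketch below is plain Python, not Airflow's API: `backfill` and `load_partition` are hypothetical names, and a dict stands in for a date-partitioned table. Each run receives a logical date, and the task overwrites the partition for that date.

```python
from datetime import date, timedelta

target = {}  # stand-in for a date-partitioned table

def load_partition(logical_date: date) -> None:
    # Keyed by logical date and overwritten, so reruns are idempotent.
    target[logical_date.isoformat()] = f"rows for {logical_date.isoformat()}"

def backfill(start: date, end: date) -> list:
    """Run the task once per daily logical date in [start, end]."""
    ran = []
    current = start
    while current <= end:
        load_partition(current)
        ran.append(current.isoformat())
        current += timedelta(days=1)
    return ran

runs = backfill(date(2024, 1, 1), date(2024, 1, 3))
backfill(date(2024, 1, 2), date(2024, 1, 3))  # overlapping rerun leaves state unchanged
```

Note that the task never consults the wall clock; everything it writes is derived from the logical date it is handed.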

Question 12

What are data quality checks and data contracts, and how do you enforce them in a modern data stack?

Accepted Answer

Data quality checks assert that datasets meet defined expectations — completeness, uniqueness, referential integrity, value ranges — and fail the pipeline or alert when they do not. Data contracts are formal, version-controlled agreements between data producers and consumers specifying schema, semantics, and SLAs, preventing silent breaking changes from propagating downstream.
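The four kinds of expectation listed above can be written as ordinary assertions over rows. This is a hand-rolled sketch, not the API of any data quality tool; the column names and the 0 to 10000 range are assumptions.

```python
def check_dataset(rows, known_customer_ids):
    """Return a list of failed expectations; an empty list means the data passed."""
    failures = []
    ids = [r["order_id"] for r in rows]
    if any(r["order_id"] is None for r in rows):
        failures.append("completeness: order_id has nulls")
    if len(ids) != len(set(ids)):
        failures.append("uniqueness: duplicate order_id")
    if any(r["customer_id"] not in known_customer_ids for r in rows):
        failures.append("referential integrity: unknown customer_id")
    if any(not (0 <= r["amount"] <= 10_000) for r in rows):
        failures.append("range: amount outside [0, 10000]")
    return failures

good = [{"order_id": 1, "customer_id": "c1", "amount": 50}]
bad = good + [{"order_id": 1, "customer_id": "c9", "amount": -5}]

good_failures = check_dataset(good, {"c1"})
bad_failures = check_dataset(bad, {"c1"})
```

A pipeline step would typically raise or alert when the returned list is non-empty.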

Question 13

What does idempotency mean for a data pipeline, and how do you make a pipeline idempotent?

Accepted Answer

An idempotent pipeline produces the same output no matter how many times it runs for the same logical window: rerunning it on an already-processed date partition yields identical results rather than duplicated rows. Achieving idempotency typically means using INSERT OVERWRITE (or MERGE) instead of plain INSERT, keying every record with a deterministic ID, or deleting the target partition and re-inserting it in the same transaction.
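A minimal sketch of the replace-the-partition approach, using sqlite3 in place of a warehouse. The table `daily_sales` and the `load_day` helper are hypothetical; the point is that the delete and the insert commit together.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE daily_sales (day TEXT, order_id TEXT, amount REAL)")

def load_day(day, rows):
    """Replace the whole partition for `day` in a single transaction."""
    with db:  # commits on success, rolls back on error
        db.execute("DELETE FROM daily_sales WHERE day = ?", (day,))
        db.executemany(
            "INSERT INTO daily_sales VALUES (?, ?, ?)",
            [(day, order_id, amount) for order_id, amount in rows],
        )

batch = [("o1", 10.0), ("o2", 15.0)]
load_day("2024-01-01", batch)
load_day("2024-01-01", batch)  # rerun for the same logical window

row_count = db.execute("SELECT COUNT(*) FROM daily_sales").fetchone()[0]
```

With a plain INSERT the second call would have doubled the rows; here the count stays at two.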

Question 14

What are the core concepts of Apache Kafka and how does it guarantee message durability and ordering?

Accepted Answer

Kafka is a distributed, append-only log. Producers write messages to topics, which are split into partitions stored on brokers. Consumers read from partitions at their own pace using offsets. Durability is guaranteed by replication across brokers; ordering is guaranteed within a single partition but not across partitions.
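The partition and offset model can be illustrated with an in-memory toy. `ToyTopic` is hypothetical and is not a Kafka client; it uses CRC32 as a stand-in hash (not Kafka's own partitioner) and leaves out replication entirely.

```python
import zlib

class ToyTopic:
    """In-memory stand-in for a partitioned log (hypothetical, not a Kafka client)."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def produce(self, key, value):
        # The same key always maps to the same partition, so per-key order is kept.
        p = zlib.crc32(key.encode()) % len(self.partitions)
        self.partitions[p].append((key, value))
        return p, len(self.partitions[p]) - 1  # (partition, offset)

    def consume(self, partition, offset):
        # Reading removes nothing; the consumer only tracks its own offset.
        return self.partitions[partition][offset:]

topic = ToyTopic(num_partitions=3)
for i in range(4):
    topic.produce("user-1", f"event-{i}")
    topic.produce("user-2", f"other-{i}")

p1, last_offset = topic.produce("user-1", "event-4")
user1_values = [v for k, v in topic.consume(p1, 0) if k == "user-1"]
```

Events for one key come back in the order they were produced, while nothing is promised about how `user-1` and `user-2` events interleave across partitions.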

Question 15

What is the difference between narrow and wide transformations in Spark?

Accepted Answer

Narrow transformations compute each output partition using data from exactly one input partition — no data moves across the network. Wide transformations require data from multiple input partitions, forcing a shuffle across the network, which is the most expensive operation in a Spark job.
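A toy contrast in plain Python, with nested lists standing in for partitions. The `shuffle_by_key` helper and its byte-sum hash are assumptions chosen so the routing is easy to follow by hand.

```python
from collections import defaultdict

partitions = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)]]

# Narrow: each output partition is computed from exactly one input partition.
doubled = [[(k, v * 2) for k, v in part] for part in partitions]

def shuffle_by_key(parts, num_out):
    """Wide: route every record to the output partition that owns its key."""
    out = [defaultdict(list) for _ in range(num_out)]
    for part in parts:
        for k, v in part:
            # Simple deterministic hash: sum of the key's bytes.
            out[sum(k.encode()) % num_out][k].append(v)
    return [dict(d) for d in out]

grouped = shuffle_by_key(partitions, num_out=2)
```

The two values for key `a` started in different input partitions and end up together, which is the data movement a narrow step never needs.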

Question 16

What is the difference between a message queue and an event stream?

Accepted Answer

A queue usually distributes work among competing consumers: a worker receives or leases a message, acknowledges success, and retries failure. A stream retains an ordered event log for a retention period; consumers track offsets and independent consumer groups can replay the same events. Choose by retention, replay, ordering and consumer semantics rather than product name.

Question 17

What is the difference between an RDD, a DataFrame, and a Dataset in Spark?

Accepted Answer

RDD is the low-level, type-safe distributed collection with no schema knowledge. DataFrame adds a named-column schema on top, enabling the Catalyst optimizer and codegen — but loses compile-time type safety. Dataset merges both worlds: it carries a schema and passes through Catalyst while remaining statically typed in Scala/Java.

Question 18

What is the difference between repartition and coalesce in Spark?

Accepted Answer

repartition triggers a full shuffle to produce exactly N evenly distributed partitions and can both increase and decrease partition count. coalesce merges existing partitions on the same or nearby executors without a shuffle, but can only decrease partition count and may produce uneven partitions.
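The difference in resulting partition sizes can be seen with a small simulation. Both functions below are hypothetical simplifications: the real `coalesce` takes data locality into account, and the round-robin dealing here is only an approximation of how `repartition` spreads rows.

```python
def coalesce(partitions, n):
    """Merge neighbouring partitions without moving individual records."""
    n = min(n, len(partitions))  # coalesce can only reduce the count
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        merged[i * n // len(partitions)].extend(part)
    return merged

def repartition(partitions, n):
    """Full shuffle: deal every record round-robin into n new partitions."""
    out = [[] for _ in range(n)]
    records = [r for part in partitions for r in part]
    for i, record in enumerate(records):
        out[i % n].append(record)
    return out

skewed = [[1, 2, 3, 4, 5, 6], [7], [8], [9]]
coalesced_sizes = [len(p) for p in coalesce(skewed, 2)]
repartitioned_sizes = [len(p) for p in repartition(skewed, 2)]
```

Merging whole partitions keeps the original imbalance (7 and 2 records), while redistributing individual records evens it out (5 and 4) at the cost of touching every row.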

Question 19

How do you handle schema evolution in data pipelines without breaking downstream consumers?

Accepted Answer

Schema evolution covers adding, renaming, removing, or retyping columns in a data stream or table over time. Safe strategies include: only adding nullable columns (backwards-compatible), using schema registries to enforce compatibility rules before a producer publishes, and open table formats like Iceberg that track schema history and allow column renames and reorders without rewriting data.
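A compatibility rule of the kind a schema registry enforces can be sketched as a small function. This is a simplified, hypothetical check, not the rule set of any particular registry: schemas are plain dicts of `column: (type, nullable)`, and only removals, retypes, and new required columns are flagged.

```python
def breaking_changes(old_schema, new_schema):
    """Return the reasons the new schema could break consumers of the old one."""
    problems = []
    for name, (col_type, _nullable) in old_schema.items():
        if name not in new_schema:
            problems.append(f"removed column: {name}")
        elif new_schema[name][0] != col_type:
            problems.append(f"retyped column: {name}")
    for name, (_col_type, nullable) in new_schema.items():
        if name not in old_schema and not nullable:
            problems.append(f"added required column: {name}")
    return problems

old = {"id": ("int", False), "email": ("string", True)}
safe = dict(old, signup_source=("string", True))  # nullable addition only
unsafe = {"id": ("string", False), "plan": ("string", False)}

safe_result = breaking_changes(old, safe)
unsafe_result = breaking_changes(old, unsafe)
```

Running such a check before a producer publishes turns a silent downstream failure into a rejected change.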

Question 20

What are slowly changing dimensions, and how do Type 1 and Type 2 differ?

Accepted Answer

Slowly changing dimensions (SCDs) handle attributes that change over time — a customer moving cities, a product changing category. Type 1 overwrites the old value and loses history. Type 2 inserts a new row with effective and expiry dates, preserving the full history of what was true at any point in time. Type 2 is the standard when accurate historical reporting matters.
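The two behaviours differ in a single step, shown here with in-memory structures. The column names (`valid_from`, `valid_to`) and the `9999-12-31` sentinel for the current row are common conventions assumed for this sketch, not something the answer prescribes.

```python
OPEN_END = "9999-12-31"  # assumed sentinel meaning "still current"

def apply_type1(dim, key, city):
    dim[key] = city  # overwrite; the previous value is lost

def apply_type2(history, key, city, change_date):
    for row in history:
        if row["customer"] == key and row["valid_to"] == OPEN_END:
            row["valid_to"] = change_date  # close the current version
    history.append(
        {"customer": key, "city": city, "valid_from": change_date, "valid_to": OPEN_END}
    )

def city_as_of(history, key, day):
    """Look up the version that was valid on `day` (ISO dates compare as strings)."""
    for row in history:
        if row["customer"] == key and row["valid_from"] <= day < row["valid_to"]:
            return row["city"]
    return None

type1 = {"c1": "Paris"}
apply_type1(type1, "c1", "Berlin")

type2 = []
apply_type2(type2, "c1", "Paris", "2023-01-01")
apply_type2(type2, "c1", "Berlin", "2024-06-01")
```

Only the Type 2 history can still answer where the customer lived at the end of 2023.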

Question 21

How do cache and persist work in Spark, and when should you use each storage level?

Accepted Answer

cache() marks a DataFrame for storage on the executors using the default MEMORY_AND_DISK storage level, which keeps partitions in memory and spills those that do not fit to disk. persist() lets you choose the storage level: memory-only, disk-only, serialized, or replicated. Both are lazy, so nothing is stored until the first action runs. Use caching when a DataFrame is reused multiple times in the same application; without it, Spark recomputes the lineage on each action.

Question 22

Explain the Spark driver/executor model and what each component does.

Accepted Answer

The driver is a single JVM process that hosts the SparkContext, builds the DAG, schedules tasks, and coordinates results. Executors are JVM processes on worker nodes that actually run tasks and cache data. The cluster manager (YARN, Kubernetes, standalone) sits between them, allocating resources.

Question 23

What is a shuffle in Spark and why is it expensive?

Accepted Answer

A shuffle redistributes data across all executors so that rows with the same key end up on the same partition. It involves writing intermediate data to disk, transferring it over the network, and re-reading it — making it the most costly operation in a Spark job in terms of latency and I/O.

Question 24

What is Change Data Capture (CDC) and how is it implemented?

Accepted Answer

CDC continuously captures row-level inserts, updates, and deletes from a source database and streams them downstream, enabling near-real-time replication to a warehouse or data lake without full table scans. The most robust implementation reads the database's transaction log (for example, the write-ahead log in PostgreSQL or the binlog in MySQL), making it low-impact on the source and capable of capturing deletes that polling-based approaches miss entirely.
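On the consuming side, applying a change stream is a small loop. The event shape below (`op`, `key`, `row`) is an assumption for illustration and does not match the format of any specific CDC tool.

```python
def apply_changes(replica, events):
    """Apply ordered change events to a replica keyed by primary key."""
    for event in events:
        op, key = event["op"], event["key"]
        if op in ("insert", "update"):
            replica[key] = event["row"]  # upsert the latest row image
        elif op == "delete":
            replica.pop(key, None)  # deletes are explicit events, not missing rows
    return replica

events = [
    {"op": "insert", "key": 1, "row": {"name": "Ada", "city": "Paris"}},
    {"op": "insert", "key": 2, "row": {"name": "Bob", "city": "Rome"}},
    {"op": "update", "key": 1, "row": {"name": "Ada", "city": "Berlin"}},
    {"op": "delete", "key": 2},
]

replica = apply_changes({}, events)
```

A polling query against the source after these four changes would simply not see row 2, with no signal that it had ever existed; the log-based stream carries the delete as its own event.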

Question 25

Compare Parquet, CSV, and Avro as big-data file formats — when do you use each?

Accepted Answer

Parquet is a columnar, compressed format optimized for analytical reads — only the queried columns are scanned. Avro is row-oriented, schema-embedded, and optimized for write-heavy pipelines and Kafka serialization. CSV is human-readable but schema-less, uncompressed, and slow at scale — use it only at system boundaries where a downstream tool requires it.

Question 26

How does the Spark Catalyst optimizer work, and what does Adaptive Query Execution add?

Accepted Answer

Catalyst is a rule-based and cost-based query optimizer that transforms a logical plan through four phases (analysis, logical optimization, physical planning, and code generation) before any data is touched. Adaptive Query Execution (AQE), introduced in Spark 3.0 and enabled by default since Spark 3.2, extends this by re-optimizing the physical plan at runtime using actual shuffle statistics rather than pre-execution estimates.

Question 27

What is data skew in Spark and how do you fix it with salting?

Accepted Answer

Data skew occurs when one or more keys hold disproportionately many rows, causing a few tasks to process gigabytes while the rest finish in seconds; one slow task stalls the entire stage. Salting appends a random suffix to skewed keys before a join or aggregation, spreading the hot key across multiple partitions. For an aggregation, a second pass removes the salt and combines the partial results; for a join, the other side must be replicated once per salt value so that every salted key still finds its match.
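For the aggregation case, salting amounts to a two-stage count. The sketch below is plain Python rather than Spark code; `salted_count`, the fixed seed, and the choice of 8 salts are assumptions made so the example is reproducible.

```python
import random
from collections import Counter

def salted_count(keys, num_salts, seed=0):
    rng = random.Random(seed)
    # Stage 1: aggregate on (key, salt), so a hot key is split into several groups.
    partial = Counter((k, rng.randrange(num_salts)) for k in keys)
    # Stage 2: drop the salt and combine the partial counts per original key.
    final = Counter()
    for (k, _salt), n in partial.items():
        final[k] += n
    return partial, final

keys = ["hot"] * 1000 + ["cold"] * 10
partial, final = salted_count(keys, num_salts=8)
largest_partial = max(partial.values())
```

No single stage-1 group has to hold all 1000 rows of the hot key, yet the final counts are the same as an unsalted aggregation would give.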

Question 28

What does exactly-once processing mean in streaming pipelines, and how is it achieved?

Accepted Answer

Exactly-once guarantees that each input record affects the output exactly one time — no duplicates from retries, no gaps from dropped messages. In practice it requires coordination between the messaging system, the processing engine, and the sink: the processor must checkpoint its position and write output atomically, so that a restart replays from the last checkpoint without re-emitting already-written results.
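One way to picture the checkpoint-plus-atomic-write requirement is to store the output and the input position in the same transactional sink. The sketch uses sqlite3 as that sink; the table names, the list standing in for a replayable log, and the simulated crash are all assumptions.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE totals (name TEXT PRIMARY KEY, value INTEGER)")
db.execute("CREATE TABLE checkpoint (id INTEGER PRIMARY KEY, next_offset INTEGER)")
db.execute("INSERT INTO totals VALUES ('sum', 0)")
db.execute("INSERT INTO checkpoint VALUES (1, 0)")
db.commit()

log = [5, 7, 11, 13]  # replayable input, addressed by offset

def process(log, fail_at=None):
    start = db.execute("SELECT next_offset FROM checkpoint").fetchone()[0]
    for offset in range(start, len(log)):
        if offset == fail_at:
            raise RuntimeError("simulated crash")
        with db:  # output and checkpoint commit together or not at all
            db.execute(
                "UPDATE totals SET value = value + ? WHERE name = 'sum'",
                (log[offset],),
            )
            db.execute("UPDATE checkpoint SET next_offset = ?", (offset + 1,))

try:
    process(log, fail_at=2)  # crashes after two records are committed
except RuntimeError:
    pass
process(log)  # restart resumes from the committed checkpoint

total = db.execute("SELECT value FROM totals").fetchone()[0]
```

If the checkpoint were stored separately from the output, a crash between the two writes would either replay a record already counted or skip one that was not.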

Question 29

What causes out-of-memory errors in Spark and how do you diagnose and fix them?

Accepted Answer

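Spark OOM errors fall into two categories: driver OOM (usually from collect() or large broadcast tables) and executor OOM (from insufficient heap for task execution, shuffle buffers, or cached data). Diagnosing means checking the Spark UI and the driver and executor logs to identify which stage failed and whether the pressure is in storage, execution, or user memory. Typical fixes follow from that diagnosis: avoid collect() on large results, lower the broadcast threshold, increase the partition count so each task handles less data, or raise driver or executor memory.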