Data Engineering interview questions
28 of the most common Data Engineering questions for data and AI interviews — each with a worked answer, the trap to avoid, and a link to learn it properly. Pipelines, Spark, warehousing, data modeling.
- What is the difference between ETL and ELT, and when should you choose each? Easy · Snowflake · Databricks · dbt Labs
- What is lazy evaluation in Spark, and how does it relate to the distinction between transformations and actions? Easy · Amazon · Databricks · Microsoft
- Should you normalize or denormalize tables in a data warehouse, and why? Easy · Snowflake · BigQuery · dbt Labs
- What is the difference between OLTP and OLAP systems, and why can't you run analytics on your production database? Easy · PostgreSQL · Oracle · Snowflake
- When should you use Spark instead of pandas, and what are the key trade-offs? Easy · Amazon · Google · Meta
- What is the difference between a star schema and a snowflake schema in dimensional modeling? Easy · Kimball Group · dbt Labs · Snowflake
- What are the differences between a data warehouse, a data lake, and a data lakehouse? Easy · Snowflake · Databricks · Google
- What is the difference between batch and streaming data pipelines, and how do you choose between them? Medium · Kafka · Google · Meta
- What is a broadcast join in Spark and when should you use it? Medium · Databricks · Amazon · Google
- How does columnar storage work, and how does partitioning improve query performance in a data warehouse? Medium · Snowflake · BigQuery · Databricks
- How does Apache Airflow work, and what is a DAG backfill? Medium · Airflow · Astronomer · Google
- What are data quality checks and data contracts, and how do you enforce them in a modern data stack? Medium · dbt Labs · Great Expectations · Monte Carlo
- What does idempotency mean for a data pipeline, and how do you make a pipeline idempotent? Medium · Airflow · Snowflake · Databricks
- What are the core concepts of Apache Kafka and how does it guarantee message durability and ordering? Medium · LinkedIn · Uber · Airbnb
- What is the difference between narrow and wide transformations in Spark? Medium · Databricks · Amazon · Google
- What is the difference between an RDD, a DataFrame, and a Dataset in Spark? Medium · Databricks · Amazon · Google
- What is the difference between repartition and coalesce in Spark? Medium · Databricks · Amazon · Airbnb
- How do you handle schema evolution in data pipelines without breaking downstream consumers? Medium · Confluent · Databricks · Snowflake
- What are slowly changing dimensions, and how do Type 1 and Type 2 differ? Medium · dbt Labs · Snowflake · Databricks
- How do cache and persist work in Spark, and when should you use each storage level? Medium · Databricks · Amazon · Netflix
- Explain the Spark driver/executor model and what each component does. Medium · Databricks · Amazon · Google
- What is a shuffle in Spark and why is it expensive? Medium · Databricks · Amazon · Netflix
- What is Change Data Capture (CDC) and how is it implemented? Medium · Debezium · Confluent · Fivetran
- Compare Parquet, CSV, and Avro as big-data file formats. When do you use each? Medium · Databricks · Amazon · Google
- How does the Spark Catalyst optimizer work, and what does Adaptive Query Execution add? Hard · Databricks · Amazon · Google
- What is data skew in Spark and how do you fix it with salting? Hard · Databricks · Amazon · Meta
- What does exactly-once processing mean in streaming pipelines, and how is it achieved? Hard · Kafka · Flink · Google
- What causes out-of-memory errors in Spark and how do you diagnose and fix them? Hard · Databricks · Amazon · Netflix