datarekha
Data Engineering · Medium · Asked at Databricks, Amazon, Google, Snowflake, Netflix

Compare Parquet, CSV, and Avro as big-data file formats — when do you use each?

The short answer

Parquet is a columnar, compressed format optimized for analytical reads — only the queried columns are scanned. Avro is row-oriented, schema-embedded, and optimized for write-heavy pipelines and Kafka serialization. CSV is human-readable but schema-less, uncompressed, and slow at scale — use it only at system boundaries where a downstream tool requires it.

How to think about it

File format choice has a direct impact on query latency, storage cost, and pipeline complexity. There is no universally best format — the right answer depends on access pattern.

Parquet — columnar, compressed, read-optimized

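Parquet stores data column by column within row groups. A query that reads 3 of 100 columns scans only those 3 column chunks, which is roughly 3% of the file when the columns are of similar size. Each column is encoded (dictionary, RLE, delta) and compressed (Snappy, Zstandard, gzip). Min/max statistics kept per column for every row group enable predicate push-down: Spark can skip entire row groups whose value range cannot satisfy the filter, without reading them.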

# Spark + Parquet: only 'amount' and 'user_id' columns are read from disk
spark.read.parquet("transactions/") \
    .filter("amount > 100") \
    .select("user_id", "amount") \
    .groupBy("user_id").sum("amount")
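
To make row-group skipping concrete, here is a toy model in plain Python. It is a sketch of the idea only, not Parquet's real on-disk layout or reader: the `RowGroup` class and `scan_greater_than` function are hypothetical names invented for this illustration, and each "row group" holds just one `amount` column.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    """Hypothetical stand-in for one row group holding a single 'amount' column."""
    amounts: list

    @property
    def stats(self):
        # Real Parquet stores these min/max values as metadata, so they can be
        # checked without reading the column data itself.
        return min(self.amounts), max(self.amounts)


def scan_greater_than(row_groups, threshold):
    """Return matching values and how many row groups were actually read."""
    matches, groups_read = [], 0
    for group in row_groups:
        _, max_value = group.stats
        if max_value <= threshold:
            continue  # no row in this group can match, so skip it entirely
        groups_read += 1
        matches.extend(v for v in group.amounts if v > threshold)
    return matches, groups_read


row_groups = [
    RowGroup([5, 20, 80]),     # max 80: skipped for amount > 100
    RowGroup([90, 150, 300]),  # max 300: must be read
    RowGroup([10, 40, 100]),   # max 100: skipped (100 is not > 100)
]
matches, groups_read = scan_greater_than(row_groups, 100)
print(matches, groups_read)  # [150, 300] 1
```

Skipping works best when the data is sorted or clustered on the filtered column; if every row group spans the full value range, the statistics rule nothing out.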

Use Parquet for: data lakes, analytical queries, Hive/Spark/Presto/Athena workloads, Delta Lake and Iceberg tables.

Avro — row-oriented, schema-embedded, write-optimized

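Avro writes data row by row and embeds the schema (defined in JSON) in the file header. Reads are sequential, which suits streaming systems. Confluent's Schema Registry, the registry most commonly paired with Kafka, was originally built around Avro; in that setup each message carries a schema ID rather than the full schema. Avro handles schema evolution cleanly as long as added or removed fields have default values (for example, nullable fields that default to null).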

# Avro support lives in the external spark-avro module, which must be on the classpath
df.write.format("avro").save("events_avro/")
events = spark.read.format("avro").load("events_avro/")
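
The schema-evolution rule can be sketched in plain Python. This is a simplified illustration of Avro-style schema resolution, not the Avro library itself: the `Event` schemas, the `country` field, and the `resolve` function are assumptions made up for the example, and real Avro also handles type promotion, aliases, and nested records.

```python
# Schema used by the producer that wrote the old data.
writer_schema = {
    "type": "record",
    "name": "Event",
    "fields": [
        {"name": "user_id", "type": "long"},
        {"name": "action", "type": "string"},
    ],
}

# Newer reader schema: adds a nullable field with a default value.
reader_schema = {
    "type": "record",
    "name": "Event",
    "fields": [
        {"name": "user_id", "type": "long"},
        {"name": "action", "type": "string"},
        {"name": "country", "type": ["null", "string"], "default": None},
    ],
}


def resolve(record, schema):
    """Project a decoded record onto the given schema, filling in defaults."""
    resolved = {}
    for field in schema["fields"]:
        name = field["name"]
        if name in record:
            resolved[name] = record[name]
        elif "default" in field:
            resolved[name] = field["default"]
        else:
            raise ValueError(f"no value or default for field {name!r}")
    return resolved  # fields the schema does not know about are dropped


# Old data read by new code: the missing field takes its default.
old_record = {"user_id": 42, "action": "click"}
upgraded = resolve(old_record, reader_schema)

# New data read by old code: the unknown field is ignored.
new_record = {"user_id": 7, "action": "view", "country": "DE"}
downgraded = resolve(new_record, writer_schema)

print(upgraded)
print(downgraded)
```

The default is what makes the change safe in both directions: without it, a reader expecting `country` would have no value to use for records written before the field existed.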

Use Avro for: Kafka message serialization, event streaming pipelines, write-heavy ingestion layers, any system that needs reliable schema evolution.

CSV — text, human-readable, no schema

CSV has no types, no built-in compression, and no statistics. Every query has to parse every byte of every row, because there is no way to skip columns or row ranges. But it is universally understood: any tool from Excel to awk to pandas can read it.

df.write.option("header", True).csv("export_for_excel/")
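
The "no types" point is easy to see with Python's standard `csv` module. The sketch below round-trips one made-up row through an in-memory CSV; the column names and values are illustrative only.

```python
import csv
import io

# One row with an int, a float, and a bool.
rows = [{"user_id": 1, "amount": 120.5, "active": True}]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["user_id", "amount", "active"])
writer.writeheader()
writer.writerows(rows)

# Read the same text back: every value is now a plain string.
buffer.seek(0)
read_back = list(csv.DictReader(buffer))
print(read_back)

# The consumer has to re-apply the types by convention.
typed = [
    {
        "user_id": int(row["user_id"]),
        "amount": float(row["amount"]),
        "active": row["active"] == "True",
    }
    for row in read_back
]
print(typed == rows)
```

Every consumer of the file has to repeat that casting step and agree on the conventions (date formats, null markers, boolean spellings), which is the work a self-describing format like Parquet or Avro does for you.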

Use CSV for: data exports to non-Spark consumers, small reference files, system boundaries with third-party tools that cannot handle binary formats.

Summary comparison

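|                     | Parquet        | Avro             | CSV            |
|---------------------|----------------|------------------|----------------|
| Orientation         | Columnar       | Row              | Row            |
| Schema              | In file footer | In file header   | None           |
| Compression         | Column-level   | Block-level      | None (or gzip) |
| Column pruning      | Yes            | No               | No             |
| Predicate push-down | Yes            | No               | No             |
| Schema evolution    | Limited        | Excellent        | N/A            |
| Best for            | Analytics      | Streaming/ingest | Export/interop |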
