Compare Parquet, CSV, and Avro as big-data file formats — when do you use each?
Parquet is a columnar, compressed format optimized for analytical reads — only the queried columns are scanned. Avro is row-oriented, schema-embedded, and optimized for write-heavy pipelines and Kafka serialization. CSV is human-readable but schema-less, uncompressed, and slow at scale — use it only at system boundaries where a downstream tool requires it.
How to think about it
File format choice has a direct impact on query latency, storage cost, and pipeline complexity. There is no universally best format — the right answer depends on access pattern.
Parquet — columnar, compressed, read-optimized
Parquet stores data column by column within row groups. A query that reads 3 of 100 columns scans only those 3 column chunks, which is roughly 3% of the data if the columns are similar in size. Each column is encoded (dictionary, RLE, delta) and compressed (Snappy, Zstandard, gzip). Min/max statistics kept per column in each row group enable predicate push-down: Spark can skip entire row groups whose value range cannot match the filter.
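# Spark + Parquet: only the 'amount' and 'user_id' columns are read from disk
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
totals = (
    spark.read.parquet("transactions/")
    .filter("amount > 100")        # eligible for predicate push-down
    .select("user_id", "amount")   # column pruning
    .groupBy("user_id").sum("amount")
)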
Use Parquet for: data lakes, analytical queries, Hive/Spark/Presto/Athena workloads, Delta Lake and Iceberg tables.
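To make column pruning and row-group skipping concrete, here is a minimal sketch in plain Python. It is a toy model, not the real Parquet layout or any Parquet library API: the helpers `build_row_groups` and `scan`, the sample rows, and the group size of 2 are all assumptions made for illustration.

```python
# Toy model of a columnar file: NOT the real Parquet layout, only the idea.
# Each row group stores every column separately, plus min/max stats per column.

def build_row_groups(rows, group_size):
    """Split rows (a list of dicts) into row groups of column arrays with stats."""
    groups = []
    for start in range(0, len(rows), group_size):
        chunk = rows[start:start + group_size]
        columns = {name: [r[name] for r in chunk] for name in chunk[0]}
        stats = {name: (min(vals), max(vals)) for name, vals in columns.items()}
        groups.append({"columns": columns, "stats": stats})
    return groups


def scan(groups, wanted, filter_col, threshold):
    """Return the wanted columns for rows where filter_col > threshold."""
    out, groups_read = [], 0
    for group in groups:
        _, col_max = group["stats"][filter_col]
        if col_max <= threshold:
            continue  # predicate push-down: no row in this group can match
        groups_read += 1
        cols = group["columns"]
        for i, value in enumerate(cols[filter_col]):
            if value > threshold:
                # column pruning: only the wanted columns are touched
                out.append({name: cols[name][i] for name in wanted})
    return out, groups_read


# Hypothetical sample data: 'note' is a wide column the query never needs.
rows = [
    {"user_id": i % 3, "amount": amount, "note": "x" * 50}
    for i, amount in enumerate([5, 20, 40, 60, 150, 300, 10, 30])
]
groups = build_row_groups(rows, group_size=2)
result, groups_read = scan(groups, ["user_id", "amount"], "amount", 100)
print(result, groups_read)
```

With this sample data, only one of the four row groups has a maximum `amount` above 100, so the other three are skipped using statistics alone, and the `note` column is never read.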
Avro — row-oriented, schema-embedded, write-optimized
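Avro writes data row by row, and an Avro container file embeds the JSON-defined schema in its header. Reads are sequential, which suits streaming systems. Confluent's Schema Registry for Kafka was originally built around Avro: the schema lives in the registry, and each message typically carries only a schema ID. Avro handles schema evolution cleanly as long as added or removed fields have default values (for example, nullable fields that default to null).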
# Requires the external spark-avro module on the classpath; df is an existing DataFrame
df.write.format("avro").save("events_avro/")
events = spark.read.format("avro").load("events_avro/")
Use Avro for: Kafka message serialization, event streaming pipelines, write-heavy ingestion layers, any system that needs reliable schema evolution.
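The schema evolution behaviour described above can be sketched in a few lines of plain Python. This is a toy illustration of the idea behind reader/writer schema resolution, not the Avro library or its API: the `resolve` helper, the `NO_DEFAULT` marker, and the field names are hypothetical.

```python
# Toy illustration of Avro-style schema resolution; not the real Avro library.
# A schema here is a list of (field_name, default) pairs.
# NO_DEFAULT marks a field that has no default value.

NO_DEFAULT = object()


def resolve(record, reader_schema):
    """Project a record written with another schema version onto the reader schema."""
    resolved = {}
    for name, default in reader_schema:
        if name in record:
            resolved[name] = record[name]   # field present in the written data
        elif default is not NO_DEFAULT:
            resolved[name] = default        # field added later: use its default
        else:
            raise ValueError(f"missing field {name!r} has no default")
    return resolved                         # writer-only fields are dropped


# Record written with v1 of a hypothetical event schema.
old_record = {"user_id": 7, "amount": 12.5, "legacy_flag": True}

# Reader uses v2: 'currency' was added with a default, 'legacy_flag' was removed.
reader_v2 = [("user_id", NO_DEFAULT), ("amount", NO_DEFAULT), ("currency", "USD")]

new_view = resolve(old_record, reader_v2)
print(new_view)
```

The sketch also shows why defaults matter: a field added without a default cannot be filled in when reading older data, so resolution fails instead of silently inventing a value.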
CSV — text, human-readable, no schema
CSV carries no types, no built-in compression, and no statistics, so every query has to parse every byte. But it is universally understood: any tool from Excel to awk to pandas can read it.
# Spark writes a directory of part files, not a single CSV file
df.write.option("header", True).csv("export_for_excel/")
Use CSV for: data exports to non-Spark consumers, small reference files, system boundaries with third-party tools that cannot handle binary formats.
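The "no types" point is easy to demonstrate with Python's standard `csv` module. The sketch below uses a made-up record and an in-memory buffer in place of a file; it round-trips one row and shows that every value comes back as a string.

```python
import csv
import io

# Hypothetical record with an int, a float, a bool, and a null.
rows = [{"user_id": 42, "amount": 19.99, "active": True, "note": None}]

buffer = io.StringIO()  # in-memory stand-in for a file on disk
writer = csv.DictWriter(buffer, fieldnames=["user_id", "amount", "active", "note"])
writer.writeheader()
writer.writerows(rows)

buffer.seek(0)
read_back = list(csv.DictReader(buffer))
first = read_back[0]

# Every value is now a str; the null became an empty string.
print({name: type(value).__name__ for name, value in first.items()})
```

The reader has to re-infer or re-declare types on every load, and a null can no longer be told apart from an empty string, which is the main reason to keep CSV at the edges of a pipeline.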
Summary comparison
| | Parquet | Avro | CSV |
|---|---|---|---|
| Orientation | Columnar | Row | Row |
| Schema | In file footer | In file header | None |
| Compression | Column-level | Block-level | None (or gzip) |
| Column pruning | Yes | No | No |
| Predicate push-down | Yes | No | No |
| Schema evolution | Limited | Excellent | N/A |
| Best for | Analytics | Streaming/ingest | Export/interop |