What is the difference between an RDD, a DataFrame, and a Dataset in Spark?
RDD is the low-level distributed collection: statically typed in Scala/Java, but with no schema that Spark can inspect. DataFrame adds a named-column schema on top, enabling the Catalyst optimizer and code generation, at the cost of compile-time type safety. Dataset combines the two: it carries a schema and passes through Catalyst while remaining statically typed in Scala/Java.
How to think about it
Spark has evolved three abstraction layers. Choosing the right one affects both performance and maintainability.
RDD — Resilient Distributed Dataset
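An RDD is the original Spark API: an immutable, partitioned collection of arbitrary objects. It gives you full control but no schema, no query optimization, and relatively slow general-purpose serialization (Java or Kryo on the JVM, pickle in PySpark).
# assumes an existing SparkContext named sc (e.g. spark.sparkContext)
rdd = sc.parallelize([("alice", 30), ("bob", 25)])
result = rdd.filter(lambda row: row[1] > 26).map(lambda row: row[0])
print(result.collect())  # transformations are lazy; collect() triggers the job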
Use RDDs when you need fine-grained control over partitioning or when working with unstructured data that cannot be expressed as a schema.
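To make the idea of partitioning control concrete, here is a minimal pure-Python sketch of what a custom partitioner does: it maps each key to a bucket index. The helper names `partition_by` and `by_first_letter` are hypothetical and exist only for this illustration. In PySpark the comparable call on a pair RDD is `rdd.partitionBy(numPartitions, partitionFunc)`.

```python
from collections import defaultdict


def partition_by(pairs, num_partitions, partition_func):
    """Group (key, value) pairs into buckets, mimicking a custom partitioner.

    Hypothetical helper for illustration only; not a Spark API.
    """
    buckets = defaultdict(list)
    for key, value in pairs:
        # The bucket index is the partition function's result modulo the bucket count.
        buckets[partition_func(key) % num_partitions].append((key, value))
    return dict(buckets)


def by_first_letter(key):
    """Example partition function: route records by the key's first character."""
    return ord(key[0])


pairs = [("alice", 30), ("bob", 25), ("anna", 41)]
partitions = partition_by(pairs, 2, by_first_letter)
```

Keys that map to the same index end up in the same bucket, which is the property per-key operations rely on.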
DataFrame — schema-aware tabular API
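A DataFrame is a distributed collection of Row objects with a schema. Because Spark knows the schema, the Catalyst optimizer can inspect and rewrite the logical plan, push filters down to the data source, and apply whole-stage code generation. The result is frequently much faster than equivalent RDD code, especially in PySpark, where RDD lambdas run in separate Python worker processes; the actual speedup depends on the workload.
# assumes an existing SparkSession named spark
df = spark.createDataFrame([("alice", 30), ("bob", 25)], ["name", "age"])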
df.filter("age > 26").select("name").show()
DataFrames are the default choice for ETL and analytical workloads. The trade-off is that a misspelled column name is caught only at runtime, when Spark analyzes the query, rather than by a compiler.
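One way to see why a schema matters for optimization: a lambda is opaque to the engine, while a declarative expression can be inspected before any data is read. The toy model below is not Spark code and not how Catalyst is implemented; `GreaterThan`, `columns_needed`, and `scan` are hypothetical names. It sketches column pruning, one of the rewrites an optimizer can apply once it knows which columns a query touches.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GreaterThan:
    """Declarative predicate `column > value`; its fields can be inspected."""
    column: str
    value: int


def columns_needed(predicate, projected):
    """Toy 'optimizer' step: derive the columns a query actually touches."""
    return sorted({predicate.column, *projected})


def scan(rows, columns):
    """Simulate a columnar source that reads only the requested columns."""
    return [{c: row[c] for c in columns} for row in rows]


rows = [
    {"name": "alice", "age": 30, "bio": "x" * 1000},
    {"name": "bob", "age": 25, "bio": "y" * 1000},
]

# RDD style: the engine cannot look inside the lambda, so every row must be
# materialized in full (including the large, unused "bio" field).
rdd_style = [row["name"] for row in filter(lambda row: row["age"] > 26, rows)]

# DataFrame style: the predicate is data, so the unused column is pruned
# before any row is read.
pred = GreaterThan("age", 26)
needed = columns_needed(pred, ["name"])
pruned = scan(rows, needed)
df_style = [r["name"] for r in pruned if r[pred.column] > pred.value]
```

Both paths return the same names, but only the declarative one lets the engine skip reading `bio` at all.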
Dataset — typed DataFrame (Scala/Java only)
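Datasets combine the Catalyst optimizer with compile-time type safety using encoders, which map JVM objects to Spark's internal binary format. In Scala, DataFrame is simply an alias for Dataset[Row]. Python and R expose only the DataFrame API, because neither language has compile-time type checking.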
# Python sees DataFrames; Dataset typing is a Scala/Java concept
ds = spark.read.parquet("users.parquet") # treated as DataFrame in PySpark
Quick comparison
| | RDD | DataFrame | Dataset |
|---|---|---|---|
| Schema | No | Yes | Yes |
| Catalyst optimization | No | Yes | Yes |
| Compile-time safety | Yes (Scala/Java) | No | Yes (Scala/Java) |
| Python support | Yes | Yes | No (use DataFrame) |