How do you reduce memory usage in a pandas DataFrame using dtypes, category encoding, and downcasting?
The biggest wins come from converting low-cardinality string columns to category dtype (often 10x smaller), downcasting int64 and float64 to the smallest type that fits the data range, and using sparse arrays or chunked reads for data that doesn't need to live fully in memory.
How to think about it
Why this question comes up in interviews
Memory is often the first wall you hit when scaling a pandas workflow beyond a few million rows. The interviewer wants to know you have a systematic diagnostic process — not just “convert everything to float32” — and that you understand the trade-offs before applying any optimization.
The diagnostic process
The first step is always measurement. Before changing anything, run df.memory_usage(deep=True) to see which columns are eating the most memory. deep=True matters for object-dtype columns because it actually inspects the Python string objects rather than just counting pointers.
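A minimal sketch of that measurement step, using a made-up toy frame (the column names and sizes are illustrative assumptions, not from any real dataset). The string column is created as object dtype explicitly so the shallow and deep numbers can be compared:

```python
import pandas as pd

# Hypothetical toy frame; names and sizes are for illustration only.
df = pd.DataFrame({
    "region": pd.Series(["north", "south", "east", "west"] * 25_000, dtype="object"),
    "quantity": [1, 2, 3, 4] * 25_000,
})

shallow = df.memory_usage(deep=False)  # object column: counts only the 8-byte pointers
deep = df.memory_usage(deep=True)      # object column: also counts the string objects

# Per-column footprint in MiB, largest first
report = (deep / 1024**2).sort_values(ascending=False)
print(report)
```

For the numeric column the two measurements agree; for the object column the deep figure is several times larger, which is why the shallow number can hide the real problem.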
The priority order
| Column type | Technique | Typical saving |
|---|---|---|
| String with few unique values | astype("category") | 5–20x |
| int64 in range -128 to 127 | downcast to int8 | 8x |
| int64 in range -32,768 to 32,767 | downcast to int16 | 4x |
| float64 in ML features | downcast to float32 | 2x |
| Mostly-missing numeric | pd.arrays.SparseArray | proportional to sparsity |
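The techniques in the table could be applied roughly as follows. This is a sketch on synthetic data: the column names and values are invented for illustration, and the actual savings depend on cardinality, value ranges, and sparsity in your own data.

```python
import numpy as np
import pandas as pd

# Synthetic data; all names and values are hypothetical.
df = pd.DataFrame({
    "status": pd.Series(["open", "closed", "pending"] * 20_000, dtype="object"),
    "quantity": np.arange(60_000, dtype="int64") % 100,  # values 0..99 fit in int8
    "price": (np.arange(60_000) % 200) * 0.5,            # float64
})
before = df.memory_usage(deep=True).sum()

df["status"] = df["status"].astype("category")
# Pick the smallest signed integer dtype that holds the values
df["quantity"] = pd.to_numeric(df["quantity"], downcast="integer")
# float64 -> float32; this trades precision for space
df["price"] = pd.to_numeric(df["price"], downcast="float")

after = df.memory_usage(deep=True).sum()
print(df.dtypes)
print(f"{before / after:.1f}x smaller")

# Mostly-missing numeric data: store only the values that are present
mostly_missing = np.full(60_000, np.nan)
mostly_missing[::100] = 1.0
dense = pd.Series(mostly_missing)
sparse = pd.Series(pd.arrays.SparseArray(mostly_missing))  # fill value defaults to NaN
print(dense.memory_usage(deep=True), sparse.memory_usage(deep=True))
```

Using pd.to_numeric with downcast lets pandas choose the integer width from the observed range, which avoids hard-coding int8 or int16 and silently overflowing if the data changes later.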
Applying dtype hints at load time
Even better is avoiding the large allocation in the first place:
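    import pandas as pd

    # Declare compact dtypes up front instead of converting after the load
    df = pd.read_csv(
        "transactions.csv",
        dtype={
            "region": "category",
            "status": "category",
            "quantity": "int8",  # assumes every value fits in -128..127
        },
    )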
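With dtype set, pandas converts each column to the requested type as it parses, so the finished DataFrame never exists in its full int64/object form. The parser still uses temporary buffers, so peak memory during the read can be somewhat higher than the final footprint.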
Nullable integers for columns with NaN
Regular int64 can't hold NaN, so pandas upcasts integer columns with missing values to float64. Use the capital-I nullable types (Int8, Int16, Int32, Int64) to avoid this:
    df["units"] = df["units"].astype("Int32")  # capital I = nullable integer
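A small sketch of the upcast and the fix, on an invented four-value series:

```python
import pandas as pd

s = pd.Series([10, 20, None, 40])
print(s.dtype)  # float64: the missing value forced a float upcast

units = s.astype("Int32")  # nullable integer; the gap becomes <NA>
print(units.dtype)         # Int32
print(units.isna().sum())  # 1
```

Nullable integer arrays store the values plus a one-byte-per-row mask, so Int32 costs about 5 bytes per row compared with 8 for float64.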
When the data doesn’t fit at all
If the data still won't fit in RAM after optimization, pass chunksize to read_csv to process it in pieces, or switch to Polars or DuckDB, both of which support out-of-core (larger-than-memory) processing.
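A sketch of the chunked approach, assuming the computation can be expressed as partial results that are combined at the end. An in-memory buffer stands in for a large file here, and the column names are hypothetical:

```python
import io
import pandas as pd

# Stand-in for a large CSV on disk; the columns are invented for illustration.
csv_data = io.StringIO(
    "region,quantity\n" + "".join(f"r{i % 3},{i % 10}\n" for i in range(10_000))
)

total_quantity = 0
rows = 0
# With chunksize set, read_csv yields DataFrames of at most 2,500 rows each
for chunk in pd.read_csv(csv_data, dtype={"region": "category"}, chunksize=2_500):
    total_quantity += int(chunk["quantity"].sum())
    rows += len(chunk)

print(rows, total_quantity)
```

Only one chunk is held in memory at a time, so the peak footprint is set by chunksize rather than by the file size. Operations that need the whole dataset at once, such as a global sort, do not decompose this way.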