datarekha
Pandas & Data Wrangling | Medium | Asked at Databricks, Snowflake, Amazon

How do you reduce memory usage in a pandas DataFrame using dtypes, category encoding, and downcasting?

The short answer

The biggest wins come from converting low-cardinality string columns to category dtype (often 10x smaller), downcasting int64 and float64 to the smallest type that fits the data range, using sparse arrays for mostly-missing columns, and reading in chunks when the data doesn't need to live fully in memory.

How to think about it

Why this question comes up in interviews

Memory is often the first wall you hit when scaling a pandas workflow beyond a few million rows. The interviewer wants to know you have a systematic diagnostic process — not just “convert everything to float32” — and that you understand the trade-offs before applying any optimization.

The diagnostic process

The first step is always measurement. Before changing anything, run df.memory_usage(deep=True) to see which columns are eating the most memory. deep=True matters for object-dtype columns: without it, pandas counts only the 8-byte pointers stored in the column, while deep=True also adds the size of the Python string objects those pointers refer to.
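As a minimal sketch of that measurement step, the snippet below builds a small made-up DataFrame (the column names and sizes are assumptions for illustration, not from any real dataset) and compares the shallow and deep reports side by side. The string column is created with object dtype explicitly so the difference is visible.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, generated only for illustration.
rng = np.random.default_rng(0)
n = 100_000
df = pd.DataFrame({
    # Force object dtype so the column holds pointers to Python strings.
    "region": pd.Series(rng.choice(["north", "south", "east", "west"], size=n), dtype=object),
    "quantity": rng.integers(0, 100, size=n),   # int64 by default
    "price": rng.random(n) * 100,               # float64
})

shallow = df.memory_usage(deep=False)  # pointers only for object columns
deep = df.memory_usage(deep=True)      # also counts the string objects

# One table, largest consumers first.
report = (
    pd.DataFrame({"shallow_bytes": shallow, "deep_bytes": deep})
    .sort_values("deep_bytes", ascending=False)
)
print(report)
```

The numeric columns report the same size either way; only the object column grows under deep=True, which is usually the signal for where to start optimizing.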


The priority order

Column type                            Technique                 Typical saving
String with few unique values          astype("category")        5–20x
int64 in range -128 to 127             downcast to int8          8x
int64 in range -32,768 to 32,767       downcast to int16         4x
float64 in ML features                 downcast to float32       2x
Mostly-missing numeric                 pd.arrays.SparseArray     proportional to sparsity
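The techniques in the table can be sketched on one DataFrame as follows. The data is synthetic and the column names are hypothetical; the float column is converted with an explicit astype("float32"), which is one way to do the downcast and assumes the reduced precision is acceptable.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, generated only for illustration.
rng = np.random.default_rng(0)
n = 100_000

# A column that is about 95% missing.
discount = np.full(n, np.nan)
filled = rng.choice(n, size=n // 20, replace=False)
discount[filled] = rng.random(n // 20)

df = pd.DataFrame({
    "status": pd.Series(rng.choice(["open", "closed", "pending"], size=n), dtype=object),
    "quantity": rng.integers(0, 100, size=n),   # int64, values fit in int8
    "score": rng.random(n),                     # float64
    "discount": discount,                       # float64, mostly NaN
})
before = df.memory_usage(deep=True).sum()

opt = df.copy()
opt["status"] = opt["status"].astype("category")                       # few unique strings
opt["quantity"] = pd.to_numeric(opt["quantity"], downcast="integer")   # smallest int that fits
opt["score"] = opt["score"].astype("float32")                          # half the bytes, less precision
opt["discount"] = pd.arrays.SparseArray(opt["discount"].to_numpy())    # store only non-NaN values

after = opt.memory_usage(deep=True).sum()
print(opt.dtypes)
print(f"{before / after:.1f}x smaller")
```

pd.to_numeric with downcast="integer" inspects the actual values and picks the narrowest signed integer type, so it is safer than hard-coding int8 when the range is not known in advance.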

Applying dtype hints at load time

Even better is avoiding the large allocation in the first place:

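import pandas as pd

# Declare compact dtypes up front so the columns are parsed directly into them.
df = pd.read_csv(
    "transactions.csv",
    dtype={
        "region": "category",    # low-cardinality strings
        "status": "category",
        "quantity": "int8",      # assumes values fit in -128..127 with no missing entries
    },
)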

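With explicit dtypes, pandas parses these columns straight into the compact types, so the full int64/object version is never materialized. A plain int8 hint fails on a column with missing values, so confirm the value range and completeness of the column before choosing the width.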
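To see the effect without a real file, here is a small sketch that loads the same CSV text twice, once with default inference and once with dtype hints. The CSV content is made up for illustration and io.StringIO stands in for a file on disk.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
rows = [("north", "paid", 3), ("south", "refunded", 12),
        ("north", "paid", 7), ("east", "pending", 1)] * 1000
csv_text = "region,status,quantity\n" + "\n".join(f"{r},{s},{q}" for r, s, q in rows)

# Default inference: string columns and int64.
default_df = pd.read_csv(io.StringIO(csv_text))

# Same data with dtype hints applied while parsing.
hinted_df = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"region": "category", "status": "category", "quantity": "int8"},
)

default_bytes = default_df.memory_usage(deep=True).sum()
hinted_bytes = hinted_df.memory_usage(deep=True).sum()
print(hinted_df.dtypes)
print(default_bytes, hinted_bytes)
```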

Nullable integers for columns with NaN

Regular int64 can't hold NaN, so pandas upcasts integer columns with missing values to float64. Use the capital-I nullable types (Int8, Int16, Int32, Int64) to avoid this:

df["units"] = df["units"].astype("Int32")   # capital-I = nullable integer
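A short sketch of the upcast and the fix, using a made-up four-value column. The nullable type stores a separate boolean mask for missing entries, so Int32 costs roughly 5 bytes per value rather than the 8 bytes of float64.

```python
import pandas as pd

# Hypothetical column: integers with one missing value.
units = pd.Series([10, 250, None, 42])
print(units.dtype)             # the missing value forces float64

nullable = units.astype("Int32")   # capital-I = nullable integer
print(nullable.dtype)
print(nullable)                # the gap is shown as <NA>, the rest stay integers

missing_count = int(nullable.isna().sum())
```

The conversion from float64 assumes every non-missing value is a whole number; a value such as 2.5 would make astype("Int32") raise rather than silently truncate.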

When the data doesn’t fit at all

If the data still won't fit in RAM after optimization, use chunksize in read_csv to process it in pieces, or switch to Polars or DuckDB, both of which can process datasets larger than memory.
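A minimal sketch of the chunked approach, assuming the goal is an aggregate that can be built up piece by piece. The CSV text is synthetic, io.StringIO stands in for a large file, and the chunk size of 2,000 rows is an arbitrary choice for the example.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file too large to load at once.
csv_text = "region,quantity\n" + "\n".join(
    f"{'north' if i % 2 else 'south'},{i % 50}" for i in range(10_000)
)

totals = {}
# With chunksize set, read_csv yields DataFrames of up to 2,000 rows each.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2_000):
    partial = chunk.groupby("region")["quantity"].sum()
    # Merge this chunk's partial sums into the running totals.
    for region, value in partial.items():
        totals[region] = totals.get(region, 0) + int(value)

print(totals)
```

Only one chunk is in memory at a time, so peak usage depends on the chunk size rather than the file size. This works for sums, counts, and similar aggregates that combine across chunks; operations that need the whole dataset at once, such as a global sort, need a different strategy.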
