Pandas & Data Wrangling | Medium | Asked at Databricks, Snowflake, and Amazon

How do you reduce memory usage in a pandas DataFrame using dtypes, category encoding, and downcasting?

For Data Scientist, ML Engineer, Data Engineer

The short answer

The biggest wins come from converting low-cardinality string columns to category dtype (often 10x smaller) and downcasting int64 and float64 to the smallest type that fits the data range. Beyond that, sparse arrays help with mostly-missing columns, and chunked reads handle data that doesn't need to live fully in memory.

How to think about it

Memory is usually the first wall you hit scaling pandas past a few million rows, and the interviewer wants to see a process, not a reflex. “Convert everything to float32” is the wrong answer. The right one starts with measurement: run df.memory_usage(deep=True) to see which columns actually cost you. The deep=True matters for object columns — it inspects the real Python string objects instead of just counting pointers, and object columns are almost always the heaviest thing in the frame.
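To make the measurement step concrete, here is a minimal sketch comparing the shallow and deep numbers side by side. The frame and its column names are made up for illustration, and the string column is forced to object dtype so the comparison holds even on pandas versions that default to a dedicated string dtype.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1_000

# Illustrative frame; "region" is forced to object dtype on purpose.
demo = pd.DataFrame({
    "region": pd.Series(rng.choice(["East", "West", "North", "South"], n), dtype=object),
    "quantity": rng.integers(1, 127, n).astype("int64"),
})

shallow = demo.memory_usage(deep=False)  # object columns: pointer size only
deep = demo.memory_usage(deep=True)      # object columns: pointers + the string objects

# Put both views next to each other, heaviest column first.
report = pd.DataFrame({"shallow_bytes": shallow, "deep_bytes": deep})
print(report.sort_values("deep_bytes", ascending=False))
```

The numeric column reports the same size either way; only the object column grows under deep=True, which is the signal for where to spend effort first.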

Once you know where the weight is, three moves do most of the work. Low-cardinality strings, such as a region column with four distinct values repeated thousands of times, become category, which stores each label once and keeps tiny integer codes per row. Wide integers shrink to the narrowest type that still holds the range: a quantity that always stays between -128 and 127 fits in int8 instead of int64, eight times smaller. And float64 feature columns drop to float32, halving their footprint at the cost of precision (roughly 7 significant decimal digits instead of 15 to 16), so check that the loss is acceptable for the column.
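Rather than picking the narrow type by eye, you can let pandas choose it. The sketch below uses pd.to_numeric with its downcast argument; the three series and their values are invented for illustration, and the np.iinfo lines are only there to show the ranges being tested against.

```python
import numpy as np
import pandas as pd

# Illustrative columns (values are made up).
quantity = pd.Series(np.array([1, 50, 126], dtype="int64"))
order_id = pd.Series(np.array([1, 40_000, 999_999], dtype="int64"))
price = pd.Series(np.array([1.5, 250.25, 499.99], dtype="float64"))

# downcast="integer" picks the smallest signed integer type that holds every value.
quantity_small = pd.to_numeric(quantity, downcast="integer")
order_id_small = pd.to_numeric(order_id, downcast="integer")

# downcast="float" goes no lower than float32.
price_small = pd.to_numeric(price, downcast="float")

for name, col in [("quantity", quantity_small),
                  ("order_id", order_id_small),
                  ("price", price_small)]:
    print(name, col.dtype)

# The ranges behind the integer decisions.
print(np.iinfo("int8").min, np.iinfo("int8").max)
print(np.iinfo("int16").min, np.iinfo("int16").max)
```

One design note: the downcast is decided from the values present right now. If later data can exceed the observed range, setting the dtype explicitly from the column's known business limits is the safer choice.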

Before and after, measured

Here is the whole pipeline on an 8,000-row frame — measure, then apply all three moves, then measure again:

import pandas as pd
import numpy as np

rng = np.random.default_rng(0)
n = 8_000

df = pd.DataFrame({
    "transaction_id": rng.integers(1, 1_000_000, n).astype("int64"),
    "region":         rng.choice(["East", "West", "North", "South"], n),
    "status":         rng.choice(["active", "pending", "closed"], n),
    "quantity":       rng.integers(1, 127, n).astype("int64"),
    "price":          rng.uniform(1.0, 500.0, n).astype("float64"),
    "score":          rng.random(n).astype("float64"),
})

# memory_usage returns bytes; dividing by 1024 gives KB
before_kb = df.memory_usage(deep=True).sum() / 1024
print(f"BEFORE: {before_kb:.1f} KB")
print(df.dtypes)
print()

df["region"] = df["region"].astype("category")               # 4 unique values
df["status"] = df["status"].astype("category")               # 3 unique values
df["quantity"] = df["quantity"].astype("int8")               # int8 holds -128..127
df["transaction_id"] = df["transaction_id"].astype("int32")  # up to ~1,000,000: too big for int16
df["price"] = df["price"].astype("float32")
df["score"] = df["score"].astype("float32")

after_kb = df.memory_usage(deep=True).sum() / 1024
print(f"AFTER:  {after_kb:.1f} KB")
print(df.dtypes)
print(f"\nReduction: {(1 - after_kb/before_kb)*100:.0f}%")

BEFORE: 1225.4 KB
transaction_id      int64
region             object
status             object
quantity            int64
price             float64
score             float64
dtype: object

AFTER:  118.0 KB
transaction_id       int32
region            category
status            category
quantity              int8
price              float32
score              float32
dtype: object

Reduction: 90%

A 90% cut, and the dtype listing shows exactly where it came from: the two object columns turned category (their repeated four- and three-value strings collapse to integer codes), quantity went int64 → int8, transaction_id went int64 → int32, and the floats halved. The two object columns alone accounted for most of the original 1.2 MB, which is why detecting the heavy columns first beats blindly downcasting numerics.
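The "low-cardinality" condition is doing real work in that result. A sketch of one way to check it before converting is below; category_candidates and its 0.5 threshold are hypothetical choices for illustration, not a pandas API, and the frame is synthetic.

```python
import pandas as pd

def category_candidates(frame, max_unique_ratio=0.5):
    """Hypothetical helper: object columns whose share of distinct values
    is at or below max_unique_ratio (the threshold is an assumption)."""
    candidates = []
    for name in frame.select_dtypes(include="object").columns:
        ratio = frame[name].nunique(dropna=True) / max(len(frame), 1)
        if ratio <= max_unique_ratio:
            candidates.append(name)
    return candidates

# Synthetic data: one repetitive column, one where every value is unique.
frame = pd.DataFrame({
    "region": pd.Series(["East", "West", "North", "South"] * 250, dtype=object),
    "user_token": pd.Series([f"tok-{i:06d}" for i in range(1_000)], dtype=object),
})

picked = category_candidates(frame)
print(picked)

# Compare each column's deep size as object versus as category.
sizes = {}
for name in frame.columns:
    as_object = frame[name].memory_usage(deep=True)
    as_category = frame[name].astype("category").memory_usage(deep=True)
    sizes[name] = (as_object, as_category)
    print(name, as_object, as_category)
```

The repetitive column shrinks sharply. The all-unique column should come out no smaller as category, because every label still has to be stored once and the per-row codes are added on top.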

Column type	Technique	Typical saving
String with few unique values	`astype("category")`	5–20x
int64 in range -128 to 127	downcast to int8	8x
int64 in range -32,768 to 32,767	downcast to int16	4x
float64 in ML features	downcast to float32	2x
Mostly-missing numeric	`pd.arrays.SparseArray`	proportional to sparsity
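The last row of the table is the least familiar one, so here is a small sketch of it. The column is synthetic (1% of values present, the rest NaN), and it assumes NaN is the value you want treated as "absent", which is the default fill value for float data.

```python
import numpy as np
import pandas as pd

n = 10_000
values = np.full(n, np.nan)
values[::100] = 1.0  # only every 100th row has a value

dense = pd.Series(values)
# For float data the default fill_value is NaN, so only the 100 real values are stored.
sparse = pd.Series(pd.arrays.SparseArray(values))

dense_bytes = dense.memory_usage(index=False)
sparse_bytes = sparse.memory_usage(index=False)
density = sparse.sparse.density  # fraction of rows actually stored

print(f"dense:  {dense_bytes} bytes")
print(f"sparse: {sparse_bytes} bytes")
print(f"density: {density:.2%}")
```

The saving tracks the density: the sparse version stores the present values plus their positions, so a column that is half full gains little.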

Better still is never allocating the wide version at all. Pass dtype={"region": "category", "status": "category", "quantity": "int8"} to read_csv and pandas reads straight into the small types, so the int64/object allocation never happens. One more gotcha: NumPy-backed int64 can’t hold NaN, so a missing value silently upgrades the whole column to float64. Reach for the capital-I nullable type, df["units"].astype("Int32"), to keep an integer column integer. And when even optimized data won’t fit, chunksize in read_csv streams it in pieces, or an engine such as Polars (in its lazy, streaming mode) or DuckDB can process data larger than memory.
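The read-time options fit together in a few lines. This is a sketch on a tiny in-memory CSV (the data is invented, and io.StringIO stands in for a real file path) showing dtypes set at read time, the nullable Int32 column, and a chunked pass.

```python
import io
import pandas as pd

# Made-up CSV; note the missing "units" value on the second data row.
csv_text = (
    "region,status,quantity,units\n"
    "East,active,3,10\n"
    "West,closed,7,\n"
    "East,pending,1,25\n"
    "North,active,9,40\n"
)

dtypes = {
    "region": "category",
    "status": "category",
    "quantity": "int8",
    "units": "Int32",  # capital-I nullable integer: missing becomes <NA>
}

# 1) Read straight into the small types.
small = pd.read_csv(io.StringIO(csv_text), dtype=dtypes)
print(small.dtypes)

# 2) Without a dtype, the missing value pushes "units" to float64.
default = pd.read_csv(io.StringIO(csv_text))
print(default["units"].dtype)

# 3) Stream the file two rows at a time and aggregate as you go.
total_quantity = 0
for chunk in pd.read_csv(io.StringIO(csv_text), dtype=dtypes, chunksize=2):
    total_quantity += int(chunk["quantity"].sum())
print(total_quantity)
```

The chunked loop only ever holds one chunk in memory, so it suits aggregations that can be accumulated piece by piece.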
