datarekha
Pandas & Data Wrangling · Medium · Asked at Netflix, Lyft, DoorDash

How does the categorical dtype reduce memory and speed up operations in pandas?

The short answer

Categorical dtype stores a column's unique values once in a lookup table and represents each row as a small integer code, replacing repeated Python string objects. For low-cardinality string columns this can cut memory by an order of magnitude or more, and it accelerates GroupBy, sorting, and equality comparisons because pandas operates on integer codes rather than string comparisons.

How to think about it

What is really being asked

The interviewer is probing whether you think about memory layout and not just correctness. Every Python string is its own heap object, and an object-dtype column stores one reference per row. In a DataFrame with a million rows and only five unique cities, you can end up with as many as a million separate string objects, all storing the same five values over and over. Categorical dtype fixes that by storing each unique value once and replacing the column with a compact integer code array.

The mechanics — categories and codes

Under the hood, a categorical column has two parts:

  • A categories array — the unique values, stored once (e.g., ["Berlin", "London", "Tokyo"])
  • A codes array — one integer per row pointing into the categories array (e.g., int8 values 0, 1, 2)

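That is why a 1-million-row column with 5 unique cities shrinks from roughly 65 MB (one Python string object plus an 8-byte pointer per row, as reported by memory_usage(deep=True)) to roughly 1 MB (one int8 per row plus a 5-entry lookup table). The exact figures depend on string length and pandas version.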
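The two parts can be inspected directly through the .cat accessor. The following is a minimal sketch on a six-row sample Series (the data is made up for illustration); it assumes the default behaviour where inferred categories are sorted.

```python
import pandas as pd

# Hypothetical sample data: six rows, three unique cities.
cities = pd.Series(["Tokyo", "Berlin", "London", "Berlin", "Tokyo", "Berlin"])
cat = cities.astype("category")

# The lookup table: each unique value stored once (sorted when inferred).
print(list(cat.cat.categories))   # ['Berlin', 'London', 'Tokyo']

# One small integer per row, indexing into the lookup table.
print(cat.cat.codes.tolist())     # [2, 0, 1, 0, 2, 0]
print(cat.cat.codes.dtype)        # int8
```

pandas picks the smallest signed integer type that can index every category, so the codes stay int8 as long as the column has at most 127 distinct values.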

When it also speeds things up

Because GroupBy and sorting operate on the integer codes rather than on string comparisons, they run measurably faster on categorical columns — especially when the cardinality is low and the DataFrame is large.
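One way to check this on your own machine is a small timing sketch like the one below. The data is synthetic, the column names (city, city_cat, fare) are assumptions for illustration, and the measured gap will vary with hardware and pandas version, so no specific speedup is claimed here.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic data (an assumption for illustration): 1M rows, 5 unique cities.
rng = np.random.default_rng(0)
n = 1_000_000
df = pd.DataFrame({
    "city": rng.choice(["Berlin", "London", "Tokyo", "Paris", "Lima"], size=n),
    "fare": rng.random(n),
})
df["city_cat"] = df["city"].astype("category")


def by_string():
    # Group on the plain string column.
    return df.groupby("city")["fare"].mean()


def by_category():
    # observed=True keeps only categories that actually appear in the data.
    return df.groupby("city_cat", observed=True)["fare"].mean()


# Both paths should produce the same per-city means, in the same key order.
assert np.allclose(by_string().to_numpy(), by_category().to_numpy())

print("string key:     ", timeit.timeit(by_string, number=5))
print("categorical key:", timeit.timeit(by_category, number=5))
```

Passing observed explicitly avoids depending on its default, which has differed between pandas versions.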

Ordered categories unlock comparison operators (<, >) in natural domain order, not alphabetically:

import pandas as pd

df = pd.DataFrame({"size": ["L", "S", "XL", "M"]})  # hypothetical sample data
size_cat = pd.CategoricalDtype(["S", "M", "L", "XL"], ordered=True)
df["size"] = df["size"].astype(size_cat)
df.sort_values("size")   # sorts S < M < L < XL, not alphabetically
df["size"] > "M"         # True where size is L or XL

See the memory difference yourself
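A minimal sketch for reproducing the comparison, assuming synthetic data and an explicit object dtype so that each row references a Python string. The second half contrasts a high-cardinality column (made-up user IDs), where the conversion should show little or no saving. Exact numbers will differ by platform and pandas version.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1_000_000
mb = 1024 ** 2

# Low cardinality: 1M rows drawn from 5 cities, stored as Python strings.
cities = ["Berlin", "London", "Tokyo", "Paris", "Lima"]
low_obj = pd.Series(rng.choice(cities, size=n), dtype=object)
low_cat = low_obj.astype("category")

# High cardinality: every row is a distinct (hypothetical) user ID.
high_obj = pd.Series([f"user-{i}" for i in range(n)], dtype=object)
high_cat = high_obj.astype("category")

# deep=True asks pandas to count the string objects, not just the pointers.
for name, s in [
    ("low-cardinality object", low_obj),
    ("low-cardinality category", low_cat),
    ("high-cardinality object", high_obj),
    ("high-cardinality category", high_cat),
]:
    print(f"{name:28s} {s.memory_usage(deep=True) / mb:8.1f} MB")
```

In the high-cardinality case the lookup table has to hold every string anyway, and the codes array is added on top of it.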

When NOT to use categorical

  • High-cardinality columns (UUIDs, emails, free text): the lookup table itself becomes large and memory savings disappear.
  • Frequently changing columns: assigning a value that is not already a category raises an error (TypeError in recent pandas versions, ValueError in some older ones), so new values must first be registered with cat.add_categories().

# Extending the category list before inserting a new value
df["city"] = df["city"].cat.add_categories(["Seoul"])
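The failure and the fix can be seen side by side in a short sketch. The three-row DataFrame is hypothetical, and both exception types are caught because the one raised may differ between pandas versions.

```python
import pandas as pd

# Hypothetical sample data with a categorical city column.
df = pd.DataFrame({"city": pd.Categorical(["Berlin", "London", "Tokyo"])})

# Assigning a value outside the category list is rejected.
try:
    df.loc[0, "city"] = "Seoul"
except (TypeError, ValueError) as exc:
    print(f"rejected: {type(exc).__name__}")

# Register the new category first, then the same assignment succeeds.
df["city"] = df["city"].cat.add_categories(["Seoul"])
df.loc[0, "city"] = "Seoul"
print(df["city"].tolist())   # ['Seoul', 'London', 'Tokyo']
```

If new values arrive often, it may be simpler to keep the column as plain strings during ingestion and convert to category once the set of values has settled.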

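Rule of thumb: if a string column has fewer than roughly 5% unique values relative to its length, categorical will usually save memory. For a 1M-row column with 4 unique values, the saving is typically 30x or more; the exact factor depends on string length and on how your pandas version stores strings.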
