Pandas & Data Wrangling | Medium | Asked at Netflix, Lyft, DoorDash

How does the categorical dtype reduce memory and speed up operations in pandas?

For: Data Analyst, Data Scientist, Data Engineer

The short answer

The categorical dtype stores a column's unique values once in a lookup table and represents each row as a small integer code, replacing repeated Python string objects. For low-cardinality string columns this often cuts memory by an order of magnitude or more, and it can speed up GroupBy, sorting, and equality comparisons because pandas works on the integer codes rather than comparing strings.

How to think about it

The interviewer is probing whether you think about memory layout, not just correctness. In an object-dtype column every Python string is its own heap object, so a million-row column with five unique cities can hold a million separate string objects, all repeating the same five values. The categorical dtype fixes that by storing each unique value once and replacing the column with a compact array of integer codes.

Under the hood, a categorical column has two parts: a categories array (the unique values, stored once, e.g. ["Berlin", "London", "Tokyo"]) and a codes array (one small integer per row that indexes into the categories, with -1 marking a missing value). Because GroupBy, sorting, and equality checks then work on the integer codes rather than comparing strings, they also tend to run faster on low-cardinality columns.
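The two-part layout described above can be sketched on a tiny column. The sample values are hypothetical, and the code-based equality check only mirrors the idea of comparing integers instead of strings; it is not a claim about pandas' internal code path.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, chosen only for illustration.
cities = pd.Series(["London", "Tokyo", "London", "Berlin", "Tokyo", "London"])
cat = cities.astype("category")

categories = cat.cat.categories      # unique values, stored once (sorted by default)
codes = cat.cat.codes.to_numpy()     # one small integer per row

# Rebuilding the original values is an index lookup into the categories.
rebuilt = categories.take(codes)

# An equality test can be answered on the codes alone:
# find the code for "London" once, then compare integers row by row.
london_code = categories.get_loc("London")
mask_from_codes = codes == london_code
mask_from_pandas = (cat == "London").to_numpy()

print("categories:", list(categories))
print("codes:", codes.tolist(), codes.dtype)
print("masks agree:", bool((mask_from_codes == mask_from_pandas).all()))
```

With only three categories the codes fit in a single byte each, which is where the per-row saving comes from.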

A worked example — the memory drop

Two low-cardinality string columns over 50,000 rows, measured before and after conversion:

import pandas as pd
import numpy as np

# Fixed seed so the random columns are reproducible.
np.random.seed(42)
n = 50_000
df = pd.DataFrame({
    "city":   np.random.choice(["New York", "London", "Tokyo", "Berlin", "Paris"], n),
    "status": np.random.choice(["active", "inactive", "pending"], n),
})

# deep=True counts the string objects themselves, not just the pointers to them.
before = df.memory_usage(deep=True)
print("Before (object dtype):")
print(before)
print("Total:", before.sum() // 1000, "KB")

# Convert both low-cardinality string columns to the categorical dtype.
df["city"]   = df["city"].astype("category")
df["status"] = df["status"].astype("category")

after = df.memory_usage(deep=True)
print("\nAfter (category dtype):")
print(after)
print("Total:", after.sum() // 1000, "KB")

Before (object dtype):
Index         128
city      3149946
status    3199920
dtype: int64
Total: 6349 KB

After (category dtype):
Index       128
city      50487
status    50300
dtype: int64
Total: 100 KB

From 6349 KB to 100 KB — about a 63× reduction. The city column alone fell from ~3.1 MB of string objects to ~50 KB: one int8 code per row plus a five-entry lookup table. Peeking at the internals confirms the design, and an ordered dtype unlocks natural-order sorting:

print("Categories:", df["city"].cat.categories.tolist())
print("First 5 codes:", df["city"].cat.codes[:5].tolist())

size_type = pd.CategoricalDtype(["XS", "S", "M", "L", "XL"], ordered=True)
df2 = pd.DataFrame({"size": pd.Categorical(["M", "XL", "S", "L", "XS"], dtype=size_type)})
print(df2.sort_values("size"))

Categories: ['Berlin', 'London', 'New York', 'Paris', 'Tokyo']
First 5 codes: [0, 3, 4, 3, 3]
  size
4   XS
2    S
0    M
3    L
1   XL

The codes [0, 3, 4, 3, 3] are just indices into the categories list (0 → Berlin, 3 → Paris, 4 → Tokyo). The ordered dtype sorts XS < S < M < L < XL by domain order rather than alphabetically, which a plain string column cannot do without a custom sort key. Rule of thumb: if a string column has under ~5% unique values relative to its length, categorical is very likely to save memory; high-cardinality columns (UUIDs, emails) don't benefit, because the lookup table grows as large as the original data and the codes need a wider integer type.
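The rule of thumb can be turned into a small guard. This is a minimal sketch: `maybe_categorize`, its `max_unique_ratio` parameter, the `nbytes` helper, and the sample columns are all hypothetical names and data introduced for illustration, and the 5% threshold is simply the figure quoted above.

```python
import pandas as pd

def maybe_categorize(s: pd.Series, max_unique_ratio: float = 0.05) -> pd.Series:
    """Hypothetical helper: convert to category only when few values are unique."""
    if len(s) == 0:
        return s
    ratio = s.nunique(dropna=True) / len(s)
    return s.astype("category") if ratio < max_unique_ratio else s

def nbytes(s: pd.Series) -> int:
    """Memory of the values only (index excluded), counting string contents."""
    return int(s.memory_usage(index=False, deep=True))

n = 10_000
low = pd.Series(["active", "inactive", "pending"] * (n // 3))     # 3 unique values
high = pd.Series([f"user-{i}@example.com" for i in range(n)])     # every value unique

for name, col in [("low-cardinality", low), ("high-cardinality", high)]:
    forced = col.astype("category")   # convert regardless, to compare sizes
    print(name, "| as-is:", nbytes(col), "bytes | as category:", nbytes(forced), "bytes")

print("low converted: ", isinstance(maybe_categorize(low).dtype, pd.CategoricalDtype))
print("high converted:", isinstance(maybe_categorize(high).dtype, pd.CategoricalDtype))
```

For the all-unique column the categorical version is larger, since it keeps every string in the categories and adds a code per row on top.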
