How does the categorical dtype reduce memory and speed up operations in pandas?
Categorical dtype stores a column's unique values once in a lookup table and represents each row as a small integer code, replacing repeated Python string objects. This cuts memory by an order of magnitude for low-cardinality string columns and accelerates GroupBy, sorting, and equality comparisons because pandas operates on integer codes rather than string comparisons.
How to think about it
What is really being asked
The interviewer is probing whether you think about memory layout and not just correctness. Every Python string is its own heap object. In a DataFrame with a million rows and only five unique cities, you have a million separate string objects — all storing the same five values over and over. Categorical dtype fixes that by storing each unique value once and replacing the column with a tiny integer code array.
The mechanics — categories and codes
Under the hood, a categorical column has two parts:
- A categories array: the unique values, stored once (e.g., ["Berlin", "London", "Tokyo"])
- A codes array: one integer per row pointing into the categories array (e.g., int8 values 0, 1, 2)
That is why 1 million rows with 5 unique cities collapses from ~65 MB (one Python string object per row) to ~1 MB (one int8 per row plus a 5-entry lookup table).
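To make the two parts concrete, here is a minimal sketch that inspects them through the .cat accessor. The six-row city Series is made-up illustration data, not taken from a real dataset.

```python
import pandas as pd

# Illustrative data: six rows, three unique cities.
cities = pd.Series(["Berlin", "Tokyo", "London", "Berlin", "Tokyo", "Berlin"])
cat = cities.astype("category")

# The unique values, stored once. When categories are inferred from
# strings, pandas sorts them lexically.
categories = list(cat.cat.categories)

# One small integer per row, indexing into the categories array.
codes = cat.cat.codes.tolist()
codes_dtype = cat.cat.codes.dtype

print(categories)   # ['Berlin', 'London', 'Tokyo']
print(codes)        # [0, 2, 1, 0, 2, 0]
print(codes_dtype)  # int8
```

pandas picks the smallest signed integer type that can index every category, so the codes stay at int8 as long as there are no more than 127 categories.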
When it also speeds things up
Because GroupBy and sorting operate on the integer codes rather than on string comparisons, they run measurably faster on categorical columns — especially when the cardinality is low and the DataFrame is large.
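A rough way to check this on your own machine is to time the same aggregation with string keys and with categorical keys. The sketch below uses synthetic data; the column names (city, city_cat, sales) and the timed helper are hypothetical, and the timings will vary with hardware and pandas version, so treat the printed numbers as indicative only.

```python
import time

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1_000_000

# Synthetic frame: five distinct cities repeated across a million rows.
df = pd.DataFrame({
    "city": rng.choice(["Berlin", "London", "Tokyo", "Paris", "Madrid"], size=n),
    "sales": rng.random(n),
})
df["city_cat"] = df["city"].astype("category")


def timed(fn):
    """Run fn once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


by_str, t_str = timed(lambda: df.groupby("city")["sales"].sum())
# observed=True restricts the output to categories that actually occur.
by_cat, t_cat = timed(lambda: df.groupby("city_cat", observed=True)["sales"].sum())

print(f"string keys: {t_str:.3f}s, categorical keys: {t_cat:.3f}s")

# Both paths should produce the same totals, in the same key order.
same_totals = np.allclose(by_str.to_numpy(), by_cat.to_numpy())
print("same totals:", same_totals)
```

A single run is noisy; for a fairer comparison, repeat each call several times (for example with timeit) and compare the best times.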
Ordered categories unlock comparison operators (<, >) in natural domain order, not alphabetically:
size_cat = pd.CategoricalDtype(["S", "M", "L", "XL"], ordered=True)
df["size"] = df["size"].astype(size_cat)
df.sort_values("size") # sorts S < M < L < XL, not alphabetically
df["size"] > "M" # True where size is L or XL
See the memory difference yourself
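The comparison can be reproduced with a short script along these lines. It is a sketch on synthetic data: the five city names and the random seed are arbitrary, and the exact byte counts depend on string lengths and on the Python and pandas versions in use.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
cities = ["Berlin", "London", "Tokyo", "Paris", "Madrid"]

# One million rows drawn from five values, held as Python string objects.
as_object = pd.Series(rng.choice(cities, size=1_000_000), dtype=object)
as_category = as_object.astype("category")

# deep=True counts the string objects themselves, not just the pointers.
obj_bytes = as_object.memory_usage(index=False, deep=True)
cat_bytes = as_category.memory_usage(index=False, deep=True)

print(f"object:   {obj_bytes / 1e6:.1f} MB")
print(f"category: {cat_bytes / 1e6:.1f} MB")
print(f"ratio:    {obj_bytes / cat_bytes:.0f}x")
```

Without deep=True, pandas reports only the 8-byte pointers for an object column, which understates the real footprint considerably.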
When NOT to use categorical
- High-cardinality columns (UUIDs, emails, free text): the lookup table itself becomes large and memory savings disappear.
- Frequently changing columns: assigning a value that is not already a category raises an error (TypeError in recent pandas versions, ValueError in older ones) unless you first extend the categories with cat.add_categories().
# Extending the category list before inserting a new value
df["city"] = df["city"].cat.add_categories(["Seoul"])
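The failure mode and the fix can be seen together in a small sketch. The three-city Series is illustrative, and the except clause catches both TypeError and ValueError on the assumption that the exception type differs between pandas versions.

```python
import pandas as pd

s = pd.Series(["Berlin", "London", "Tokyo"], dtype="category")

# Assigning a value that is not a known category is rejected.
try:
    s.iloc[0] = "Seoul"
    assignment_failed = False
except (TypeError, ValueError) as exc:
    assignment_failed = True
    print(f"{type(exc).__name__}: {exc}")

# Register the new category first, then the assignment goes through.
s = s.cat.add_categories(["Seoul"])
s.iloc[0] = "Seoul"

print(s.tolist())               # ['Seoul', 'London', 'Tokyo']
print(list(s.cat.categories))   # ['Berlin', 'London', 'Tokyo', 'Seoul']
```

Note that "Berlin" stays in the category list even though no row uses it any more; cat.remove_unused_categories() drops such entries if that matters.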
Rule of thumb: if a string column has fewer than roughly 5% unique values relative to its length, categorical will save memory. For a 1M-row column with 4 unique values, the saving is on the order of 30x or more; the exact factor depends on how long the strings are.
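One way to apply the rule of thumb is to measure the unique ratio and the actual saving before converting. The helper below, categorical_saving, is a hypothetical function written for this illustration, not part of pandas, and both test columns are synthetic.

```python
import numpy as np
import pandas as pd


def categorical_saving(col: pd.Series) -> tuple[float, float]:
    """Return (unique_ratio, memory_ratio) for a column.

    memory_ratio is bytes before conversion divided by bytes after, so
    values well above 1 mean categorical helps.
    """
    unique_ratio = col.nunique() / len(col)
    before = col.memory_usage(index=False, deep=True)
    after = col.astype("category").memory_usage(index=False, deep=True)
    return unique_ratio, before / after


n = 200_000
rng = np.random.default_rng(1)

# Low cardinality: four distinct values.
low = pd.Series(rng.choice(["N", "S", "E", "W"], size=n), dtype=object)
# High cardinality: every row is unique, similar to an ID column.
high = pd.Series([f"user-{i:06d}" for i in range(n)], dtype=object)

low_ratio, low_saving = categorical_saving(low)
high_ratio, high_saving = categorical_saving(high)

print(f"low cardinality:  {low_ratio:.4%} unique, {low_saving:.1f}x smaller")
print(f"high cardinality: {high_ratio:.4%} unique, {high_saving:.2f}x smaller")
```

For the all-unique column the ratio typically lands near 1x or below, because the categories array ends up holding every string anyway and the codes are pure overhead.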