How does the category dtype work in pandas and when should you use it?
CategoricalDtype stores a column as integer codes plus a small lookup table of unique values, dramatically reducing memory for low-cardinality string columns. It also enforces a fixed set of valid values, enables natural ordering, and speeds up groupby and sort operations.
How to think about it
What this question is really about
The interviewer is checking whether you think about data representation, not just computation. When you load a CSV with a “status” column that repeats “pending”, “shipped”, “delivered” across a million rows, pandas stores each of those as a separate Python string object. That is wasteful and slow. CategoricalDtype is pandas’ answer: store the unique values once and represent every row as a tiny integer pointing at the lookup table.
The two things stored inside a categorical column
A categorical column is really two arrays bundled together:
categories: the unique values — stored once, typically a short listcodes: one integer per row (oftenint8) pointing into the categories list
So a million-row “status” column with four possible values becomes: a 4-element categories array plus a million int8 codes. The original object column is roughly 64 bytes per row (Python string overhead); the codes are 1 byte per row. That is a ~30x reduction.
The three benefits beyond memory
- Faster GroupBy and sort — comparisons happen on integer codes, not string comparisons.
- Enforced valid values — assigning a value not in the categories list raises
ValueErrorimmediately, catching upstream data quality issues early. - Ordered categories — you can declare
S < M < L < XLand sorting and comparisons respect that order, not the alphabetical one.
Playground — see the memory and ordering in action
Practical rule of thumb
Use category when a string column has fewer than roughly 5–10% unique values relative to its length. At read time, you can skip the double allocation entirely:
df = pd.read_csv("data.csv", dtype={"status": "category", "region": "category"})
For ordered categories — like survey responses (Strongly Disagree → Strongly Agree), T-shirt sizes, or severity levels — the ordered dtype also unlocks min(), max(), and > comparisons that follow your declared order rather than alphabetical order.