Pandas & Data Wrangling | Medium | Asked at Databricks, Amazon, and Snowflake

How does the category dtype work in pandas and when should you use it?

For: Data Analyst, Data Scientist, Data Engineer, ML Engineer

The short answer

The category dtype (CategoricalDtype) stores a column as integer codes plus a small lookup table of unique values, dramatically reducing memory for low-cardinality string columns. It also enforces a fixed set of valid values, supports an explicit ordering when you ask for one, and can speed up groupby and sort operations.

How to think about it

The interviewer is checking whether you think about data representation, not just computation. Load a CSV whose “status” column repeats “pending”/“shipped”/“delivered” across a million rows and pandas, with the object dtype it has traditionally used for strings, stores each value as a separate Python string object, which is wasteful and slow. The category dtype stores the unique values once and represents every row as a tiny integer code pointing into that lookup table.

A categorical column is really two arrays: categories (the unique values, stored once) and codes (one small integer per row, often int8). Beyond memory at scale, that buys three things: groupby and sort run on integer codes rather than strings; assigning a value that is not in the category list raises immediately (catching bad data); and ordered categories let S < M < L < XL sort and compare by domain order, not alphabetical order.
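To make the two-array picture concrete, here is a minimal sketch that inspects both arrays and then tries an invalid assignment. The series contents and the value "refunded" are made up for illustration, and the exception type shown (TypeError) is what recent pandas versions raise as far as I know.

```python
import pandas as pd

# A small categorical series; categories are inferred from the data.
status = pd.Series(["pending", "shipped", "delivered", "pending"], dtype="category")

categories = status.cat.categories.tolist()  # unique values, stored once
codes = status.cat.codes                     # one small integer per row

print("categories:", categories)
print("codes:", codes.tolist(), "dtype:", codes.dtype)

# Assigning a value outside the category list is rejected.
# "refunded" is a hypothetical bad value, not part of the categories.
try:
    status.iloc[0] = "refunded"
    rejected = False
except TypeError as exc:
    rejected = True
    print("rejected:", exc)
```

If a new value is legitimate, it has to be registered first, for example with status.cat.add_categories(["refunded"]), before it can be assigned.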

A worked example — and the small-frame catch

Watch the memory number carefully. On a tiny frame, converting actually costs more:

import pandas as pd

df = pd.DataFrame({
    "order_id": [101, 102, 103, 104, 105, 106],
    "status":   ["pending", "shipped", "delivered", "cancelled", "pending", "shipped"],
    "region":   ["North", "South", "North", "East", "East", "West"],
})
print("Memory:", df.memory_usage(deep=True).sum(), "bytes")     # before

df["status"] = df["status"].astype("category")
df["region"] = df["region"].astype("category")
print("Memory:", df.memory_usage(deep=True).sum(), "bytes")     # after

Memory: 933 bytes
Memory: 1038 bytes

Memory went up, from 933 to 1038 bytes. That’s not a contradiction: the categories array and code machinery have a fixed overhead, and on six rows with little repetition it outweighs the savings. The win only appears when the column is long and low-cardinality (the same column shape over 50,000 rows shrinks ~60×). As a rule of thumb, category pays off when the number of unique values is below roughly 5–10% of the column length.
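The large-frame case can be checked with a sketch like the following. It builds a synthetic 50,000-row status column (the data is invented for illustration) and compares memory before and after conversion. The exact ratio depends on the pandas version and its default string storage, so no expected numbers are shown.

```python
import pandas as pd

# Synthetic data: four status values cycled over 50,000 rows.
statuses = ["pending", "shipped", "delivered", "cancelled"]
n_rows = 50_000
status = pd.Series([statuses[i % len(statuses)] for i in range(n_rows)], name="status")

# deep=True counts the string payloads, not just the pointers.
bytes_as_strings = status.memory_usage(deep=True)
bytes_as_category = status.astype("category").memory_usage(deep=True)

# Share of unique values, the quantity the rule of thumb refers to.
unique_share = status.nunique() / len(status)

print("as strings: ", bytes_as_strings, "bytes")
print("as category:", bytes_as_category, "bytes")
print("ratio:", round(bytes_as_strings / bytes_as_category, 1))
print("unique share:", unique_share)
```

Computing the unique share first is a cheap way to decide whether a given column is worth converting.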

The other benefits hold at any size. The internal codes and ordered sorting work exactly as designed:

print("Status categories:", df["status"].cat.categories.tolist())
print("Status codes:", df["status"].cat.codes.tolist())

size_type = pd.CategoricalDtype(["XS", "S", "M", "L", "XL"], ordered=True)
sizes = pd.DataFrame({"item": ["shirt", "hat", "coat", "sock", "jacket"],
                      "size": pd.Categorical(["M", "XL", "S", "XS", "L"], dtype=size_type)})
print(sizes.sort_values("size"))
print(sizes[sizes["size"] > "M"])

Status categories: ['cancelled', 'delivered', 'pending', 'shipped']
Status codes: [2, 3, 1, 0, 2, 3]
     item size
3    sock   XS
2    coat    S
0   shirt    M
4  jacket    L
1     hat   XL
     item size
1     hat   XL
4  jacket    L

The codes [2, 3, 1, 0, 2, 3] index into the alphabetically sorted categories (cancelled=0 … shipped=3). The ordered dtype sorts XS→XL by size, and > "M" correctly returns only L and XL. A plain string column would make both comparisons alphabetically, which gives the wrong answer here. At read time you can avoid loading the column as strings and converting it afterwards by passing pd.read_csv(..., dtype={"status": "category"}).
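As a sketch of the read-time option, the snippet below parses a tiny in-memory CSV so it runs without a file on disk. The CSV contents are made up, and io.StringIO stands in for a real path.

```python
import io
import pandas as pd

# Hypothetical CSV contents; in practice this would be a file path.
csv_text = "order_id,status\n101,pending\n102,shipped\n103,pending\n"

# The dtype mapping makes read_csv build the categorical column directly.
orders = pd.read_csv(io.StringIO(csv_text), dtype={"status": "category"})

print(orders.dtypes)
print(orders["status"].cat.categories.tolist())
```

Note that categories inferred this way are unordered. For an ordered column, one option is to pass a pd.CategoricalDtype with explicit categories and ordered=True in the same dtype mapping.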
