datarekha
Pandas & Data Wrangling Medium Asked at DatabricksAsked at AmazonAsked at Snowflake

How does the category dtype work in pandas and when should you use it?

The short answer

CategoricalDtype stores a column as integer codes plus a small lookup table of unique values, dramatically reducing memory for low-cardinality string columns. It also enforces a fixed set of valid values, enables natural ordering, and speeds up groupby and sort operations.

How to think about it

What this question is really about

The interviewer is checking whether you think about data representation, not just computation. When you load a CSV with a “status” column that repeats “pending”, “shipped”, “delivered” across a million rows, pandas stores each of those as a separate Python string object. That is wasteful and slow. CategoricalDtype is pandas’ answer: store the unique values once and represent every row as a tiny integer pointing at the lookup table.

The two things stored inside a categorical column

A categorical column is really two arrays bundled together:

  • categories: the unique values — stored once, typically a short list
  • codes: one integer per row (often int8) pointing into the categories list

So a million-row “status” column with four possible values becomes: a 4-element categories array plus a million int8 codes. The original object column is roughly 64 bytes per row (Python string overhead); the codes are 1 byte per row. That is a ~30x reduction.

The three benefits beyond memory

  1. Faster GroupBy and sort — comparisons happen on integer codes, not string comparisons.
  2. Enforced valid values — assigning a value not in the categories list raises ValueError immediately, catching upstream data quality issues early.
  3. Ordered categories — you can declare S < M < L < XL and sorting and comparisons respect that order, not the alphabetical one.

Playground — see the memory and ordering in action

Practical rule of thumb

Use category when a string column has fewer than roughly 5–10% unique values relative to its length. At read time, you can skip the double allocation entirely:

df = pd.read_csv("data.csv", dtype={"status": "category", "region": "category"})

For ordered categories — like survey responses (Strongly Disagree → Strongly Agree), T-shirt sizes, or severity levels — the ordered dtype also unlocks min(), max(), and > comparisons that follow your declared order rather than alphabetical order.

Learn it properly Memory optimization

Keep practising

All Pandas & Data Wrangling questions

Explore further

Skip to content