datarekha
Pandas & Data Wrangling Easy Asked at AmazonAsked at AccentureAsked at Walmart

How do you detect and remove duplicate rows in pandas, and how do you control which duplicate to keep?

The short answer

duplicated() returns a boolean mask of rows that are duplicates of an earlier row; drop_duplicates() removes them. Both accept a subset parameter to restrict comparison to specific columns and a keep parameter ('first', 'last', or False) to control which instance is retained or whether all copies are dropped.

How to think about it

What the interviewer is really probing

Duplicates are one of the most common data quality issues, and they creep in from unexpected places — multiple source feeds, botched joins, or re-processed batches. A good answer shows you understand subset (comparing on business key columns, not all columns) and keep (controlling which copy survives), and that you know duplicates are often introduced silently by joins.

The two main tools

duplicated() returns a boolean mask — True for rows that are duplicates of a row appearing earlier. The first occurrence is False by default:

df.duplicated()               # True on 2nd, 3rd... copies
df.duplicated().sum()         # count duplicates
df[df.duplicated(keep=False)] # show ALL copies of duplicated rows

drop_duplicates() removes the flagged rows and returns a clean DataFrame:

df.drop_duplicates()                       # keep first (default)
df.drop_duplicates(keep="last")            # keep last
df.drop_duplicates(keep=False)             # drop ALL copies of any duplicate

The critical parameter: subset

By default both methods compare every column. In practice you almost always want to define a business key — the columns that make a row logically unique — and compare on those:

# order_id + product is the business key; qty difference = real disagreement or duplicate?
df.drop_duplicates(subset=["order_id", "product"], keep="last")

Choosing keep="last" here preserves the most recent record if data was appended multiple times.

Playground — detect, inspect, deduplicate

Detecting duplicates introduced by joins

Joins are the most common source of unexpected duplication. After any merge, check:

before = len(df)
merged = df.merge(lookup, on="key", how="left")
after  = len(merged)
print(f"Row count: {before}{after}")          # should be equal for m:1

If after > before, your right-side key was not unique — you have a fan-out. Use validate="many_to_one" on the merge to catch this automatically.

Learn it properly DataFrame basics

Keep practising

All Pandas & Data Wrangling questions

Explore further

Skip to content