How do you detect and remove duplicate rows in pandas, and how do you control which duplicate to keep?
duplicated() returns a boolean mask of rows that are duplicates of an earlier row; drop_duplicates() removes them. Both accept a subset parameter to restrict comparison to specific columns and a keep parameter ('first', 'last', or False) to control which instance is retained or whether all copies are dropped.
How to think about it
What the interviewer is really probing
Duplicates are one of the most common data quality issues, and they creep in from unexpected places — multiple source feeds, botched joins, or re-processed batches. A good answer shows you understand subset (comparing on business key columns, not all columns) and keep (controlling which copy survives), and that you know duplicates are often introduced silently by joins.
The two main tools
duplicated() returns a boolean mask — True for rows that are duplicates of a row appearing earlier. The first occurrence is False by default:
df.duplicated() # True on 2nd, 3rd... copies
df.duplicated().sum() # count duplicates
df[df.duplicated(keep=False)] # show ALL copies of duplicated rows
drop_duplicates() removes the flagged rows and returns a clean DataFrame:
df.drop_duplicates() # keep first (default)
df.drop_duplicates(keep="last") # keep last
df.drop_duplicates(keep=False) # drop ALL copies of any duplicate
The critical parameter: subset
By default both methods compare every column. In practice you almost always want to define a business key — the columns that make a row logically unique — and compare on those:
# order_id + product is the business key; qty difference = real disagreement or duplicate?
df.drop_duplicates(subset=["order_id", "product"], keep="last")
Choosing keep="last" here preserves the most recent record if data was appended multiple times.
Playground — detect, inspect, deduplicate
Detecting duplicates introduced by joins
Joins are the most common source of unexpected duplication. After any merge, check:
before = len(df)
merged = df.merge(lookup, on="key", how="left")
after = len(merged)
print(f"Row count: {before} → {after}") # should be equal for m:1
If after > before, your right-side key was not unique — you have a fan-out. Use validate="many_to_one" on the merge to catch this automatically.