Pandas & Data Wrangling | Easy | Asked at Amazon, Accenture, Walmart

How do you detect and remove duplicate rows in pandas, and how do you control which duplicate to keep?

For: Data Analyst, Data Scientist, Data Engineer

The short answer

duplicated() returns a boolean mask of rows that are duplicates of an earlier row; drop_duplicates() removes them. Both accept a subset parameter to restrict comparison to specific columns and a keep parameter ('first', 'last', or False) to control which instance is retained or whether all copies are dropped.
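To make the three keep settings concrete, here is a minimal sketch on a made-up four-row frame (the name toy and its key column are illustrative only, not part of the worked example further down). It shows which rows the mask flags under each setting.

```python
import pandas as pd

# Hypothetical toy frame: the value "x" appears three times, "y" once.
toy = pd.DataFrame({"key": ["x", "y", "x", "x"]})

first_mask = toy.duplicated(keep="first")  # flags later copies, spares the first
last_mask = toy.duplicated(keep="last")    # flags earlier copies, spares the last
all_mask = toy.duplicated(keep=False)      # flags every member of a duplicate group

print(first_mask.tolist())  # [False, False, True, True]
print(last_mask.tolist())   # [True, False, True, False]
print(all_mask.tolist())    # [True, False, True, True]
```

Whatever the mask marks True is what drop_duplicates() would remove under the same keep setting.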

How to think about it

Duplicates are one of the most common data-quality issues, creeping in from multiple feeds, botched joins, or reprocessed batches. A good answer shows you understand the two knobs: subset (compare on the business key, not every column) and keep ("first", "last", or False to drop all copies). With the default keep="first", duplicated() marks the 2nd, 3rd… copy as True and leaves the first occurrence False; drop_duplicates() removes exactly the rows that mask flags.
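Before dropping anything, it often helps to look at the duplicate groups side by side. The sketch below is one way to do that, assuming a hypothetical orders table whose business key is (order_id, product); the frame, the column values, and the variable names are invented for illustration.

```python
import pandas as pd

# Hypothetical orders table; (order_id, product) is assumed to be the business key.
orders = pd.DataFrame({
    "order_id": [10, 10, 11, 12, 12],
    "product": ["A", "A", "B", "C", "C"],
    "qty": [1, 1, 7, 2, 5],
})
key = ["order_id", "product"]

# keep=False flags every row of a repeated key, so each group shows up in full.
suspects = orders[orders.duplicated(subset=key, keep=False)].sort_values(key)

# Number of rows sharing each repeated key.
counts = suspects.groupby(key).size()

print(suspects)
print(counts)
```

Seeing the (12, C) rows disagree on qty is the cue to decide on a keep policy deliberately rather than accept the default.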

A worked example — subset is the whole game

Watch how “duplicate” changes depending on which columns you compare. The rows at index 1 and 2 are fully identical, but the rows at index 3 and 4 share the key (3, C) yet differ in qty and ts:

import pandas as pd

df = pd.DataFrame({"order_id": [1, 2, 2, 3, 3, 4],
                   "product":  ["A", "B", "B", "C", "C", "D"],
                   "qty":      [5, 2, 2, 3, 4, 1],
                   "ts":       ["09:00", "09:05", "09:05", "09:10", "09:15", "09:20"]})

print("Duplicates (all columns):", df.duplicated().sum())
print(df[df.duplicated(keep=False)])     # show every copy of a full duplicate

Duplicates (all columns): 1
   order_id product  qty     ts
1         2       B    2  09:05
2         2       B    2  09:05

Comparing all columns, only the (2, B) pair counts: one duplicate. The (3, C) rows escape because their qty and ts differ. Now switch to the business key order_id + product, and the picture changes:

print(df.drop_duplicates(subset=["order_id", "product"], keep="last"))
print("(order_id, product) unique?:", not df.duplicated(subset=["order_id", "product"]).any())

   order_id product  qty     ts
0         1       A    5  09:00
2         2       B    2  09:05
4         3       C    4  09:15
5         4       D    1  09:20

(order_id, product) unique?: False

Now (3, C) is a duplicate, and keep="last" keeps the qty-4 row (09:15) over the qty-3 row, which is the right call if later rows are corrections. The key was not unique, which the boolean check confirms by printing False. That’s the lesson: the default all-column compare would have silently kept both (3, C) rows as “distinct,” when logically they’re the same order. (keep=False drops every copy of a duplicate, which is useful for isolating only the clean, never-duplicated rows.)
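Two follow-ups are worth sketching. First, keep="last" goes by row position, so it only means "most recent" when the frame is ordered by time; sorting first makes that explicit. Second, keep=False can be seen in action. The feed frame below is hypothetical and deliberately out of order, and ts is assumed to be a zero-padded time string that sorts correctly as text.

```python
import pandas as pd

# Hypothetical feed that arrives out of time order.
feed = pd.DataFrame({
    "order_id": [3, 1, 3],
    "product": ["C", "A", "C"],
    "qty": [4, 5, 3],
    "ts": ["09:15", "09:00", "09:10"],
})
key = ["order_id", "product"]

# Sort by timestamp so that "last" means "most recent", not "last in file order".
latest = feed.sort_values("ts").drop_duplicates(subset=key, keep="last")

# keep=False removes every member of a duplicate group, leaving keys seen only once.
never_duplicated = feed.drop_duplicates(subset=key, keep=False)

print(latest["qty"].tolist())            # [5, 4]
print(never_duplicated["qty"].tolist())  # [5]
```

Without the sort, keep="last" on this frame would have kept the 09:10 row (qty 3), because it comes last in the file.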
