What is the difference between merge, join, and concat in pandas?
concat stacks DataFrames along an axis without matching keys; join aligns on the index (or a single key column) using a convenient shorthand; merge is the most general, joining on any column(s) with full SQL-style control over the join type, key names, and suffix handling.
How to think about it
What the interviewer wants to know
All three combine DataFrames, but they answer different questions. concat asks “stack these top-to-bottom (or side-by-side)”. merge asks “align rows where this column’s values match”. join is a convenience alias for merge when you want to align on the index. A strong answer names each one’s sweet spot and its failure mode.
The three tools, in plain English
concat — stack without matching. No key alignment. pandas lines up columns by name and fills gaps with NaN:
pd.concat([df1, df2]) # row-wise: stack vertically (axis=0)
pd.concat([df1, df2], axis=1) # column-wise: side by side
pd.concat([df1, df2], ignore_index=True) # reset to 0, 1, 2, ...
Use case: appending monthly export files, or adding a new set of features alongside existing ones.
merge — SQL-style join on any column(s). The most powerful and explicit option:
orders.merge(customers, on="customer_id", how="left")
# Different column names on each side
orders.merge(customers, left_on="cust_id", right_on="id")
# Multi-key join
df1.merge(df2, on=["year", "region"])
join — shorthand for index-on-index (or index-on-column) merge. Cleaner syntax when you have already set meaningful indexes:
df1.join(df2, how="left") # aligns on index of both
# Join on a specific column of the right frame
df1.join(df2.set_index("customer_id"), on="cust_id")
join is implemented as a thin wrapper around merge — there is no performance difference.
Decision table
| Scenario | Tool |
|---|---|
| Append rows from multiple DataFrames | concat(axis=0) |
| Add columns side-by-side | concat(axis=1) |
| Join on arbitrary column(s) | merge |
| Join on index | join or merge(left_index=True, right_index=True) |
| Join where column names differ | merge(left_on=..., right_on=...) |
Playground — see all three in action
The key insight
The choice is almost always between concat (when you want to stack) and merge (when you want to align on keys). join is merge with a prettier syntax — use it when both DataFrames already have the right index set.
concat is deceptively powerful: it handles schema mismatches gracefully (columns not in both DataFrames get NaN), but that same flexibility can hide problems — a misnamed column in one batch will create two separate NaN-filled columns instead of merging cleanly.