What are the different strategies for handling missing data in pandas — isna, fillna, dropna, and interpolate?
isna/notna detect missing values; dropna removes rows or columns containing them; fillna replaces them with a scalar or a per-column dict; ffill/bfill propagate the previous or next valid value; interpolate estimates values from neighboring points using a chosen method. The right strategy depends on whether missingness is random, structural, or time-ordered.
How to think about it
What the interviewer is listening for
The best answer connects each strategy to why values are missing. If you just say “I use fillna(0)”, that is a red flag — it shows you are treating imputation as a mechanical step rather than a deliberate statistical choice. A strong answer covers detection, then maps each filling strategy to the scenarios where it makes sense.
Step 1 — Always detect first
Before filling anything, understand the scope:
df.isna().sum() # missing count per column
df.isna().mean() # fraction missing per column
df[df["temp"].isna()] # which rows are affected
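As a minimal sketch of the detection step, the three calls above can be run on a small frame. The DataFrame here is hypothetical (column names and values are made up for illustration), not data from any real source.

```python
import numpy as np
import pandas as pd

# Hypothetical sensor log; names and values are illustrative only.
df = pd.DataFrame({
    "temp": [21.5, np.nan, 22.1, np.nan, 23.0],
    "city": ["Oslo", "Oslo", None, "Bergen", "Bergen"],
    "sales": [100.0, 120.0, 90.0, 110.0, np.nan],
})

missing_count = df.isna().sum()        # missing values per column
missing_fraction = df.isna().mean()    # share of rows missing per column
affected_rows = df[df["temp"].isna()]  # rows where temp is missing

print(missing_count)
print(missing_fraction)
print(affected_rows)
```

Here temp is missing in 2 of 5 rows (fraction 0.4), which is already large enough that dropping those rows would deserve a second look.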
Step 2 — Choose a strategy based on WHY data is missing
| Scenario | Strategy |
|---|---|
| Random missingness, small fraction | dropna |
| Categorical column | fillna("unknown") or mode |
| Numeric, no time order | fillna(median) or fillna(mean) |
| Time series / sensor data | ffill then bfill, or interpolate |
| Structural gap (known value) | fillna(0) with documentation |
dropna — remove rows with missing values. Use only when the fraction is small and missingness is random:
df.dropna() # drop rows with a NaN in ANY column
df.dropna(how="all") # drop rows only if ALL columns are NaN
df.dropna(subset=["temp"]) # drop rows only if "temp" is NaN
df.dropna(thresh=2) # keep rows with at least 2 non-NaN values
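To see how the four dropna variants differ, it may help to count surviving rows on the same input. The frame below is a hypothetical example built so that each variant keeps a different set of rows.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 0 is complete, row 3 is entirely empty,
# the other rows are partially empty.
df = pd.DataFrame({
    "temp":     [21.5, np.nan, 22.1, np.nan, np.nan],
    "humidity": [40.0, np.nan, np.nan, np.nan, 45.0],
    "wind":     [5.0, 6.0, 7.0, np.nan, 8.0],
})

any_missing = df.dropna()                # keeps only fully populated rows
all_missing = df.dropna(how="all")       # drops only the fully empty row
temp_known = df.dropna(subset=["temp"])  # requires temp, ignores other columns
at_least_two = df.dropna(thresh=2)       # requires 2+ non-missing values per row

results = {
    "any": any_missing,
    "all": all_missing,
    "subset": temp_known,
    "thresh": at_least_two,
}
for name, result in results.items():
    print(f"{name}: kept rows {list(result.index)}")
```

The default call keeps a single row out of five here, which illustrates why dropna() without arguments is only appropriate when the missing fraction is small.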
fillna — replace with a value. The value should be statistically defensible:
df["sales"].fillna(df["sales"].median()) # median is robust to skew and outliers
df.fillna({"sales": 0, "city": "unknown"}) # per-column control
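A short sketch of why the median is often preferred over the mean, using a hypothetical sales column with one extreme value. Note that fillna returns a new object, so the result has to be assigned to be kept.

```python
import numpy as np
import pandas as pd

# Hypothetical sales table; the 5000.0 entry is a deliberate outlier.
df = pd.DataFrame({
    "sales": [100.0, 110.0, np.nan, 120.0, 5000.0],
    "city": ["Oslo", None, "Bergen", "Oslo", "Oslo"],
})

median_sales = df["sales"].median()  # 115.0, barely affected by the outlier
mean_sales = df["sales"].mean()      # 1332.5, pulled up by the outlier
city_mode = df["city"].mode().iloc[0]  # most frequent category, "Oslo"

# Per-column fill values; df itself is left unchanged.
filled = df.fillna({"sales": median_sales, "city": "unknown"})
print(filled)
```

Filling the gap with 1332.5 would invent a sale more than ten times larger than any typical row, while 115.0 sits among the ordinary values.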
Forward/backward fill — propagate the last (or next) valid value. Natural for time series:
df["temp"].ffill() # fill gap with last known value
df["temp"].bfill() # fill gap with next known value
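The sketch below shows how the two directions behave at the edges of a series, and how the limit parameter caps how far a value is carried. The daily readings are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical daily readings with a leading gap and a two-day interior gap.
temp = pd.Series(
    [np.nan, 20.0, np.nan, np.nan, 23.0],
    index=pd.date_range("2024-01-01", periods=5, freq="D"),
    name="temp",
)

forward = temp.ffill()        # leading NaN stays: nothing earlier to copy
backward = temp.bfill()       # leading NaN becomes 20.0
both = temp.ffill().bfill()   # ffill first, then bfill covers the leading gap
capped = temp.ffill(limit=1)  # carry a value forward at most one step

print(pd.DataFrame({"raw": temp, "ffill": forward, "bfill": backward,
                    "both": both, "ffill_limit_1": capped}))
```

Capping with limit is one way to avoid stretching a stale reading across a long outage; whether one step is the right cap depends on how quickly the quantity changes.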
interpolate — estimate from neighbors. Default is linear:
df["temp"].interpolate(method="linear") # estimate midpoints
# df["temp"].interpolate(method="time") # needs DatetimeIndex
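The difference between method="linear" and method="time" only shows up when timestamps are unevenly spaced. A minimal sketch with hypothetical readings, where the gap sits one day after the first reading and three days before the last:

```python
import numpy as np
import pandas as pd

# Hypothetical irregular timestamps; values are illustrative only.
idx = pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-05"])
temp = pd.Series([10.0, np.nan, 18.0], index=idx, name="temp")

linear = temp.interpolate(method="linear")  # treats points as equally spaced
by_time = temp.interpolate(method="time")   # weights by elapsed time

print(linear.iloc[1])   # 14.0, the midpoint of 10 and 18
print(by_time.iloc[1])  # 12.0, one quarter of the way from 10 to 18
```

With evenly spaced timestamps the two methods give the same result, so method="time" matters mainly for irregular logs.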
The key insight — imputation is a choice, not a default
Each strategy encodes an assumption about why data is missing:
ffill says “the last known value is still valid”: reasonable for sensor readings, stock prices, or any slowly changing quantity.
interpolate says “values change smoothly between neighbors”: reasonable for temperature, but nonsensical for categorical data.
fillna(median) says “the missing value is a typical value”: reasonable for random dropout in surveys.
fillna(0) says “missing means zero”: only reasonable when that is actually true in your domain.
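Filling the same gap with each strategy makes these assumptions concrete. The following is a sketch on a hypothetical daily temperature series with a two-day gap; the column labels in the comparison frame are chosen here for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical daily temperature series with a two-day gap.
temp = pd.Series(
    [10.0, 12.0, np.nan, np.nan, 18.0, 20.0],
    index=pd.date_range("2024-01-01", periods=6, freq="D"),
    name="temp",
)

comparison = pd.DataFrame({
    "original": temp,
    "ffill": temp.ffill(),                             # gap becomes 12, 12
    "interpolate": temp.interpolate(method="linear"),  # gap becomes 14, 16
    "median": temp.fillna(temp.median()),              # gap becomes 15, 15
    "zero": temp.fillna(0),                            # gap becomes 0, 0
})
print(comparison)
```

For a smoothly rising temperature, the interpolated values are the most plausible of the four, and the zero fill introduces a drop that never happened. On a different kind of column the ranking could easily reverse.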