datarekha
Pandas & Data Wrangling | Medium | Asked at Amazon, Microsoft, Databricks

What are the different strategies for handling missing data in pandas — isna, fillna, dropna, and interpolate?

The short answer

isna/notna detect missing values; dropna removes rows or columns containing them; fillna replaces them with a scalar or a per-column dict; ffill/bfill propagate the previous or next valid value (the older fillna(method=...) spelling is deprecated since pandas 2.1); interpolate estimates values from neighboring points using a chosen method. The right strategy depends on whether missingness is random, structural, or time-ordered.

How to think about it

What the interviewer is listening for

The best answer connects each strategy to why values are missing. If you just say “I use fillna(0)”, that is a red flag — it shows you are treating imputation as a mechanical step rather than a deliberate statistical choice. A strong answer covers detection, then maps each filling strategy to the scenarios where it makes sense.

Step 1 — Always detect first

Before filling anything, understand the scope:

df.isna().sum()                    # missing count per column
df.isna().mean()                   # fraction missing per column
df[df["temp"].isna()]              # which rows are affected
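To make the detection step concrete, here is a minimal self-contained sketch. The DataFrame and its column names (temp, city, sales) are hypothetical, invented only for illustration; the pandas calls are the same ones listed above, plus isna().any(axis=1) to find rows with a gap in any column.

```python
import numpy as np
import pandas as pd

# Hypothetical sensor/sales log, made up for this example.
df = pd.DataFrame({
    "temp": [21.0, np.nan, 23.5, np.nan, 22.0],
    "city": ["Pune", "Delhi", None, "Pune", "Delhi"],
    "sales": [100.0, 250.0, np.nan, 90.0, 120.0],
})

missing_count = df.isna().sum()                 # missing count per column
missing_frac = df.isna().mean()                 # fraction missing per column
rows_missing_temp = df[df["temp"].isna()]       # rows where temp is missing
rows_with_any_gap = df[df.isna().any(axis=1)]   # rows missing at least one value

print(missing_count)
print(missing_frac)
print(rows_with_any_gap)
```

Looking at the fraction rather than the raw count is usually what drives the decision: 2 missing values out of 5 is a very different problem from 2 out of 5 million.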

Step 2 — Choose a strategy based on WHY data is missing

Scenario                              Strategy
------------------------------------  --------------------------------
Random missingness, small fraction    dropna
Categorical column                    fillna("unknown") or mode
Numeric, no time order                fillna(median) or fillna(mean)
Time series / sensor data             ffill then bfill, or interpolate
Structural gap (known value)          fillna(0) with documentation

dropna — remove rows with missing values. Use only when the fraction is small and missingness is random:

df.dropna()                        # drop if ANY column is NaN
df.dropna(how="all")               # drop only if ALL columns are NaN
df.dropna(subset=["temp"])         # drop only if a specific column is NaN
df.dropna(thresh=2)                # keep rows with at least 2 non-NaN values
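The difference between these variants is easiest to see on a small frame. The sketch below uses hypothetical data (columns temp, humidity, wind are assumptions) and records which row labels survive each call. It also shows axis=1, which drops columns instead of rows.

```python
import numpy as np
import pandas as pd

# Hypothetical readings; row 4 is entirely missing, row 3 has one value.
df = pd.DataFrame({
    "temp":     [21.0, np.nan, 23.5, np.nan, np.nan],
    "humidity": [40.0, 42.0, np.nan, np.nan, np.nan],
    "wind":     [5.0, 6.0, 7.0, 8.0, np.nan],
})

any_missing = df.dropna()                    # keeps only fully populated rows
all_missing = df.dropna(how="all")           # drops only the all-NaN row
temp_present = df.dropna(subset=["temp"])    # only temp has to be present
at_least_two = df.dropna(thresh=2)           # rows with >= 2 non-NaN values
dense_columns = df.dropna(axis=1, thresh=3)  # columns with >= 3 non-NaN values

print(list(any_missing.index))      # [0]
print(list(all_missing.index))      # [0, 1, 2, 3]
print(list(temp_present.index))     # [0, 2]
print(list(at_least_two.index))     # [0, 1, 2]
print(list(dense_columns.columns))  # ['wind']
```

None of these calls modify df; each returns a new DataFrame, so the result has to be assigned to be kept.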

fillna — replace with a value. The value should be statistically defensible:

df["sales"].fillna(df["sales"].median())        # median resists skew and outliers
df.fillna({"sales": 0, "city": "unknown"})      # per-column control
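As a rough illustration of why the choice of fill value matters, the sketch below compares median and mean imputation on a hypothetical sales column containing one outlier. The data is invented for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 500.0 is an outlier that drags the mean upward.
df = pd.DataFrame({
    "sales": [10.0, 12.0, np.nan, 11.0, 500.0],
    "city": ["Pune", None, "Delhi", "Pune", "Delhi"],
})

median_fill = df["sales"].fillna(df["sales"].median())  # gap becomes 11.5
mean_fill = df["sales"].fillna(df["sales"].mean())      # gap becomes 133.25

# Per-column control: a different fill value for each column.
filled = df.fillna({"sales": df["sales"].median(), "city": "unknown"})

print(median_fill.tolist())
print(mean_fill.tolist())
print(filled)
```

The mean-filled value (133.25) resembles none of the typical observations, which is the usual argument for preferring the median on skewed columns.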

Forward/backward fill — propagate the last (or next) valid value. Natural for time series:

df["temp"].ffill()    # fill gap with last known value
df["temp"].bfill()    # fill gap with next known value
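Two details are worth seeing in action: ffill cannot fill a gap at the start of a series (there is no earlier value), and an unbounded fill will happily stretch one reading across a long outage. The sketch below uses a hypothetical temperature series and the limit parameter to cap how far a value is carried.

```python
import numpy as np
import pandas as pd

# Hypothetical readings with a leading gap, a 3-step outage, and a trailing gap.
temp = pd.Series([np.nan, 20.0, np.nan, np.nan, np.nan, 24.0, np.nan])

forward = temp.ffill()         # leading NaN stays: nothing earlier to copy
backward = temp.bfill()        # trailing NaN stays: nothing later to copy
both = temp.ffill().bfill()    # forward first, then backfill the leading gap
capped = temp.ffill(limit=1)   # carry each value forward at most one step

print(forward.tolist())
print(backward.tolist())
print(both.tolist())           # [20.0, 20.0, 20.0, 20.0, 20.0, 24.0, 24.0]
print(capped.tolist())
```

Capping with limit leaves long outages visibly missing instead of silently papering over them, which is often the safer default for sensor data.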

interpolate — estimate from neighbors. Default is linear:

df["temp"].interpolate(method="linear")          # straight line between neighbors; ignores index spacing
# df["temp"].interpolate(method="time")          # needs DatetimeIndex
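The difference between method="linear" and method="time" only shows up when observations are unevenly spaced. A small sketch, assuming a hypothetical series with a DatetimeIndex where the gap sits one day after the first reading and three days before the next:

```python
import numpy as np
import pandas as pd

# Hypothetical, irregularly spaced readings: Jan 1, Jan 2 (missing), Jan 5.
idx = pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-05"])
temp = pd.Series([10.0, np.nan, 40.0], index=idx)

# "linear" treats points as equally spaced, so the gap gets the midpoint.
linear = temp.interpolate(method="linear")

# "time" weights by elapsed time: 1 day out of 4, so 10 + 30 * 0.25.
by_time = temp.interpolate(method="time")

print(linear.iloc[1])   # 25.0
print(by_time.iloc[1])  # 17.5
```

On regularly sampled data the two methods agree, so the choice matters mainly for logs with dropped or irregular timestamps.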


The key insight — imputation is a choice, not a default

Each strategy encodes an assumption about why data is missing:

  • ffill says “the last known value is still valid” — reasonable for sensor readings, stock prices, or any slowly changing quantity.
  • interpolate says “values change smoothly between neighbors” — reasonable for temperature, but nonsensical for categorical data.
  • fillna(median) says “the missing value is a typical value” — reasonable for random dropout in surveys.
  • fillna(0) says “missing means zero” — only reasonable when that is actually true in your domain.
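One way to make these assumptions visible is to apply each strategy to the same series and put the results side by side. This is a sketch on hypothetical temperature readings, not a recommendation of any one column:

```python
import numpy as np
import pandas as pd

# Hypothetical temperature readings with two isolated gaps.
temp = pd.Series([18.0, np.nan, 26.0, np.nan, 23.0], name="temp")

comparison = pd.DataFrame({
    "raw": temp,
    "ffill": temp.ffill(),                        # last known value still valid
    "median": temp.fillna(temp.median()),         # missing value is typical
    "zero": temp.fillna(0),                       # missing means zero
    "linear": temp.interpolate(method="linear"),  # smooth change between neighbors
})

print(comparison)
```

For the first gap the four strategies give 18.0, 23.0, 0.0, and 22.0 respectively. Every one of them is "filled", yet they describe very different worlds, and fillna(0) is clearly wrong for a temperature reading.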