Pandas & Data Wrangling · Medium · Asked at Amazon, Microsoft, Databricks

What are the different strategies for handling missing data in pandas — isna, fillna, dropna, and interpolate?

For: Data Analyst, Data Scientist, ML Engineer, Data Engineer

The short answer

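isna/notna detect missing values; dropna removes rows or columns containing them; fillna replaces them with a scalar or a per-column dict; ffill/bfill propagate the previous or next valid value (the older fillna(method=...) form is deprecated as of pandas 2.1); interpolate estimates values from neighboring points using a chosen method. The right strategy depends on why values are missing: at random, structurally (the field does not apply), or as gaps in time-ordered data.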
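
dropna is the one call the worked example further down does not exercise, so here is a minimal sketch of its row and column modes. The frame and its column names (a, b, c) are hypothetical and exist only to make the thresholds easy to check by eye.

```python
import numpy as np
import pandas as pd

# Hypothetical frame; names and values are illustrative only.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [np.nan, np.nan, 6.0, np.nan],
    "c": [7.0, 8.0, 9.0, np.nan],
})

mask = df.isna()                          # boolean frame, True where a value is missing
any_missing = df.dropna()                 # drop rows with at least one missing value
all_missing = df.dropna(how="all")        # drop only rows where every value is missing
min_two = df.dropna(thresh=2)             # keep rows with at least 2 non-missing values
dense_cols = df.dropna(axis=1, thresh=3)  # keep columns with at least 3 non-missing values
```

The default dropna() is the most aggressive of these: it keeps only the one fully populated row, which is why it suits small random gaps rather than wide, sparse tables.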

How to think about it

The best answer connects each strategy to why a value is missing. “I use fillna(0)” is a red flag: it treats imputation as a mechanical step rather than a statistical choice. The strong version always detects first (df.isna().sum() for counts, df.isna().mean() for the missing share per column) to size the problem, then maps a strategy to the scenario: dropna for small random gaps, mode or an explicit “unknown” category for categoricals, median or mean for unordered numerics, ffill or interpolate for time series, and fillna(0) only when missing genuinely means zero.
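
The detect-then-map routine described above might look like the sketch below. The frame and its columns (age, segment, refunds) are hypothetical, and the claim that a missing refunds value means "no refunds" is an assumption made for illustration; in real data that has to be confirmed with whoever produced the column.

```python
import numpy as np
import pandas as pd

# Hypothetical data: column names and values are illustrative only.
df = pd.DataFrame({
    "age":     [34.0, np.nan, 29.0, 41.0, 38.0],
    "segment": ["a", "b", None, "b", "b"],
    "refunds": [np.nan, 2.0, np.nan, 1.0, np.nan],  # assumed: NaN means "no refunds"
})

# 1. Detect first: count and share of missing values per column.
missing_count = df.isna().sum()
missing_share = df.isna().mean()

# 2. Map a strategy to each scenario.
complete_rows = df.dropna(subset=["age"])              # small random gaps: drop the rows
most_common = df["segment"].mode().iloc[0]             # most frequent category
segment_filled = df["segment"].fillna(most_common)     # categorical: mode
age_filled = df["age"].fillna(df["age"].median())      # unordered numeric: median
refunds_filled = df["refunds"].fillna(0)               # only because missing means zero here
```

Each fill returns a new Series and leaves df untouched, so the strategies can be compared side by side before committing to one.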

A worked example — three strategies, three different answers

The same gappy temp column gets filled three ways below, and the results disagree, which is the whole point. First build the frame and count the gaps:

import pandas as pd

df = pd.DataFrame({
    "day":   [1, 2, 3, 4, 5, 6, 7],
    "temp":  [22.0, float('nan'), float('nan'), 24.5, 25.0, float('nan'), 26.0],
    "city":  ["NY", "NY", None, "LA", "LA", None, "LA"],
    "sales": [100, 200, float('nan'), 150, float('nan'), 300, 200],
})
print(df.isna().sum())     # detect first

day      0
temp     3
city     2
sales    2
dtype: int64

Three temps missing. Now fill that one column three ways:

print("median:     ", df["temp"].fillna(df["temp"].median()).tolist())
print("forward-fill:", df["temp"].ffill().tolist())
print("interpolate: ", df["temp"].interpolate().tolist())

median:      [22.0, 24.75, 24.75, 24.5, 25.0, 24.75, 26.0]
forward-fill: [22.0, 22.0, 22.0, 24.5, 25.0, 25.0, 26.0]
interpolate:  [22.0, 22.833333333333332, 23.666666666666668, 24.5, 25.0, 25.5, 26.0]

Look at days 2–3. Median stamps the global 24.75 on every gap: flat, and blind to position. Forward-fill carries the last reading (22.0) forward: “the sensor still reads what it last read.” Interpolate walks a straight line from 22.0 to 24.5 (22.83, 23.67): “the value changed smoothly in between.” Each encodes a different assumption, and for time-ordered sensor data that drifts gradually between readings, interpolate is usually closest to the truth. A per-column dict fills different columns by different rules in one call:

print(df.fillna({"sales": df["sales"].median(), "city": "unknown"}))

   day  temp     city  sales
0    1  22.0       NY  100.0
1    2   NaN       NY  200.0
2    3   NaN  unknown  200.0
3    4  24.5       LA  150.0
4    5  25.0       LA  200.0
5    6   NaN  unknown  300.0
6    7  26.0       LA  200.0

city gaps become “unknown” and sales gaps the median (200.0), while temp is deliberately left for a time-aware method — exactly the per-column control real data needs.
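
To finish the job on temp, the per-column fill can be chained with a time-aware step. The sketch below rebuilds the frame from the worked example and completes it, then shows two refinements: limit, which caps how many consecutive gaps a forward-fill will bridge, and method="time", which weights interpolation by elapsed time instead of row position. The sensor series, its timestamps, and its readings are hypothetical.

```python
import numpy as np
import pandas as pd

# Same frame as the worked example above.
df = pd.DataFrame({
    "day":   [1, 2, 3, 4, 5, 6, 7],
    "temp":  [22.0, np.nan, np.nan, 24.5, 25.0, np.nan, 26.0],
    "city":  ["NY", "NY", None, "LA", "LA", None, "LA"],
    "sales": [100, 200, np.nan, 150, np.nan, 300, 200],
})

# Step 1: per-column rules for the unordered columns.
filled = df.fillna({"sales": df["sales"].median(), "city": "unknown"})
# Step 2: time-aware fill for temp; rows are already in day order.
filled["temp"] = filled["temp"].interpolate()
remaining = int(filled.isna().sum().sum())  # total missing cells left

# Refinement 1: carry a stale reading across at most one consecutive gap.
capped = df["temp"].ffill(limit=1)

# Refinement 2: hypothetical readings with uneven spacing (1 hour, then 3 hours).
idx = pd.to_datetime(["2024-01-01 00:00", "2024-01-01 01:00", "2024-01-01 04:00"])
sensor = pd.Series([10.0, np.nan, 14.0], index=idx)
by_position = sensor.interpolate()           # treats rows as equally spaced
by_time = sensor.interpolate(method="time")  # weights by elapsed time
```

With limit=1, day 2 is filled but day 3 stays missing, which keeps a long outage visible instead of papering over it. On the uneven series, position-based interpolation returns the midpoint (12.0), while the time-weighted version lands near 11.0 because the gap sits one hour into a four-hour span.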
