How does boolean indexing work in pandas, and what are the common pitfalls?
Boolean indexing filters a DataFrame by passing a boolean Series (aligned on the index) or a boolean array of the same length as the DataFrame. Common pitfalls include using Python's and/or instead of &/|, which raises a ValueError, and forgetting to wrap each condition in parentheses, which raises an error or, in some cases, silently produces the wrong result.
How to think about it
What is really being tested
The interviewer wants to know whether you understand that pandas filtering works on whole-array boolean masks, not on row-by-row Python logic. Get this right and you unlock fast, readable filters. Get it wrong and you hit cryptic ValueErrors or, worse, silently wrong results.
How it works, step by step
A boolean mask is just a Series of True/False values aligned to the DataFrame’s index. When you write df["age"] > 30, pandas evaluates that comparison across the entire column as a single vectorized operation rather than a Python-level loop, producing a boolean Series. Passing that Series back into df[...] keeps only the rows where the mask is True.
df["age"] > 30
# row 0: False
# row 1: True
# row 2: False
# ...
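To make the two steps concrete, here is a minimal sketch. The DataFrame contents (the name column and the specific ages) are made up for illustration and are not part of the original example.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
df = pd.DataFrame(
    {
        "name": ["Ana", "Ben", "Cy", "Dee"],
        "age": [25, 41, 30, 35],
    }
)

# Step 1: the comparison runs over the whole column and returns a boolean Series.
mask = df["age"] > 30
print(mask.tolist())  # [False, True, False, True]

# Step 2: indexing with the mask keeps only the rows where it is True.
over_30 = df[mask]
print(over_30["name"].tolist())  # ['Ben', 'Dee']
```

Naming the mask before using it is optional, but it makes the intermediate boolean Series easy to inspect while debugging.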
Single vs. compound conditions
For a single condition, the syntax is straightforward:
seniors = df[df["age"] > 30]
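For compound conditions you must use & (AND) and | (OR), not Python’s and/or, and you must wrap each condition in parentheses because & and | bind more tightly than comparison operators such as == or >. Without parentheses, df["dept"] == "eng" & df["age"] > 30 is parsed with "eng" & df["age"] evaluated first, which is not the intended filter: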
# Correct
eng_senior = df[(df["dept"] == "eng") & (df["age"] > 30)]
# Wrong: raises ValueError (the truth value of a Series is ambiguous)
eng_senior = df[df["dept"] == "eng" and df["age"] > 30]
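The difference between the two forms can be checked directly. The sketch below uses made-up sample data and catches the exception so the script runs to completion. Only the dept and age column names come from the text above.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
df = pd.DataFrame(
    {
        "dept": ["eng", "eng", "mkt", "ops"],
        "age": [28, 45, 39, 25],
    }
)

# Element-wise AND: each condition is parenthesized, then combined with &.
eng_senior = df[(df["dept"] == "eng") & (df["age"] > 30)]
print(eng_senior.index.tolist())  # [1]

# Element-wise OR works the same way with |.
eng_or_senior = df[(df["dept"] == "eng") | (df["age"] > 30)]
print(eng_or_senior.index.tolist())  # [0, 1, 2]

# Python's `and` calls bool() on a whole Series, which pandas refuses to do.
try:
    df[(df["dept"] == "eng") and (df["age"] > 30)]
except ValueError as exc:
    error_message = str(exc)
    print(error_message)
```

Note that parentheses do not rescue the and version: the failure comes from converting a whole Series to a single True/False, not from operator precedence.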
Other useful patterns
isin — set membership check, cleaner than chaining == with |:
target = df[df["dept"].isin(["eng", "mkt"])]
~ — negation of a mask:
not_eng = df[~(df["dept"] == "eng")]
query() — readable string syntax for ad-hoc exploration:
high_pay = df.query("salary > 80000 and dept == 'eng'")
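Writing back with loc. Assign through df.loc[mask, column] so the write reaches the original DataFrame; chained indexing such as df[mask]["salary"] = ... can write to a temporary copy and trigger SettingWithCopyWarning: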
df.loc[df["salary"] < 50000, "salary"] = 50000
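The four patterns above can be run together on one small DataFrame. This is a sketch with invented sample values. The dept and salary column names follow the snippets above.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
df = pd.DataFrame(
    {
        "dept": ["eng", "mkt", "ops", "eng"],
        "salary": [95000, 48000, 52000, 45000],
    }
)

# isin: one membership test instead of (dept == "eng") | (dept == "mkt").
target = df[df["dept"].isin(["eng", "mkt"])]

# ~ inverts a mask element-wise.
not_eng = df[~(df["dept"] == "eng")]

# query: the same kind of filter written as a string expression.
high_pay = df.query("salary > 80000 and dept == 'eng'")

# loc with a mask and a column label assigns on df itself.
df.loc[df["salary"] < 50000, "salary"] = 50000

print(target.index.tolist())    # [0, 1, 3]
print(not_eng.index.tolist())   # [1, 2]
print(high_pay.index.tolist())  # [0]
print(df["salary"].tolist())    # [95000, 50000, 52000, 50000]
```

The three filtered frames are computed before the loc assignment, so they reflect the original salaries.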
Why query() is for exploration, not production
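query() parses and evaluates a string at runtime. It is convenient for interactive work but makes code harder to compose programmatically (building filter strings dynamically is fragile), and the parsing overhead typically makes it slower than a plain mask on small DataFrames; with numexpr installed it can be competitive or faster on very large ones. For predictable, composable code, stick to mask syntax in production pipelines.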
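One way to see the composability argument: masks are ordinary objects, so optional filters can be collected in a list and combined, with no string building involved. The sketch below is one possible approach, not a standard pandas recipe. The filter_all helper, the min_age parameter, and the sample data are all hypothetical.

```python
from functools import reduce
import operator

import pandas as pd

# Hypothetical sample data, used only for illustration.
df = pd.DataFrame(
    {
        "dept": ["eng", "eng", "mkt", "eng"],
        "salary": [95000, 70000, 90000, 85000],
        "age": [29, 45, 38, 41],
    }
)


def filter_all(frame, masks):
    """Keep rows where every mask is True (hypothetical helper)."""
    if not masks:
        return frame
    # Fold the list with element-wise AND: m1 & m2 & m3 ...
    combined = reduce(operator.and_, masks)
    return frame[combined]


# Conditions can be added or skipped with normal Python control flow.
min_age = 40  # stands in for an optional parameter; None would skip this filter
masks = [df["dept"] == "eng", df["salary"] > 80000]
if min_age is not None:
    masks.append(df["age"] >= min_age)

result = filter_all(df, masks)
print(result.index.tolist())  # [3]
```

The equivalent with query() would require assembling and escaping a string such as "dept == 'eng' and salary > 80000 and age >= 40" piece by piece, which is where the fragility comes from.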