Explain demographic parity vs equalized odds. Can you satisfy both at once?

Demographic parity requires equal positive-prediction rates across groups, ignoring the true label; equalized odds requires equal true-positive and false-positive rates across groups, conditioning on the true label. When base rates differ, you generally cannot satisfy both. The only classifiers that do are those whose true-positive rate equals their false-positive rate, which means they carry no information about the label. Which metric to use depends on the harm you're trying to prevent.
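The conflict follows from one line of algebra. Write $\pi_g$ for group $g$'s base rate $P(Y=1 \mid A=g)$. A group's selection rate is then a mix of its true-positive and false-positive rates:

$$
\mathrm{SR}_g = \mathrm{TPR}_g\,\pi_g + \mathrm{FPR}_g\,(1-\pi_g)
$$

If equalized odds holds, every group shares the same $\mathrm{TPR}$ and $\mathrm{FPR}$. The parity gap between groups $a$ and $b$ is therefore

$$
\mathrm{SR}_a - \mathrm{SR}_b = (\mathrm{TPR} - \mathrm{FPR})\,(\pi_a - \pi_b),
$$

which is zero only if the base rates match or $\mathrm{TPR} = \mathrm{FPR}$ (a classifier that ignores the label).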

Where does bias enter an ML pipeline, and what mitigation options do you have at each stage?

Bias can enter through the data (historical, sampling, or labeling bias), the features (proxies for protected attributes), the objective (optimizing only for accuracy), and deployment (feedback loops). Mitigations are grouped into pre-processing (reweighting or resampling data), in-processing (adding fairness constraints during training), and post-processing (adjusting thresholds per group). Removing the protected attribute alone is insufficient because of proxy variables.
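As one concrete pre-processing option, here is a minimal sketch of reweighing in the style of Kamiran and Calders. The lesson doesn't name this technique, and the helper `reweighing_weights` is hypothetical. Each (group, label) cell gets the weight P(A=a) * P(Y=y) / P(A=a, Y=y), so that in the weighted data the label looks statistically independent of the group.

```python
import numpy as np

def reweighing_weights(groups, labels):
    """Hypothetical helper: per-sample weights that decouple labels from groups.

    Weight for a sample in cell (a, y) = P(A=a) * P(Y=y) / P(A=a, Y=y).
    """
    groups = np.asarray(groups)
    labels = np.asarray(labels)
    weights = np.ones(len(labels), dtype=float)
    for g in np.unique(groups):
        for y in np.unique(labels):
            cell = (groups == g) & (labels == y)
            if cell.any():
                expected = (groups == g).mean() * (labels == y).mean()
                weights[cell] = expected / cell.mean()
    return weights

# Toy data: group A is mostly labelled positive, group B mostly negative.
groups = np.array(["A"] * 4 + ["B"] * 4)
labels = np.array([1, 1, 1, 0, 1, 0, 0, 0])
w = reweighing_weights(groups, labels)

for g in ("A", "B"):
    m = groups == g
    raw = labels[m].mean()
    weighted = np.average(labels[m], weights=w[m])
    print(f"group {g}: positive rate {raw:.2f} -> weighted {weighted:.2f}")
```

Both groups move from 0.75 and 0.25 to a weighted positive rate of 0.50. Many scikit-learn estimators accept such weights through the `sample_weight` argument of `fit`.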

Explain the bias-variance tradeoff and how you'd diagnose which one you have.

Bias is error from oversimplifying assumptions (underfitting); variance is error from sensitivity to the particular training set (overfitting). For squared-error loss, expected error decomposes into squared bias, variance, and irreducible noise, and reducing one of the first two often increases the other. You diagnose by comparing training and validation error: high error on both means high bias, while a large gap (low training error, high validation error) means high variance.
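The diagnosis rule can be sketched in a few lines. This is a toy setup, assuming noisy samples of a sine curve fit with polynomials of increasing degree. The `diagnose` helper and its thresholds (`acceptable_mse`, `max_gap`) are hypothetical choices, not standard values.

```python
import numpy as np

def diagnose(train_mse, val_mse, acceptable_mse, max_gap):
    """Rule of thumb: high error on both -> bias; large train/val gap -> variance."""
    high_bias = train_mse > acceptable_mse
    high_variance = (val_mse - train_mse) > max_gap
    if high_bias and high_variance:
        return "high bias and high variance"
    if high_bias:
        return "high bias"
    if high_variance:
        return "high variance"
    return "ok"

rng = np.random.default_rng(0)

def sample(n):
    """Noisy samples of sin(2*pi*x) on [0, 1]."""
    x = rng.uniform(0, 1, n)
    return x, np.sin(2 * np.pi * x) + rng.normal(0, 0.2, n)

x_tr, y_tr = sample(20)
x_va, y_va = sample(200)

results = {}
for degree in (1, 4, 15):
    model = np.polynomial.Polynomial.fit(x_tr, y_tr, degree)
    train_mse = np.mean((model(x_tr) - y_tr) ** 2)
    val_mse = np.mean((model(x_va) - y_va) ** 2)
    results[degree] = (train_mse, val_mse)
    verdict = diagnose(train_mse, val_mse, acceptable_mse=0.1, max_gap=0.1)
    print(f"degree {degree:2d}: train {train_mse:.3f}  val {val_mse:.3f}  -> {verdict}")
```

You should typically see the straight line flagged as high bias (both errors high) and the degree-15 fit flagged as high variance (tiny training error, much larger validation error).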

What is the accuracy paradox and how does it expose the failure of accuracy as a metric?

The accuracy paradox occurs when a trivial model (one that always predicts the majority class) achieves high accuracy on an imbalanced dataset despite having zero predictive power for the minority class. A model that predicts 'not fraud' on every transaction achieves 99.9% accuracy if fraud is 0.1% of the data, yet its recall for fraud is zero. Accuracy is only informative when classes are roughly balanced; on imbalanced data, report per-class metrics such as precision and recall.
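The fraud example above is easy to reproduce. This is a minimal sketch, assuming 100,000 transactions of which exactly 0.1% are fraud:

```python
import numpy as np

# 100,000 transactions, 0.1% of them fraud (label 1).
y_true = np.zeros(100_000, dtype=int)
y_true[:100] = 1

# The trivial "model": predict 'not fraud' for every transaction.
y_pred = np.zeros_like(y_true)

accuracy = (y_pred == y_true).mean()
fraud_recall = y_pred[y_true == 1].mean()   # share of actual fraud that was caught

print(f"accuracy     = {accuracy:.3f}")     # -> 0.999
print(f"fraud recall = {fraud_recall:.3f}") # -> 0.000
```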

Fairness & bias in ML — Machine Learning

A loan model with 92% accuracy sounds great — until you notice it approves one demographic group at twice the rate of another with the same qualifications. A model can be accurate overall and still systematically unfair across groups, and in 2026 — with the EU AI Act in force for high-risk systems and fair-lending laws already on the books — fairness is an engineering responsibility, not a footnote.

Where bias comes from

Models don’t invent bias; they inherit and amplify it. The usual sources:

Biased labels / historical data. If past lending decisions were biased, a model trained to imitate them learns the bias. “Garbage in, bias out.”
Sampling bias. Under-represented groups get worse predictions simply because the model saw fewer of them.
Proxy features. Even after dropping a protected attribute, correlated features (zip code, name, device) let the model reconstruct it.
Feedback loops. A biased model’s decisions shape future data, entrenching the bias over time.
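To see why dropping the protected attribute isn't enough, here is a minimal simulation under an assumed proxy. A hypothetical binary `zip_region` feature matches group membership 90% of the time. A rule that never sees the group column can still recover it, and it will treat the groups differently.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

group = rng.random(n) < 0.5                      # protected attribute (dropped from features)
# Assumed proxy: 90% of each group lives in "its" zip region.
zip_region = np.where(rng.random(n) < 0.9, group, ~group)

# Simplest possible "model" of group membership: guess it equals zip_region.
recovered_accuracy = (zip_region == group).mean()
print(f"group recovered from zip region alone: {recovered_accuracy:.1%}")

# Any decision rule that leans on zip_region therefore splits the groups.
approve = zip_region                              # e.g. a rule favouring one region
print(f"approval rate, group True : {approve[group].mean():.2f}")
print(f"approval rate, group False: {approve[~group].mean():.2f}")
```

With this setup, expect roughly 90% recovery and approval rates near 0.9 versus 0.1, even though `group` never enters the decision.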

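Crucially, the bias often lives in the input scores and flows straight through a single decision threshold. One cutoff applied to two groups whose score distributions differ yields different approval rates.

[Figure: score distributions for two groups with one shared decision threshold, producing different approval rates]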
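The single-threshold effect is easy to reproduce. This minimal simulation assumes two hypothetical normal score distributions, with group B's scores shifted slightly lower, and applies one shared cutoff to both:

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(0)
# Hypothetical score distributions: group B's scores sit slightly lower,
# e.g. because the scorer learned from biased historical labels.
scores_a = rng.normal(0.60, 0.15, 5_000)
scores_b = rng.normal(0.50, 0.15, 5_000)

threshold = 0.55                       # one cutoff applied to everyone
rate_a = (scores_a >= threshold).mean()
rate_b = (scores_b >= threshold).mean()

# Closed-form expectation for comparison: P(score >= t) = 1 - CDF(t).
expected_a = 1 - NormalDist(0.60, 0.15).cdf(threshold)
expected_b = 1 - NormalDist(0.50, 0.15).cdf(threshold)

print(f"approval rate A = {rate_a:.2f} (expected {expected_a:.2f})")
print(f"approval rate B = {rate_b:.2f} (expected {expected_b:.2f})")
```

The expected rates work out to about 0.63 and 0.37. A shift of 0.1 in mean score becomes a gap of roughly 26 points in approval rate.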

Measuring fairness — pick a definition

“Fair” isn’t one thing. The two most common group-fairness metrics ask different questions:

Demographic parity: do groups get approved at the same rate? (Equal selection rate.) It ignores who’s actually qualified.
Equalized odds: among the qualified, are groups approved at the same rate (equal true-positive rate)? Among the unqualified, are they wrongly approved at the same rate (equal false-positive rate)? It conditions on the ground truth.
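Formally, let $\hat{Y}$ be the model's decision, $Y$ the true label, and $A$ the group. For any pair of groups $a, b$, the two criteria read:

$$
\text{Demographic parity:}\quad P(\hat{Y}=1 \mid A=a) = P(\hat{Y}=1 \mid A=b)
$$

$$
\text{Equalized odds:}\quad P(\hat{Y}=1 \mid Y=y, A=a) = P(\hat{Y}=1 \mid Y=y, A=b) \quad \text{for } y \in \{0, 1\}
$$

The $y=1$ condition is equal true-positive rates; the $y=0$ condition is equal false-positive rates.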

```python
import numpy as np

# Predictions and ground truth for two groups (1 = approved / qualified).
rng = np.random.default_rng(0)

def group(n, qrate, approve_if_qualified, approve_if_not):
    """Simulate one group: who is qualified, and whom the model approves."""
    qual = rng.random(n) < qrate
    # Qualified applicants are approved with one probability, unqualified with another.
    pred = np.where(qual, rng.random(n) < approve_if_qualified, rng.random(n) < approve_if_not)
    return qual, pred

qa, pa = group(500, 0.5, 0.80, 0.20)   # group A
qb, pb = group(500, 0.5, 0.60, 0.15)   # group B (lower approval even when qualified)

def metrics(q, p):
    sel = p.mean()                       # selection rate (demographic parity)
    tpr = p[q].mean()                    # true-positive rate (one half of equalized odds)
    return sel, tpr

sa, ta = metrics(qa, pa)
sb, tb = metrics(qb, pb)
print(f"selection rate  A={sa:.2f}  B={sb:.2f}  -> parity gap {abs(sa-sb):.2f}")
print(f"true-pos rate   A={ta:.2f}  B={tb:.2f}  -> odds gap   {abs(ta-tb):.2f}")
print("\nBoth gaps > 0: the model is unfair on both definitions. Pick which to fix.")
```

```
selection rate  A=0.47  B=0.40  -> parity gap 0.06
true-pos rate   A=0.76  B=0.57  -> odds gap   0.18

Both gaps > 0: the model is unfair on both definitions. Pick which to fix.
```

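The damning number is the odds gap of 0.18: among qualified applicants, Group B is approved at a true-positive rate of 0.57 versus 0.76 for Group A. Strictly, the script checks only half of equalized odds. The full criterion also compares false-positive rates among the unqualified, which the simulation sets to 0.20 for A and 0.15 for B in expectation. Both simulated groups also share the same 50% base rate, so here a model with equal true- and false-positive rates would satisfy demographic parity too, up to sampling noise. For an informative classifier, the two definitions only become impossible to satisfy together when base rates differ. Which gap to close is a choice you have to make before you mitigate.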
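A slightly fuller per-group report adds false-positive rates and summarizes equalized odds as the larger of the TPR and FPR gaps. This is a common convention, used for example by Fairlearn's `equalized_odds_difference`. The helpers `group_rates` and `fairness_report` below are hypothetical names, and the tiny dataset is made up so the numbers can be checked by hand.

```python
import numpy as np

def group_rates(y_true, y_pred):
    """Selection rate, TPR and FPR for one group (labels and predictions as 0/1)."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    return {
        "selection_rate": y_pred.mean(),
        "tpr": y_pred[y_true].mean(),      # approved among the qualified
        "fpr": y_pred[~y_true].mean(),     # approved among the unqualified
    }

def fairness_report(y_true, y_pred, groups):
    """Per-group rates plus demographic-parity and equalized-odds gaps."""
    y_true, y_pred, groups = map(np.asarray, (y_true, y_pred, groups))
    rates = {g: group_rates(y_true[groups == g], y_pred[groups == g])
             for g in np.unique(groups)}

    def gap(key):
        values = [r[key] for r in rates.values()]
        return max(values) - min(values)

    return rates, {
        "parity_gap": gap("selection_rate"),
        "odds_gap": max(gap("tpr"), gap("fpr")),
    }

# Tiny hand-checkable example: 4 applicants per group.
y_true = [1, 1, 0, 0,   1, 1, 0, 0]
y_pred = [1, 1, 1, 0,   1, 0, 0, 0]
groups = ["A"] * 4 + ["B"] * 4
rates, gaps = fairness_report(y_true, y_pred, groups)
for g, r in rates.items():
    print(g, {k: round(float(v), 2) for k, v in r.items()})
print({k: round(float(v), 2) for k, v in gaps.items()})
```

Here group A has TPR 1.0 and FPR 0.5, and group B has TPR 0.5 and FPR 0.0. Both gaps come out to 0.5. To run it on the simulation above, concatenate the groups, e.g. `fairness_report(np.r_[qa, qb], np.r_[pa, pb], ["A"] * 500 + ["B"] * 500)`.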

Mitigation and accountability

You can intervene at three stages: pre-processing (reweight or rebalance the data), in-processing (add a fairness constraint to training), or post-processing (adjust the decision threshold per group, as the figure above hints). The open-source Fairlearn library, originally developed at Microsoft, covers all three stages and provides `MetricFrame` for per-group metrics. Pair the technical fix with model cards: short documents recording a model’s intended use, performance per group, and known limitations. Keep them as living evidence rather than a one-time PDF, which is the direction the EU AI Act’s documentation requirements for high-risk systems push.
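As a sketch of the post-processing route, the snippet below picks a separate threshold per group so that each group is approved at the same target rate (demographic parity). It is a deliberately simple, deterministic stand-in. The helper `parity_thresholds` is hypothetical and the score distributions are made up. Fairlearn's `ThresholdOptimizer` is the library route and also supports equalized-odds constraints.

```python
import numpy as np

def parity_thresholds(scores, groups, target_rate):
    """Hypothetical helper: per-group cutoffs so each group's approval rate is ~target_rate."""
    return {g: np.quantile(scores[groups == g], 1 - target_rate)
            for g in np.unique(groups)}

rng = np.random.default_rng(0)
# Hypothetical scores: group B's distribution sits lower than group A's.
scores = np.concatenate([rng.normal(0.60, 0.15, 5_000), rng.normal(0.50, 0.15, 5_000)])
groups = np.array(["A"] * 5_000 + ["B"] * 5_000)

cutoffs = parity_thresholds(scores, groups, target_rate=0.5)
cutoff_per_sample = np.array([cutoffs[g] for g in groups])
approved = scores >= cutoff_per_sample

for g in ("A", "B"):
    print(f"group {g}: cutoff {cutoffs[g]:.3f}, approval rate {approved[groups == g].mean():.2f}")
```

Equalizing selection rates this way says nothing about error rates. When the groups' score distributions differ, it can widen the true-positive gap, which is the parity-versus-odds conflict again.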

In one breath

A model can be accurate overall and still systematically unfair — aggregate metrics hide per-group disparities, so always measure per protected group.
Bias enters through biased labels, sampling gaps, proxy features (zip code, name, device), and feedback loops — dropping the protected attribute alone doesn’t remove it.
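Demographic parity asks for equal selection rates; equalized odds asks for equal true-positive rates among the qualified and equal false-positive rates among the unqualified. They ask different questions and can conflict.
When base rates differ, these criteria cannot generally hold together. Demographic parity and equalized odds conflict unless the classifier is uninformative, and calibration and equalized odds conflict unless predictions are perfect. Fairness is a choice for your context, not a number to maximize.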
Mitigate at pre-/in-/post-processing (e.g. Fairlearn), document in model cards, and monitor in production, because fairness drifts as data shifts.

Quick check

Q1. A model has 92% overall accuracy. Can it still be unfair?

Q2. What's the difference between demographic parity and equalized odds?

Q3. Why can't you usually satisfy demographic parity, equalized odds, and calibration simultaneously?

Fairness joins interpretability in the responsible-ML toolkit. To audit why a model makes the decisions it does, revisit SHAP and SHAP vs LIME.
