Conditional & Total Probability
P(A given B) rescales probability to the world where B happened. The law of total probability stitches those pieces back together — and sets up Bayes.
What you'll learn
- Conditional probability P(A|B) = P(A and B) / P(B), and what 'given' really does
- The multiplication rule P(A and B) = P(A|B)·P(B)
- The law of total probability: splitting an event across exhaustive cases
- Reading a probability tree to compute an overall probability
Before you start
It’s raining outside. You already know that — so the chance you also need an
umbrella isn’t 30% any more, it’s basically 1. Conditioning is what your
brain just did: once you learn B happened, you zoom into the slice of the world
where B is true and re-measure A inside that slice. The notation P(A | B)
(“probability of A given B”) is just a name for the move. It’s also the quantity
almost every supervised ML model is built to estimate — a spam filter learns
P(spam | the words in this email), a classifier learns P(label | features).
We’ll write down the formula, then use it to stitch overall probabilities
together from cases — that second move (the law of total probability) is the
workhorse of the lesson.
The definition
P(A | B) = P(A ∩ B) / P(B), for P(B) > 0
You restrict attention to outcomes where B holds (that’s the new denominator
P(B)), then ask what fraction of those also have A. Rearranging gives the
multiplication rule, which is often the more useful form:
P(A ∩ B) = P(A | B) · P(B)
“The chance both happen = chance B happens, times the chance A happens given B.”
Geometrically, picture A and B as two overlapping circles. Conditioning on B
fades everything outside B: that’s the “zoom into the world where B is true”
move. P(A | B) is then just the area of A∩B as a fraction of B’s area, exactly
what the formula says.
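To check the definition numerically, here is a minimal Python sketch that enumerates two fair dice. The events A (the dice sum to 8) and B (the first die is even) are illustrative choices rather than part of the lesson, and the helper `prob` is a hypothetical name introduced only for this example.

```python
from fractions import Fraction
from itertools import product

# Sample space: all 36 equally likely outcomes of two fair six-sided dice.
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    """Probability of an event (a predicate on outcomes) under the uniform measure."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

def A(o):  # illustrative event: the two dice sum to 8
    return o[0] + o[1] == 8

def B(o):  # illustrative event: the first die is even
    return o[0] % 2 == 0

p_b = prob(B)
p_a_and_b = prob(lambda o: A(o) and B(o))
p_a_given_b = p_a_and_b / p_b  # the definition: P(A | B) = P(A ∩ B) / P(B)

# Same number by "zooming in": count A only among the outcomes where B holds.
in_b = [o for o in outcomes if B(o)]
zoomed = Fraction(sum(1 for o in in_b if A(o)), len(in_b))

print(p_a_given_b, zoomed)  # 1/6 1/6
```

Both routes agree: dividing by P(B) and re-counting inside B are the same operation. Exact fractions are used so the equality check is not blurred by floating-point rounding.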
The law of total probability
Often you can’t get P(A) directly, but you can split the world into exhaustive,
mutually exclusive cases and find A’s probability within each. Then you recombine,
weighting by how likely each case is.
If cases B₁, B₂, … are mutually exclusive and cover everything:
P(A) = Σ P(A | Bᵢ) · P(Bᵢ)
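Why the formula holds can be sketched in three lines, using only the partition assumption and the multiplication rule from the previous section (each case is assumed to have P(Bᵢ) > 0 so the conditional is defined):

```latex
\begin{aligned}
A    &= \bigcup_i (A \cap B_i)
     && \text{(the } B_i \text{ cover everything)} \\
P(A) &= \sum_i P(A \cap B_i)
     && \text{(the pieces } A \cap B_i \text{ are disjoint)} \\
     &= \sum_i P(A \mid B_i)\, P(B_i)
     && \text{(multiplication rule on each term)}
\end{aligned}
```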
Worked example. A factory: Machine 1 makes 60% of items with a 2% defect rate;
Machine 2 makes 40% with a 5% defect rate. The overall defect probability is
0.6·0.02 + 0.4·0.05 = 0.012 + 0.020 = 0.032 — i.e. 3.2%.
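The same calculation as a minimal Python sketch. The numbers come from the factory example; the function name `total_probability` and the dictionary layout are assumptions made for illustration.

```python
import math

# Each case B_i maps to (P(B_i), P(defect | B_i)), taken from the factory example.
machines = {
    "M1": (0.60, 0.02),
    "M2": (0.40, 0.05),
}

def total_probability(cases):
    """P(A) = sum over i of P(A | B_i) * P(B_i), for a partition {B_i}."""
    weights = [p_case for p_case, _ in cases.values()]
    # The cases must cover everything, so their probabilities must sum to 1.
    if not math.isclose(sum(weights), 1.0):
        raise ValueError("case probabilities must sum to 1")
    return sum(p_case * p_a_given_case for p_case, p_a_given_case in cases.values())

p_defect = total_probability(machines)
print(round(p_defect, 4))  # 0.032
```

The sum-to-1 check only catches cases that fail to cover everything; it cannot tell whether the cases overlap, so mutual exclusivity is still something you have to argue from the problem setup.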
Flipping the question (given a defect, which machine did it come from?) is
Bayes’ theorem, the next lesson. For a preview: dividing each path’s
contribution by the total gives P(M2 | defective) = 0.020 / 0.032 = 0.625.
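The preview step, dividing each path by the total, can be sketched in a few lines of Python. The variable names are illustrative; the numbers are the ones from the factory example.

```python
# Path contributions P(defect | M_i) * P(M_i), from the worked example.
contributions = {
    "M1": 0.60 * 0.02,
    "M2": 0.40 * 0.05,
}

# P(defect), by the law of total probability.
total = sum(contributions.values())

# Dividing each path by the total gives P(M_i | defect).
posterior = {name: c / total for name, c in contributions.items()}

for name, p in posterior.items():
    print(name, round(p, 3))
```

Machine 2 makes fewer items but accounts for most of the defects (0.625 versus 0.375), and the two posteriors sum to 1 because every defect came from one machine or the other.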
Quick check
Practice this in an interview
The law of total probability decomposes P(A) over a mutually exclusive, exhaustive partition of the sample space: P(A) = Σ P(A|Bᵢ)·P(Bᵢ). It is the engine behind the Bayes denominator and any calculation where you want an overall rate built from segment-level rates.
Conditional probability P(A|B) is the probability of A given that B is known to have occurred, computed as P(A and B) / P(B) when P(B) > 0. It narrows the sample space to B, whereas joint probability P(A and B) lives in the full, unrestricted space.
Bayes' theorem updates a prior probability with new evidence: P(H|E) = P(E|H) P(H) / P(E). In disease testing, ignoring the low base rate (prior) makes a positive test look far more alarming than it really is — most positives are false positives when the disease is rare.
The joint distribution P(X, Y) fully specifies two random variables together. Marginals P(X) and P(Y) are obtained by summing (or integrating) the joint over the other variable. Conditionals P(X|Y=y) are the joint sliced at a fixed y value, renormalized by the marginal P(Y=y).
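The slice-and-renormalize recipe for a discrete joint distribution can be sketched as follows. The joint table is hypothetical, made up purely for illustration, and the function names `marginal_y` and `conditional_x_given_y` are assumptions.

```python
from fractions import Fraction

# Hypothetical joint distribution P(X, Y) with X in {0, 1} and Y in {"a", "b"}.
joint = {
    (0, "a"): Fraction(1, 10),
    (0, "b"): Fraction(3, 10),
    (1, "a"): Fraction(2, 10),
    (1, "b"): Fraction(4, 10),
}

def marginal_y(y):
    """P(Y = y): sum the joint over every value of X."""
    return sum(p for (_, yy), p in joint.items() if yy == y)

def conditional_x_given_y(y):
    """P(X | Y = y): slice the joint at y, then renormalize by P(Y = y)."""
    p_y = marginal_y(y)
    return {x: p / p_y for (x, yy), p in joint.items() if yy == y}

print(marginal_y("a"))             # 3/10
print(conditional_x_given_y("a"))  # X=0 has probability 1/3, X=1 has 2/3
```

The renormalization is what makes the slice a proper distribution: the raw slice sums to P(Y = y), and dividing by that marginal makes it sum to 1.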