Exact Inference: Variable Elimination
Want the exact posterior on a Bayes net? Multiply the CPTs together and sum out the variables you don't care about, one at a time. That's variable elimination.
What you'll learn
- Variable elimination computes EXACT conditional probabilities on a Bayes net
- The procedure: factor the joint into CPTs, sum out hidden variables one by one, normalise
- How to do one elimination step on a small chain net
- Classify inference methods as exact (VE, enumeration) vs approximate (rejection / likelihood-weighting / Gibbs)
Before you start
You’ve drawn the Bayes net and written out the CPTs. Now somebody asks the practical question: given a few observed values, what’s the probability of another one? Variable elimination is the routine that gives you the exact answer — no sampling, no approximation — by sweeping the unwanted variables out of the joint, one at a time.
It’s slower than sampling on giant networks, but precise. For exam-sized nets (2-4 nodes), it’s the right tool and one elimination step is usually all GATE asks you to perform. The same sum-product engine runs inside probabilistic programming libraries and diagnosis systems whenever an exact answer is affordable — so the hand-trace you practise here is a scaled-down version of what those tools do.
The procedure
To compute P(Query | Evidence):
- Write the joint as a product of CPT factors — one factor per node, from the Bayes-net factorisation.
- Fix the evidence variables to their observed values.
- Pick a hidden (non-query, non-evidence) variable and sum it out: replace every factor that contains it with one factor that’s the sum over that variable’s values.
- Repeat until only the query variable remains.
- Normalise so the surviving distribution sums to 1.
The answer is exact (up to floating-point error). That’s the headline property — VE is not a sampling method.
Worked example — eliminate one variable
A chain
A → B → C. GivenP(A=1) = 0.5,P(B=1 | A=1) = 0.7,P(B=1 | A=0) = 0.2,P(C=1 | B=1) = 0.9,P(C=1 | B=0) = 0.4. FindP(C=1)by summing out A and B.
Factorisation: P(A, B, C) = P(A) · P(B | A) · P(C | B). We need the marginal
P(C=1) = Σ_A Σ_B P(A) · P(B | A) · P(C=1 | B).
Step 1 — eliminate A. Combine P(A) · P(B | A) into a factor over B:
P(B=1) = P(B=1 | A=1)·P(A=1) + P(B=1 | A=0)·P(A=0)
= 0.7 · 0.5 + 0.2 · 0.5
= 0.35 + 0.10 = 0.45
P(B=0) = 1 − 0.45 = 0.55
Step 2 — eliminate B. Combine the new P(B) with P(C=1 | B):
P(C=1) = P(C=1 | B=1)·P(B=1) + P(C=1 | B=0)·P(B=0)
= 0.9 · 0.45 + 0.4 · 0.55
= 0.405 + 0.220 = 0.625
So P(C=1) = 0.625 — exactly, no sampling involved. Each elimination
step is one weighted sum over the values of the variable being removed.
How GATE asks this
Two patterns. MSQ: which of the listed methods compute exact posteriors on a Bayes net? Variable elimination yes; enumeration of the joint yes; rejection / likelihood-weighting / Gibbs no — they’re sampling. NAT: perform one elimination step on a 3-node chain or v-structure and report the marginal. GATE DA 2025 ran an MSQ asking exactly this classification.
Method-classification cheat sheet:
| Method | Type |
|---|---|
| Variable elimination | Exact |
| Enumeration / brute-force joint | Exact |
| Rejection sampling | Approximate |
| Likelihood weighting | Approximate |
| Gibbs sampling (MCMC) | Approximate |
Quick check
Quick check
Practice this in an interview
All questionsThe law of total probability decomposes P(A) over a mutually exclusive, exhaustive partition of the sample space: P(A) = Σ P(A|Bᵢ)·P(Bᵢ). It is the engine behind the Bayes denominator and any calculation where you want an overall rate built from segment-level rates.
MLE maximises the likelihood of the data alone; MAP (Maximum A Posteriori) adds a prior over parameters and maximises the posterior, making it equivalent to regularised MLE. Frequentists treat parameters as fixed unknowns; Bayesians treat them as random variables with a prior distribution.
Bayes' theorem updates a prior probability with new evidence: P(H|E) = P(E|H) P(H) / P(E). In disease testing, ignoring the low base rate (prior) makes a positive test look far more alarming than it really is — most positives are false positives when the disease is rare.
Ridge regression is equivalent to maximum a posteriori (MAP) estimation with a zero-mean Gaussian prior on the coefficients. The regularization strength λ corresponds to the ratio of the noise variance to the prior variance — stronger regularization means you believe coefficients are drawn from a tighter distribution around zero.