Approximate Inference: Sampling
When exact inference is too slow, draw a few thousand samples from the Bayes net and estimate the answer. Rejection, likelihood-weighting, and Gibbs — three classic recipes.
What you'll learn
- Why sampling — exact inference (VE / enumeration) gets too expensive on large networks
- Rejection sampling: draw full samples, discard those that don't match the evidence
- Likelihood weighting: fix the evidence, weight each sample by the product of evidence CPT entries
- Gibbs sampling: an MCMC method that resamples one variable at a time given the others
- All three sampling methods are APPROXIMATE; only VE / enumeration give exact answers
Before you start
Variable elimination is precise but it can get slow when a network grows large and tangled. The escape hatch: stop computing and start sampling. Draw a pile of full assignments from the net’s joint distribution, count what fraction matched your query, and report that as the answer.
The three classic recipes — rejection sampling, likelihood weighting, and Gibbs sampling — all do this, with different trade-offs. The headline you must remember for GATE: all three are approximate. They get closer to the true posterior as you draw more samples, but for any finite sample size they’re estimates. This trade — accept approximation to stay tractable — is exactly why modern Bayesian ML leans on sampling: MCMC and its relatives are what tools like Stan and PyMC run to fit models too large for exact inference.
The three methods, at a glance
Rejection sampling. Draw a full sample from the prior — sample each node in topological order from its CPT. If the sample’s evidence values match the observed ones, keep it; otherwise throw it away. Estimate the posterior from the kept samples. Simple, but wasteful: when the evidence is rare, almost every sample is thrown away.
Likelihood weighting. Fix the evidence variables to their observed values
(don’t sample them). Sample the non-evidence variables from their CPTs. Weight
each sample by the product of the evidence-node CPT entries:
w = ∏ᵢ P(Eᵢ = eᵢ | Parents(Eᵢ)). No sample is wasted, but variance grows
when you condition on many evidence nodes.
Gibbs sampling (MCMC). Start anywhere consistent with the evidence. Repeatedly pick a non-evidence variable and resample it given the current values of all the others (its Markov blanket). The chain converges to the true posterior in the limit. Handles rare evidence; needs burn-in.
Worked example — one likelihood-weighting weight
A 3-node net
A → B → E. CPTs:P(A=1) = 0.5,P(B=1 | A=1) = 0.6,P(B=1 | A=0) = 0.2,P(E=T | B=1) = 0.9,P(E=T | B=0) = 0.3. The evidence isE = T. We sampleA = TfromP(A), thenB = FfromP(B | A=T). What is the weight of this sample?
In likelihood weighting, the evidence is fixed — we don’t sample it. The weight is the product of the CPT entries for each evidence node, evaluated at its observed value given the parents that this sample chose.
Here only E is evidence, with parent B. This sample set B = F, so:
w = P(E = T | B = F) = 0.3
That’s it — one factor per evidence node. The sample {A=T, B=F, E=T} is
recorded with weight 0.3 and contributes 0.3 to the running tally for
queries involving A=T, B=F. Posterior estimates use the weighted counts:
P(Q = q | E = e) ≈ Σ samples where Q=q w_sample
──────────────────────────────
Σ all samples w_sample
This pattern — compute the weight as the product of evidence CPT entries — is the GATE DA 2024 Q32 shape.
How GATE asks this
Two recurring patterns. MCQ: a list of methods, asked to classify each as exact or approximate — variable elimination is the only exact one; rejection, likelihood weighting, and Gibbs are approximate. MSQ: properties of the sampling methods — rejection wastes samples when evidence is rare; likelihood weighting weights by the evidence-CPT product and wastes none; Gibbs resamples one variable at a time. GATE DA 2025 and 2026 both ran this MCQ.
Quick check
Quick check
Practice this in an interview
All questionsThe main probability sampling methods are simple random sampling, stratified sampling, cluster sampling, and systematic sampling. Bias enters when some units have a zero or systematically different probability of selection — as in convenience sampling, survivorship bias, or non-response bias — making the sample unrepresentative of the target population regardless of size.
Bootstrapping estimates the sampling distribution of a statistic by repeatedly drawing samples with replacement from the observed data and computing the statistic on each resample. It works when the analytic sampling distribution is unknown, intractable, or the sample size is too small for asymptotic approximations to hold.
Grid search exhaustively tries every combination in a predefined grid, which is only practical for 1–2 hyperparameters. Random search samples combinations uniformly at random and finds good values faster per compute budget, especially when only a few hyperparameters actually matter. Bayesian optimisation fits a surrogate model of the objective and proposes the next trial intelligently, giving the best sample efficiency for expensive evaluations.
Batch inference runs predictions on large datasets on a schedule, optimizing for throughput. Online inference serves individual requests in real time, optimizing for low latency. Streaming inference processes continuous event streams with bounded latency requirements between the two extremes.