datarekha

Approximate Inference: Sampling

When exact inference is too slow, draw a few thousand samples from the Bayes net and estimate the answer. Rejection, likelihood-weighting, and Gibbs — three classic recipes.

7 min read Intermediate GATE DA Lesson 107 of 122

What you'll learn

  • Why sampling — exact inference (VE / enumeration) gets too expensive on large networks
  • Rejection sampling: draw full samples, discard those that don't match the evidence
  • Likelihood weighting: fix the evidence, weight each sample by the product of evidence CPT entries
  • Gibbs sampling: an MCMC method that resamples one variable at a time given the others
  • All three sampling methods are APPROXIMATE; only VE / enumeration give exact answers

Before you start

Variable elimination is precise but it can get slow when a network grows large and tangled. The escape hatch: stop computing and start sampling. Draw a pile of full assignments from the net’s joint distribution, count what fraction matched your query, and report that as the answer.

The three classic recipes — rejection sampling, likelihood weighting, and Gibbs sampling — all do this, with different trade-offs. The headline you must remember for GATE: all three are approximate. They get closer to the true posterior as you draw more samples, but for any finite sample size they’re estimates. This trade — accept approximation to stay tractable — is exactly why modern Bayesian ML leans on sampling: MCMC and its relatives are what tools like Stan and PyMC run to fit models too large for exact inference.

The three methods, at a glance

Rejection sampling. Draw a full sample from the prior — sample each node in topological order from its CPT. If the sample’s evidence values match the observed ones, keep it; otherwise throw it away. Estimate the posterior from the kept samples. Simple, but wasteful: when the evidence is rare, almost every sample is thrown away.

Likelihood weighting. Fix the evidence variables to their observed values (don’t sample them). Sample the non-evidence variables from their CPTs. Weight each sample by the product of the evidence-node CPT entries: w = ∏ᵢ P(Eᵢ = eᵢ | Parents(Eᵢ)). No sample is wasted, but variance grows when you condition on many evidence nodes.

Gibbs sampling (MCMC). Start anywhere consistent with the evidence. Repeatedly pick a non-evidence variable and resample it given the current values of all the others (its Markov blanket). The chain converges to the true posterior in the limit. Handles rare evidence; needs burn-in.

100%0%~20% keptRejectionmost thrown awayall weightedLikelihood weightingno waste; weights varyall used (post burn-in)Gibbs (MCMC)converges in the limit
Rejection throws most samples away when evidence is rare. Likelihood weighting and Gibbs use every sample.

Worked example — one likelihood-weighting weight

A 3-node net A → B → E. CPTs: P(A=1) = 0.5, P(B=1 | A=1) = 0.6, P(B=1 | A=0) = 0.2, P(E=T | B=1) = 0.9, P(E=T | B=0) = 0.3. The evidence is E = T. We sample A = T from P(A), then B = F from P(B | A=T). What is the weight of this sample?

In likelihood weighting, the evidence is fixed — we don’t sample it. The weight is the product of the CPT entries for each evidence node, evaluated at its observed value given the parents that this sample chose.

Here only E is evidence, with parent B. This sample set B = F, so:

w = P(E = T | B = F) = 0.3

That’s it — one factor per evidence node. The sample {A=T, B=F, E=T} is recorded with weight 0.3 and contributes 0.3 to the running tally for queries involving A=T, B=F. Posterior estimates use the weighted counts:

P(Q = q | E = e)  ≈   Σ samples where Q=q  w_sample
                      ──────────────────────────────
                          Σ all samples  w_sample

This pattern — compute the weight as the product of evidence CPT entries — is the GATE DA 2024 Q32 shape.

How GATE asks this

Two recurring patterns. MCQ: a list of methods, asked to classify each as exact or approximate — variable elimination is the only exact one; rejection, likelihood weighting, and Gibbs are approximate. MSQ: properties of the sampling methods — rejection wastes samples when evidence is rare; likelihood weighting weights by the evidence-CPT product and wastes none; Gibbs resamples one variable at a time. GATE DA 2025 and 2026 both ran this MCQ.

Quick check

Quick check

0/6
Q1Which of the following Bayes-net inference methods are APPROXIMATE (i.e., not exact)?
Q2In a net A → B → E with P(E=T | B=1)=0.9 and P(E=T | B=0)=0.3, evidence E=T. A likelihood-weighting sample produces A=F, B=T. What weight does this sample receive? (2 decimals)numerical answer — type a number
Q3Which statements about likelihood weighting are TRUE? (select all that apply)select all that apply
Q4Which statements about rejection sampling are TRUE? (select all that apply)select all that apply
Q5Which statements about Gibbs sampling on a Bayes net are TRUE? (select all that apply)select all that apply
Q6Pick the single inference method that gives an EXACT posterior on a Bayes net (up to floating-point error).

Practice this in an interview

All questions
What are the main sampling methods and how can sampling introduce bias?

The main probability sampling methods are simple random sampling, stratified sampling, cluster sampling, and systematic sampling. Bias enters when some units have a zero or systematically different probability of selection — as in convenience sampling, survivorship bias, or non-response bias — making the sample unrepresentative of the target population regardless of size.

What is bootstrapping, and when should you use resampling methods?

Bootstrapping estimates the sampling distribution of a statistic by repeatedly drawing samples with replacement from the observed data and computing the statistic on each resample. It works when the analytic sampling distribution is unknown, intractable, or the sample size is too small for asymptotic approximations to hold.

When should you use grid search vs random search vs Bayesian optimisation for hyperparameter tuning?

Grid search exhaustively tries every combination in a predefined grid, which is only practical for 1–2 hyperparameters. Random search samples combinations uniformly at random and finds good values faster per compute budget, especially when only a few hyperparameters actually matter. Bayesian optimisation fits a surrogate model of the objective and proposes the next trial intelligently, giving the best sample efficiency for expensive evaluations.

What are the differences between batch, online, and streaming inference, and when should you use each?

Batch inference runs predictions on large datasets on a schedule, optimizing for throughput. Online inference serves individual requests in real time, optimizing for low latency. Streaming inference processes continuous event streams with bounded latency requirements between the two extremes.

Sign in to track your progress

Completed lessons, your XP, level, and streak save to your account — it's free and takes a few seconds.

Explore further

Related lessons

Skip to content