Explain joint, marginal, and conditional distributions and how to move between them.

The joint distribution P(X, Y) fully specifies two random variables together. Marginals P(X) and P(Y) are obtained by summing (or integrating) the joint over the other variable. Conditionals P(X|Y=y) are the joint sliced at a fixed y value, renormalized by the marginal P(Y=y).
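As a minimal sketch of these three moves, the snippet below stores a small joint table as a dictionary, sums it to get the marginals, and slices and renormalises it to get a conditional. The variable names and all four probabilities are made up for illustration.

```python
# Hypothetical joint P(X, Y) with X in {"rain", "dry"} and Y in {"late", "on_time"}.
# The numbers are illustrative only; they just need to sum to 1.
joint = {
    ("rain", "late"): 0.15,
    ("rain", "on_time"): 0.15,
    ("dry", "late"): 0.07,
    ("dry", "on_time"): 0.63,
}

def marginal_x(joint):
    """P(X): sum the joint over every value of Y."""
    out = {}
    for (x, _y), p in joint.items():
        out[x] = out.get(x, 0.0) + p
    return out

def marginal_y(joint):
    """P(Y): sum the joint over every value of X."""
    out = {}
    for (_x, y), p in joint.items():
        out[y] = out.get(y, 0.0) + p
    return out

def conditional_x_given_y(joint, y):
    """P(X | Y=y): slice the joint at Y=y, then divide by P(Y=y)."""
    p_y = marginal_y(joint)[y]
    return {x: p / p_y for (x, yy), p in joint.items() if yy == y}

p_x = marginal_x(joint)
p_y = marginal_y(joint)
p_x_given_late = conditional_x_given_y(joint, "late")
print(p_x, p_y, p_x_given_late)
```

Going the other way, multiplying the conditional by the marginal it was divided by recovers the joint: P(X, Y=y) = P(X | Y=y) · P(Y=y).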

What is conditional probability, and how does it differ from joint probability?

Conditional probability P(A|B) is the probability of A given that B is known to have occurred, computed as P(A and B) / P(B) and defined only when P(B) > 0. It narrows the sample space to B, whereas joint probability P(A and B) lives in the full, unrestricted space.

When does each common distribution arise — Bernoulli, Binomial, Poisson, Normal, Exponential, Uniform?

Each distribution has a natural generative story: Bernoulli is a single coin flip; Binomial counts successes in a fixed number of independent Bernoulli trials; Poisson counts events that arrive independently at a constant average rate over a fixed interval; Normal emerges from sums of many small independent effects; Exponential models waiting times between Poisson events; Uniform assigns equal probability across a range. Choosing correctly comes from matching that story to the data-generating process.
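One of these stories is easy to check by simulation. The sketch below builds a Binomial draw by adding up Bernoulli coin flips and compares the sample mean with n·p. The parameter values, the seed, and the helper names are arbitrary choices for illustration.

```python
import random

def bernoulli(p, rng):
    """One coin flip: 1 with probability p, else 0."""
    return 1 if rng.random() < p else 0

def binomial(n, p, rng):
    """Number of successes in n independent Bernoulli(p) flips."""
    return sum(bernoulli(p, rng) for _ in range(n))

rng = random.Random(0)  # fixed seed so the run is repeatable
n, p, trials = 10, 0.3, 10_000
draws = [binomial(n, p, rng) for _ in range(trials)]
sample_mean = sum(draws) / trials
print(f"sample mean = {sample_mean:.3f}, theoretical mean n*p = {n * p:.3f}")
```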

State the law of total probability and give a concrete example of when you'd apply it.

The law of total probability decomposes P(A) over a mutually exclusive, exhaustive partition of the sample space: P(A) = Σ P(A|Bᵢ)·P(Bᵢ). It is the engine behind the Bayes denominator and any calculation where you want an overall rate built from segment-level rates.
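A small sketch of the "overall rate from segment-level rates" case: the segments form the partition Bᵢ, each share is P(Bᵢ), and each rate is P(A | Bᵢ). The segment names and numbers are hypothetical.

```python
# Hypothetical traffic mix and per-segment conversion rates (illustrative numbers).
segments = {
    "mobile":  {"share": 0.60, "rate": 0.02},
    "desktop": {"share": 0.30, "rate": 0.05},
    "tablet":  {"share": 0.10, "rate": 0.03},
}

def total_probability(segments):
    """P(A) = sum_i P(A | B_i) * P(B_i) over a partition {B_i}."""
    shares = sum(s["share"] for s in segments.values())
    if abs(shares - 1.0) > 1e-9:
        raise ValueError("segments must be exhaustive: shares should sum to 1")
    return sum(s["rate"] * s["share"] for s in segments.values())

overall_rate = total_probability(segments)
print(f"overall conversion rate = {overall_rate:.4f}")
```

The check on the shares is the "exhaustive" half of the condition; mutual exclusivity has to come from how the segments are defined, since no sum can verify it.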

Bayesian Networks & Joint Factorization — GATE DA

What you'll learn

A Bayesian network is a DAG where each node carries a conditional probability table (CPT) given its parents

The joint factorises as a product of CPTs — one factor per node, conditioned only on its parents

Conditional independence: every node is independent of its non-descendants given its parents

Posterior inference on a tiny net is just Bayes' theorem with the joint expanded by the factorisation

Last lesson left logic stranded on the shore of certainty, with no word for “usually” and no way to hold a belief at degree 0.9 — exactly what the real world demands. Here is the tool for reasoning in degrees. Diseases cause symptoms; symptoms cause test results. A Bayesian network is a clean way to draw who influences whom, and then ask probabilistic questions on top of the picture.

Each node is a random variable; each arrow says “this directly influences that.” Attach one small probability table per node — its conditional probability table (CPT) — and you have quietly specified the entire joint distribution, in a fraction of the numbers it would take to list the joint outright. That compactness is the whole point: it is the model behind real diagnostic systems (medical risk, equipment fault-finding) and the probabilistic graphical models used across ML to reason under uncertainty, and it is what makes such reasoning tractable at all.

The DAG and its CPTs

A Bayesian network is a directed acyclic graph (DAG) in which every node X carries a CPT P(X | Parents(X)). No cycles, no self-influence — just parents pointing to children.

Figure: the alarm network, Burglary → Alarm ← Earthquake (B → A ← E). Three nodes, three CPTs. The joint is their product; that is the factorisation.

The Alarm node needs a CPT with one row per parent combination — four rows, one for each setting of (B, E):

B	E	P(A=1 | B, E)
1	1	0.95
1	0	0.94
0	1	0.29
0	0	0.001

Alarm’s CPT: one row per parent assignment. The four listed values need not sum to 1, because each row is its own distribution over A: within a row, P(A=0 | B, E) = 1 − P(A=1 | B, E).
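One way to hold this CPT in code is a dictionary keyed by the parent assignment. The sketch below uses the four values from the table above; the names `P_A1_GIVEN_BE` and `p_alarm` are illustrative, not part of any library.

```python
# Alarm CPT from the table above, keyed by the parent assignment (B, E).
# Each entry stores P(A=1 | B, E); P(A=0 | B, E) is its complement.
P_A1_GIVEN_BE = {
    (1, 1): 0.95,
    (1, 0): 0.94,
    (0, 1): 0.29,
    (0, 0): 0.001,
}

def p_alarm(a, b, e):
    """Look up P(A=a | B=b, E=e) for binary a, b, e."""
    p1 = P_A1_GIVEN_BE[(b, e)]
    return p1 if a == 1 else 1.0 - p1

# Within a row, the two outcomes of A sum to 1 ...
row_sums = {be: p_alarm(1, *be) + p_alarm(0, *be) for be in P_A1_GIVEN_BE}
# ... but the column of P(A=1 | .) values has no reason to.
column_sum = sum(P_A1_GIVEN_BE.values())
print(row_sums, column_sum)
```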

The joint factorisation

For any Bayes net on n variables:

P(X₁, X₂, …, Xₙ) = ∏ᵢ P(Xᵢ | Parents(Xᵢ))

That is the headline. Instead of one giant table over all 2ⁿ joint outcomes (for n binary variables), you store one small table per node, conditioned only on its parents and not on every earlier variable. That is where the savings come from.

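The matching independence statement: each node is independent of its non-descendants given its parents. In the alarm net, B and E have no parents and neither descends from the other, so they are independent of each other. And in any larger net that contains this fragment, once you know B and E, the alarm is independent of every other non-descendant, because its parents fully explain it.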
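The factorisation and the independence claim can both be checked numerically on the alarm net. This sketch takes the Alarm CPT from the table earlier in the lesson and the priors P(B)=0.001 and P(E)=0.002 from the practice questions further down; the helper names are illustrative.

```python
from itertools import product

# Priors as given in this lesson's practice questions; CPT from the Alarm table.
P_B1 = 0.001
P_E1 = 0.002
P_A1_GIVEN_BE = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}

def bern(p1, value):
    """P(V=value) for a binary variable with P(V=1) = p1."""
    return p1 if value == 1 else 1.0 - p1

def joint(b, e, a):
    """P(B=b, E=e, A=a) = P(b) * P(e) * P(a | b, e): one factor per node."""
    return bern(P_B1, b) * bern(P_E1, e) * bern(P_A1_GIVEN_BE[(b, e)], a)

# The eight products form a proper distribution.
total = sum(joint(b, e, a) for b, e, a in product((0, 1), repeat=3))

# Parameter count: one free number per CPT row versus 2^n - 1 for a full table.
cpt_params = 1 + 1 + len(P_A1_GIVEN_BE)  # B, E, and the four Alarm rows
full_joint_params = 2 ** 3 - 1

# B and E have no parents and are non-descendants of each other, so
# summing out A should leave P(B=1, E=1) = P(B=1) * P(E=1).
p_b1_e1 = sum(joint(1, 1, a) for a in (0, 1))
print(total, cpt_params, full_joint_params, p_b1_e1)
```

At three nodes the saving is only 6 parameters against 7. The gap widens quickly as n grows, provided each node keeps a small number of parents.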

Worked example — GATE DA 2026

A 2-node net Disease → Test with P(D) = 0.3, P(+ | D) = 0.8, P(+ | ¬D) = 0.1. The test reads positive. Compute P(D | +).

The joint factorises as P(D, T) = P(D) · P(T | D), so the joint for the observed outcome D = 1, T = + is:

P(D=1, +) = P(D) · P(+ | D) = 0.30 · 0.80 = 0.24

The posterior follows from Bayes’ theorem. Its denominator is the total probability of a positive test, which is just the joint summed over D:

P(D | +) = P(+ | D)·P(D) / [ P(+ | D)·P(D) + P(+ | ¬D)·P(¬D) ]

         = (0.80 · 0.30) / (0.80 · 0.30 + 0.10 · 0.70) = 0.24 / 0.31 ≈ 0.77

So ≈ 0.77 — the same value as the disease/test problem from the Bayes’ theorem lesson (and GATE DA 2026, Q57). Drawing it as a network did not change the answer; the factorisation just made the joint mechanical to write down before plugging into Bayes.
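The same two steps, factorise and then normalise, can be written as a short function. This is a sketch for the two-node Disease → Test net only; the function name and argument names are illustrative.

```python
def posterior_disease_given_positive(p_d, p_pos_given_d, p_pos_given_not_d):
    """Return P(D | +) for the two-node net Disease -> Test."""
    # Step 1: joint entries consistent with the evidence T = +.
    joint_d_pos = p_d * p_pos_given_d                    # P(D=1, +)
    joint_not_d_pos = (1.0 - p_d) * p_pos_given_not_d    # P(D=0, +)
    # Step 2: normalise by P(+), the sum of those joint entries.
    return joint_d_pos / (joint_d_pos + joint_not_d_pos)

p = posterior_disease_given_positive(0.3, 0.8, 0.1)
print(round(p, 2))  # 0.77
```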

Lower the prior and the posterior collapses: when the disease is rare, base rates dominate Bayes-net inference too.

Why a 99%-accurate test still mostly catches healthy people

[Interactive figure: a grid in which each cell is one person, coloured as sick + positive (TP), sick + negative (FN), healthy + positive (FP), or healthy + negative (TN). Sliders set prevalence (1% to 50%) and test accuracy (50% to 99%).]

False positives (healthy but tested positive) can outnumber true positives when the disease is rare. At 5% prevalence and 95% test accuracy, the grid shows 5 true positives and 5 false positives, 10 positives in total:

P(sick | positive) = 5 TP / (5 TP + 5 FP) = 0.50
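The head-count view can be reproduced with a few lines of arithmetic. The sketch below assumes that "test accuracy" means sensitivity and specificity are both equal to that number, which is an assumption about the figure and not something the lesson states. The function name and the population size are illustrative.

```python
def positives_breakdown(prevalence, accuracy, population=100):
    """Expected true/false positives, assuming sensitivity = specificity = accuracy."""
    sick = prevalence * population
    healthy = population - sick
    true_pos = accuracy * sick
    false_pos = (1.0 - accuracy) * healthy
    return true_pos, false_pos, true_pos / (true_pos + false_pos)

for prevalence in (0.50, 0.05, 0.01):
    tp, fp, ppv = positives_breakdown(prevalence, accuracy=0.95)
    print(f"prevalence={prevalence:.2f}  TP={tp:.2f}  FP={fp:.2f}  P(sick|+)={ppv:.2f}")
```

At 5% prevalence and 95% accuracy the expected counts per 100 people are 4.75 true positives and 4.75 false positives, which matches the 5-and-5 readout above after rounding to whole people.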

How GATE asks this

Usually a NAT: a small 3- or 4-node net with CPTs given, asking for the joint probability of a specific assignment (multiply down the factorisation), or a posterior of one variable given evidence on another (factorise the joint, then Bayes). MSQs ask which independence statements the DAG implies. GATE DA 2026 Q57 was the disease/positive-test posterior above (answer 0.77); GATE DA 2024 ran the same machinery on a slightly larger net. The recipe never changes: write the joint as a product of CPTs, plug in the observed values, normalise.

In one breath

A Bayesian network is a DAG whose every node carries a CPT P(X | Parents(X)), and the whole joint factorises as ∏ᵢ P(Xᵢ | Parents(Xᵢ)) — one small table per node instead of a 2ⁿ-entry monster — which encodes the independence that each node is independent of its non-descendants given its parents; you answer a query by multiplying the relevant CPT entries into the joint and applying Bayes, and the standing trap is conditioning a node on all earlier variables instead of only its parents.

Practice

Quick check

Q1 (Recall). Which statements about a Bayesian network on variables X₁, …, Xₙ are TRUE? (Select all that apply.)

Q2 (Recall). Which are valid reasons to use a Bayesian network instead of storing the full joint distribution? (Select all that apply.)

Q3 (Trace). In the Burglary → Alarm ← Earthquake net with P(B)=0.001, P(E)=0.002, P(A=1 | B=1, E=1)=0.95, compute the joint P(B=1, E=1, A=1). Enter the value in millionths (so 1.9 means 1.9 × 10⁻⁶). (Numerical answer.)

Q4 (Trace). In the Disease → Test net (P(D)=0.3, P(+|D)=0.8, P(+|¬D)=0.1), what is the joint P(D=1, T=+)? Give 3 decimals. (Numerical answer.)

Q5 (Trace). A 2-node net Cloudy → Rain has P(C=1) = 0.5, P(R=1 | C=1) = 0.8, P(R=1 | C=0) = 0.2. It rained. Compute P(C=1 | R=1) to 2 decimals. (Numerical answer.)

Q6 (Apply). For a Bayes net on 4 binary variables with no edges (all independent), how many independent parameters specify the joint? Compare to 2⁴−1 = 15 for an unstructured joint. (Numerical answer.)

A question to carry forward

A Bayes net stores the whole joint in a handful of small tables — wonderfully compact. But storing the joint and querying it are different acts. Ask the natural question — “given the alarm went off, what is the chance it was a burglary?” — and you must sum the joint over every variable you did not observe (here, the earthquake). For three nodes that is a tiny sum. For a net of thirty variables, the marginal you want hides inside a sum with billions of terms.
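For the three-node alarm net, that brute-force sum is short enough to write out. The sketch below uses the Alarm CPT from this lesson and the priors from the practice questions, sums the joint over the unobserved earthquake variable, and normalises. It is plain enumeration, not the more efficient algorithm the next lesson covers, and the helper names are illustrative.

```python
# Alarm net: Burglary -> Alarm <- Earthquake, with this lesson's numbers.
P_B1, P_E1 = 0.001, 0.002
P_A1_GIVEN_BE = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}

def bern(p1, value):
    """P(V=value) for a binary variable with P(V=1) = p1."""
    return p1 if value == 1 else 1.0 - p1

def joint(b, e, a):
    """P(B=b, E=e, A=a) as the product of the three CPT entries."""
    return bern(P_B1, b) * bern(P_E1, e) * bern(P_A1_GIVEN_BE[(b, e)], a)

def posterior_burglary_given_alarm():
    """P(B=1 | A=1): sum the joint over the unobserved E, then normalise."""
    unnormalised = {b: sum(joint(b, e, 1) for e in (0, 1)) for b in (0, 1)}
    return unnormalised[1] / (unnormalised[0] + unnormalised[1])

p_burglary = posterior_burglary_given_alarm()
print(f"P(B=1 | A=1) = {p_burglary:.4f}")
```

With one hidden binary variable the inner sum has two terms per value of B. Each additional hidden binary variable doubles it.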

Yet stare at that sum and the same sub-expressions repeat over and over. Surely you need not expand the entire joint just to collapse it again. Here is the thread onward: is there a way to compute that marginal exactly — the true probability, no approximation — without ever writing the full joint down, by summing out one variable at a time and reusing the partial results so the work does not explode? That careful, order-sensitive summing-out is the next lesson’s algorithm.
