Graphs for ML & Knowledge Graphs
How the graph abstractions from this chapter — nodes, edges, BFS, adjacency — quietly power knowledge graphs, node embeddings, GNNs, and GraphRAG in production AI systems.
What you'll learn
- How knowledge graphs encode facts as subject–predicate–object triples and why that structure lets LLMs reason over structured data
- What node2vec / DeepWalk do — random walks treated as sentences, then word2vec-style training to produce node embeddings
- The message-passing intuition behind Graph Neural Networks — each node aggregates its neighbors, stacking k layers reaches k hops
- How PageRank's random-surfer model defines node importance through power iteration on the adjacency structure
Before you start
Every algorithm in this chapter — BFS, DFS, adjacency lists, edge traversal — was introduced as a tool for solving graph puzzles. It turns out those same primitives are the foundation of some of the most commercially valuable ML systems running today. This lesson maps the DSA concepts you already know onto their ML counterparts.
Knowledge graphs: facts as triples
A knowledge graph stores the world as a set of triples: (subject, predicate, object).
(Paris, capital_of, France)
(Marie Curie, won, Nobel Prize in Physics)
(Aspirin, treats, Headache)
Nodes are entities (people, places, concepts, drugs). Edges are typed relations — the predicate gives the edge a label, which is richer than an unweighted adjacency list. In graph terms, you already know this structure: it is a directed, labeled multigraph.
Why LLMs care
A large language model generates text from statistical patterns. It has no reliable way to know whether a specific drug–drug interaction exists, or who the current CEO of a company is. Grounding the model in a knowledge graph solves this.
GraphRAG (Graph Retrieval-Augmented Generation) is the pattern: instead of retrieving a flat list of text chunks, the system traverses a subgraph around the query entities and passes that structured context to the model. BFS from a query node out to depth 2 or 3 retrieves the relevant neighborhood. The model gets structured, verifiable facts rather than potentially contradicting prose.
The retrieval step is exactly neighborhood traversal on an adjacency structure — the same operation as printing all nodes reachable within k hops.
Node embeddings: word2vec for graphs
Word2vec learns vector representations of words by training a shallow neural network to predict a word from its context words in a sentence. The insight behind DeepWalk and node2vec is: generate “sentences” over the graph using random walks, then apply word2vec.
The procedure in plain terms:
- For each node, run many random walks of length L, following edges at random (node2vec biases the walk toward BFS-like or DFS-like behavior via two hyperparameters).
- Treat each walk as a sentence, each node ID as a token.
- Train a skip-gram model: predict which nodes appear near a given node across all walks.
- The hidden-layer weights become the node embeddings.
Nodes that are structurally close or share similar neighborhoods end up with similar vectors. You can now compute cosine similarity between nodes, cluster them, or feed the embeddings into a downstream classifier — all without ever looking at the raw graph.
The key connection: a random walk is a probabilistic traversal of the adjacency list. It uses the same data structure you have already built.
Graph Neural Networks: message passing
A GNN generalizes the idea of “aggregate your neighbors” into a learnable operation.
Each node starts with a feature vector (e.g., a one-hot encoding, or an embedding). One round of message passing works like this:
- Every node sends its current vector to each of its neighbors.
- Each node collects all received vectors and aggregates them — mean, sum, or max.
- Each node updates its own vector using its current vector and the aggregated result (through a learned linear layer + activation).
Stack k such layers and the node’s vector encodes information from all nodes within k hops. A node 3 hops away influences the representation after 3 rounds — exactly the same radius concept as BFS depth.
This is what makes GNNs powerful for molecular graphs (atoms as nodes, bonds as edges), social graphs, and recommendation graphs. The architecture mirrors the graph’s structure directly.
PageRank: importance flows along links
PageRank assigns a score to every node by modeling a random surfer: at each step, the surfer follows a random outgoing edge with probability d (the damping factor, typically 0.85) or teleports to a random node with probability 1 - d. The teleportation term is necessary because nodes with no outgoing edges — “dangling” nodes — would otherwise drain rank out of the graph entirely and the scores would not converge to a meaningful distribution.
The steady-state probability of being at a node is its PageRank score. The iterative update rule for node v is:
PR(v) = (1 - d) / N + d * sum( PR(u) / out_degree(u) for u in in_neighbors(v) )
Running this update for all nodes repeatedly — a technique called power iteration, where you apply the same linear rule until scores stop changing — converges in roughly 20–50 iterations on most graphs. High-in-degree nodes from high-PageRank sources receive more of the “importance flow.”
In adjacency-list terms: for each node, iterate over its in-neighbors, accumulate their current scores divided by their out-degree, scale by d, add the teleport term. One round is O(V + E) — a single pass over the edge list.
The message-passing kernel, live
This playground runs one round of neighbor-feature aggregation on a small graph. Each node has a one-dimensional feature. The update rule is: new feature = mean of neighbors’ features. That is the simplest possible GNN layer.
Watch the values converge toward the graph’s average over rounds. In a real GNN the aggregation is a learned weighted sum and the update applies a nonlinear transformation — but the traversal logic is identical to what you see here.
Quick check
Quick check
Practice this in an interview
All questionsAn embedding is a dense, learned vector representation of a discrete or high-dimensional object — a word, image, user, product — in a continuous low-dimensional space. Proximity in embedding space reflects semantic or behavioural similarity, making embeddings a universal interface between raw data and neural networks.
RAG couples a retrieval step — fetching relevant documents from an external store — with a generative model so the LLM can answer questions about knowledge it was never trained on. It solves the stale-knowledge and hallucination problems without retraining. The pattern is preferred when the knowledge base changes frequently or contains proprietary data.
Deep learning wins when data is abundant, inputs are unstructured (images, text, audio), and features are hard to engineer by hand. Classical ML wins on structured tabular data, small datasets, and when interpretability or training speed matter.
Production ML monitoring spans four layers: data quality (schema, distributions, null rates), model behaviour (prediction drift, confidence calibration), operational health (latency, error rate, throughput), and business KPIs (conversion, revenue impact). Each layer has different owners and different alert thresholds.