datarekha

Balanced Trees & B-Trees

Why a plain BST can degenerate to O(n), how self-balancing trees fix it with rotations, and why databases use B+-trees — not binary trees — for their indexes.

7 min read Advanced Data Structures & Algorithms Lesson 19 of 32

What you'll learn

  • Why a BST built from sorted input collapses to a linked list and what "balance" actually means
  • How AVL and red-black trees use rotations to maintain O(log n) height after every insert or delete
  • Why B-trees use wide fanout instead of binary splits, and how that maps nodes to disk pages
  • Why almost every SQL index is a B+-tree and why indexed lookups cost roughly 3-4 disk reads, not n

Before you start

A binary search tree gives you O(log n) search — but only when the tree is balanced. Feed it the wrong data and that guarantee evaporates entirely.

The degeneration problem

Build a BST by inserting 1, 2, 3, 4, 5 in order. Each new value is larger than everything already in the tree, so every insert goes to the rightmost position. The result looks like this:

1
 \
  2
   \
    3
     \
      4
       \
        5

That is a linked list with extra steps. Searching for 5 visits every single node. Height = n, search = O(n). The BST property is intact but the performance guarantee is gone.

This is not a contrived edge case. Sorted data, reverse-sorted data, and many real-world access patterns produce exactly this shape.

The fix is to keep the tree height-balanced: for n nodes, height stays at roughly log₂ n. That means at most ~17 levels for a million-node tree instead of a million levels.

Self-balancing BSTs

A self-balancing BST is one that automatically adjusts its shape after each insert or delete to keep height close to log₂n. It does this by enforcing an additional invariant on top of the BST property. Every insert and delete can trigger a rotation — a local restructuring that changes the shape without breaking the BST ordering property.

What a rotation does

Imagine this subtree (values satisfy BST order throughout):

    y
   / \
  x   C
 / \
A   B

A right rotation at y produces:

  x
 / \
A   y
   / \
  B   C

The BST property is preserved: everything in A < x, everything in B is between x and y, everything in C > y. But the height of this subtree dropped by one. A left rotation is the mirror image. That is the entire mechanical idea — no recopying, no rebuilding, just relinking three pointers.

AVL trees

AVL trees maintain a strict invariant: for every node, the height of its left and right subtrees may differ by at most 1. After each insert or delete, the tree walks back up to the root, checks the balance factor at each node, and fires one or two rotations wherever the invariant is violated. Height stays within a constant factor of log n; all operations remain O(log n).

The tradeoff is bookkeeping cost: AVL trees rebalance aggressively, which means more rotations on insert-heavy workloads.

Red-black trees

Red-black trees use a looser invariant: nodes are colored red or black, and a set of coloring rules ensures no path from root to leaf is more than twice as long as any other. This relaxes AVL’s strictness but reduces the average number of rotations per insert.

The practical upshot: red-black trees back most standard-library ordered collections. C++ std::map and Java TreeMap both use red-black trees internally. Rust BTreeMap uses a B-tree (a different balanced structure covered below). All give O(log n) insert, delete, and lookup with zero configuration.

Python has no built-in balanced BST. For ordered data you reach for bisect (binary search on a sorted list), dict (hash map, O(1) average but unordered), or the third-party sortedcontainers.SortedList which achieves O(log n) in practice.

Seeing degeneration in code

The playground below inserts 1 through 15 into a plain BST in sorted order, then measures the resulting height. It also computes the ideal balanced height — log₂(15) — so you can see exactly how far the naive tree has drifted.

B-trees: trading depth for width

Self-balancing BSTs solve the degeneration problem in memory. But databases have a different constraint: data lives on disk, and a disk read (even an SSD seek) costs orders of magnitude more than a memory access. You want to minimize the number of disk reads, not just the number of comparisons.

A binary tree of a million nodes has height ~20. Each node on a different disk page means up to 20 reads to find one record. B-trees solve this by increasing fanout: instead of two children per node, a B-tree node holds hundreds of keys and hundreds of child pointers. A node fills exactly one disk page.

         [30 | 70]
        /     |     \
 [10|20]   [40|50|60]   [80|90]

This is a tiny B-tree with fanout 3. Real database indexes use fanout in the hundreds. A tree of height 3 or 4 can index tens of millions of rows, and each level corresponds to exactly one disk read.

B+-trees specifically

Databases almost always use the B+-tree variant, where:

  • All actual data records (or row pointers) live only in leaf nodes.
  • Internal nodes hold only keys and child pointers — they act as a routing layer.
  • Leaf nodes are linked in a sorted chain, making range scans efficient: find the start key, then walk the leaf chain without going back up the tree.

This separation keeps internal nodes compact (more keys per page, less I/O for traversal) and makes range queries — WHERE price BETWEEN 10 AND 50 — cheap.

Quick check

Quick check

0/4
Q1You insert the values 10, 20, 30, 40, 50 into a plain BST in that order. What is the height of the resulting tree?
Q2A rotation in a self-balancing BST changes the tree's shape. What property does it always preserve?
Q3A database index on a table with 10 million rows requires roughly 3-4 disk reads to find a record. The main reason is:
Q4You build a balanced BST with 1,000 nodes and verify its height is 10. You then insert one more node. Without rebalancing, is the tree still guaranteed to be balanced?

Practice this in an interview

All questions
How does a B-tree index work, and when does the database choose not to use it?

A B-tree index stores key values in a balanced tree of sorted nodes, allowing the engine to reach any value in O(log n) page reads instead of scanning every row. The optimizer skips the index when the estimated cost of random I/O exceeds a full-table scan, when a function wraps the indexed column, or when the query returns such a large fraction of rows that a sequential scan is cheaper.

Return the level-order (BFS) traversal of a binary tree as a list of lists, one per level.

Use a queue (deque) to process nodes layer by layer. At each step, snapshot the current queue length to know exactly how many nodes belong to the current level, drain those, then enqueue their children. The result is a list of lists without any depth-tracking variable.

Search for a target in a rotated sorted array in O(log n).

Even after rotation, one of the two halves around mid is always fully sorted. Check which half is sorted, then decide whether the target falls inside it. If yes, narrow to that half; if no, search the other. This keeps binary search's O(log n) guarantee.

Implement binary search correctly — and explain the off-by-one traps.

Binary search halves the search space each iteration to find a target in O(log n). The tricky part is not the idea but the boundary conditions: closed vs. half-open intervals, how to update lo/hi, and when to use lo < hi vs. lo <= hi. One clean template eliminates all the classic bugs.

Sign in to track your progress

Completed lessons, your XP, level, and streak save to your account — it's free and takes a few seconds.

Explore further

Related lessons

Skip to content