How does a B-tree index work, and when does the database choose not to use it?

A B-tree index stores key values in a balanced tree of sorted nodes, allowing the engine to reach any value in O(log n) page reads instead of scanning every row. The optimizer skips the index when a function or expression wraps the indexed column (unless the index was built on that same expression), or when the query returns such a large fraction of rows that the random I/O of index lookups is estimated to cost more than one sequential full-table scan.

What is a covering index and how does it eliminate heap fetches?

A covering index includes every column a query needs — both filter and select columns — so the engine can answer the query entirely from the index pages without touching the main table heap. This removes the costliest part of an index scan: the random I/O for each individual row fetch.
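One way to see the difference is to ask an engine for its query plan. The sketch below uses Python's built-in sqlite3 module as a stand-in engine; the orders table, its columns, and the index name are hypothetical and exist only for illustration. Plan wording differs between engines and versions, so treat the printed text as indicative.

```python
import sqlite3

# Hypothetical table, used only to illustrate a covering index.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, "
    "status TEXT, total REAL, notes TEXT)"
)
conn.executemany(
    "INSERT INTO orders (customer, status, total, notes) VALUES (?, ?, ?, ?)",
    [(f"c{i % 50}", "paid" if i % 2 else "open", i * 1.5, f"note {i}")
     for i in range(2000)],
)

# Filter columns first (customer, status), then the selected column (total).
conn.execute(
    "CREATE INDEX idx_orders_cust_status_total "
    "ON orders (customer, status, total)"
)

def plan(sql):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

# Every column this query needs is inside the index.
covered = plan(
    "SELECT total FROM orders WHERE customer = 'c7' AND status = 'paid'"
)
# notes is not in the index, so each match needs a fetch from the table.
not_covered = plan(
    "SELECT notes FROM orders WHERE customer = 'c7' AND status = 'paid'"
)

print(covered)
print(not_covered)
```

In SQLite the first plan should report a covering index and the second an ordinary index search followed by row fetches.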

How does columnar storage work, and how does partitioning improve query performance in a data warehouse?

Columnar storage colocates values from the same column on disk, so aggregation queries read only the columns they need rather than full rows, which sharply reduces I/O on wide tables. Partitioning physically separates data into subdirectories (e.g., by date), allowing the query engine to skip every partition that the query's predicate cannot match, cutting scan volume from the full table to just the relevant slice.

What is vectorless retrieval (PageIndex), and when would you use it over a vector database?

Vectorless retrieval skips embeddings entirely: it organizes a document into a hierarchical tree (titles, summaries, page ranges) and the LLM reasons a path down it (root to chapter to section), then reads only the chosen section to answer. It is structure-aware and explainable, but it spends an LLM call at each hop, so it suits a small number of well-structured documents. A vector database is the opposite trade: a single approximate-nearest-neighbor (ANN) lookup that takes milliseconds and scales to millions of chunks, but is flat and blind to document structure. Use vectors for large, messy corpora and speed; use PageIndex for bounded structured docs where the answer is found by reasoning about where it lives; combine them by shortlisting with vectors, then navigating within a document.

File Organization & Indexing — GATE DA

Last lesson handed us a normalized schema and a new worry: the rows still have to live somewhere physical, and the database still has to find the one you asked for fast. So imagine a phonebook with a million names printed in random order. Finding “Sharma” means flipping every single page until you hit it. That is what the database does too, by default — and on a ten-million-row table it is unbearable.

Now imagine the same phonebook with the alphabetical tabs cut into the page edges. You press straight to the S section and skip the rest. That little side-structure — separate from the names themselves, just a map into them — is the whole idea. Disk pages are slow to read, so anything that lets the engine jump to the right page instead of scanning all of them is an enormous win.

A database keeps the rows in a file, and builds one or more indexes on top of that file. The index is the cut tabs; the file is the pages. Picking the right kind of each, for the queries you actually run, is what this lesson is about — and it is the same choice GATE keeps asking, year after year.

Files first — heap or sorted

Before any index exists, your rows already live in a file, laid out one of two ways.

Heap file — rows in insertion order. Inserts are cheap: append to the end. Lookups are not: you scan the whole thing.
Sorted file — rows kept ordered by some column. Lookups use binary search and fly. Inserts hurt: holding the order means shifting rows to make room.

Most real tables sit in heaps and lean on indexes for their speed, precisely because inserts into a sorted file are so costly.
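The lookup gap between the two layouts can be sketched in a few lines of Python. This is a toy model under stated assumptions: a "file" is just an in-memory list of (key, name) rows, and a "probe" stands in for reading one row; the function names are made up for this illustration.

```python
import random

def heap_lookup(rows, key):
    """Heap file: no order to exploit, so scan until the key turns up."""
    probes = 0
    for row in rows:
        probes += 1
        if row[0] == key:
            return row, probes
    return None, probes

def sorted_lookup(rows, key):
    """Sorted file: binary search on the sort column."""
    lo, hi, probes = 0, len(rows), 0
    while lo < hi:
        mid = (lo + hi) // 2
        probes += 1
        if rows[mid][0] < key:
            lo = mid + 1
        else:
            hi = mid
    if lo < len(rows) and rows[lo][0] == key:
        return rows[lo], probes
    return None, probes

keys = list(range(1024))
random.Random(0).shuffle(keys)            # insertion order of the heap
heap_rows = [(k, f"name{k}") for k in keys]
sorted_rows = sorted(heap_rows)           # same rows, ordered by key

print(heap_lookup(heap_rows, 777))        # probes depend on where 777 landed
print(sorted_lookup(sorted_rows, 777))    # at most 10 probes for 1024 rows
```

The sketch shows only the read side. Keeping sorted_rows ordered on every insert would mean shifting rows, which is the cost the paragraph above describes.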

Primary, clustering, secondary — does the file move?

This is the distinction that trips people. Ask two questions of any index.

Does the file itself reorder to follow it?
Is the indexed column the primary key, or some other column?

Primary/clustering reorders the file; secondary indexes point in from the side.

Primary index — built on the primary key, and the file is sorted by it. One per table.
Clustering index — same idea (the file is sorted by it), but the column need not be a key. Still at most one per table.
Secondary index — built on any other column, leaving the data file untouched. You can have as many as you like.

And one more axis, usually paired with primary/clustering indexes:

Dense index — one entry per row. Bigger, but it locates any row directly.
Sparse index — one entry per block (page). Smaller, but once it lands you on a block you scan inside it. A sparse index only works on a sorted file (primary or clustering), since it relies on the order.
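The dense and sparse definitions above can be sketched as follows. It is a simplified model, assuming unique keys, a file already sorted by key, and a made-up block size of 4 rows; the helper names are hypothetical, and the dense index is held in a Python dict purely for brevity.

```python
from bisect import bisect_right

BLOCK_SIZE = 4  # hypothetical rows per block, chosen for illustration

def make_blocks(sorted_rows, block_size=BLOCK_SIZE):
    """Split a sorted file into fixed-size blocks (pages)."""
    return [sorted_rows[i:i + block_size]
            for i in range(0, len(sorted_rows), block_size)]

def build_dense(blocks):
    """Dense index: one entry per row, key -> (block number, slot)."""
    return {row[0]: (b, s)
            for b, block in enumerate(blocks)
            for s, row in enumerate(block)}

def build_sparse(blocks):
    """Sparse index: one entry per block, holding that block's first key."""
    return [block[0][0] for block in blocks]

def dense_lookup(dense, blocks, key):
    loc = dense.get(key)
    if loc is None:
        return None
    b, s = loc
    return blocks[b][s]                   # lands directly on the row

def sparse_lookup(sparse, blocks, key):
    b = bisect_right(sparse, key) - 1     # last block whose first key <= key
    if b < 0:
        return None
    for row in blocks[b]:                 # then scan inside that one block
        if row[0] == key:
            return row
    return None

rows = [(k, f"row{k}") for k in range(0, 40, 2)]   # 20 rows, sorted by key
blocks = make_blocks(rows)
dense = build_dense(blocks)
sparse = build_sparse(blocks)

print(len(dense), len(sparse))            # entries per row vs entries per block
print(dense_lookup(dense, blocks, 14), sparse_lookup(sparse, blocks, 14))
```

The sparse lookup depends on the rows being in key order: if the file were a heap, the first key of a block would say nothing about the other keys in it.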

Hash vs B+-tree — the real exam question

Once you have chosen what to index, you choose how to store the index. Two big families, and the gap between them is the whole question.

Hash buckets give O(1) for equality but cannot scan ranges. B+-tree leaves are sorted and chained — perfect for range scans and ORDER BY.

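Hash index. Hash the key, jump to its bucket, fetch. Expected O(1) for WHERE x = 19. But the buckets sit in no particular order, so for WHERE x BETWEEN 10 AND 30 the engine could at best probe every possible value in the range one at a time, which only works when x comes from a small discrete domain such as integers. In general it falls back to a full scan, as if there were no index. The same goes for ORDER BY x: a hash index cannot hand back rows in key order.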
B+-tree index. A balanced multi-way search tree whose leaves hold the keys in sorted order and are chained left to right. WHERE x = 19 is O(log n). For a range, you descend once to the bottom of the range and then simply walk the leaf chain. ORDER BY x comes free, because the leaves are already sorted. This is why almost every default database index is a B+-tree.
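The contrast can be imitated with two Python stand-ins: a dict plays the hash index, and a sorted list searched with bisect plays the chained leaf level of a B+-tree. Neither is a real on-disk structure, and the sample rows and function names are made up for this sketch.

```python
from bisect import bisect_left, bisect_right

# A small heap file: rows sit in insertion order, position = row id.
rows = [(x, f"row{x}") for x in (42, 7, 19, 30, 10, 25, 3, 15)]

# Hash-style index: key -> row positions, with no ordering among keys.
hash_index = {}
for pos, (x, _) in enumerate(rows):
    hash_index.setdefault(x, []).append(pos)

# B+-tree-style leaf level: (key, position) pairs kept in key order.
sorted_index = sorted((x, pos) for pos, (x, _) in enumerate(rows))
sorted_keys = [x for x, _ in sorted_index]

def hash_eq(x):
    """Equality lookup: one jump to the bucket."""
    return [rows[p] for p in hash_index.get(x, [])]

def tree_range(lo, hi):
    """Range lookup: find the start of the range once, then walk forward."""
    start = bisect_left(sorted_keys, lo)
    end = bisect_right(sorted_keys, hi)
    return [rows[p] for _, p in sorted_index[start:end]]

print(hash_eq(19))
print(tree_range(10, 30))   # rows come back already ordered by x
```

Answering the same range from hash_index alone would mean either trying every candidate key between 10 and 30 or visiting every bucket, which is the limitation described above.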

Worked example — GATE DA 2024 Q45

A query you have seen, and will see again:

SELECT * FROM T WHERE x BETWEEN 10 AND 20 ORDER BY x;

You may build one index on column x. Hash or B+-tree?

A hash index on x answers exact equality, but for BETWEEN 10 AND 20 the engine would have to hash every value from 10 to 20 one by one (possible only if x is an integer) or skip the index entirely. And ORDER BY x cannot use a hash index at all, because buckets carry no order, so the result would still need a separate sort. Hash is the wrong choice here.
A B+-tree on x descends to the leaf holding the first key ≥ 10, then walks right along the sorted leaf chain until it passes 20. The rows arrive already in x order, so the ORDER BY costs nothing extra.

B+-tree wins. Any query that mixes a range or an ordering with the indexed column gives the same verdict. That is the design call GATE DA 2024 Q45 posed — and the same call a senior engineer makes on the job.
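The B+-tree half of this verdict can be checked against a real engine. The sketch below uses Python's built-in sqlite3 module, whose ordinary indexes are B-tree based; SQLite offers no hash index, so only one side of the comparison is shown. The table contents and the index name idx_t_x are assumptions made for the demonstration, and the exact plan wording varies by SQLite version.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T (id INTEGER PRIMARY KEY, x INTEGER, payload TEXT)")
conn.executemany(
    "INSERT INTO T (x, payload) VALUES (?, ?)",
    [((i * 37) % 101, f"p{i}") for i in range(1000)],  # made-up sample data
)

QUERY = "SELECT * FROM T WHERE x BETWEEN 10 AND 20 ORDER BY x"

def plan(sql):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

plan_before = plan(QUERY)                    # no index yet: scan, then sort
conn.execute("CREATE INDEX idx_t_x ON T (x)")
plan_after = plan(QUERY)                     # index handles range and order

xs = [row[1] for row in conn.execute(QUERY)]

print(plan_before)
print(plan_after)
print(xs == sorted(xs))
```

Before the index exists the plan should include a separate sorting step for the ORDER BY. After it, the same query should search the index over the range and need no sort.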

How GATE asks this

A short MCQ or MSQ. The format is “given query X, which index is best?” or “which of these statements about [primary / secondary / dense / sparse] indexes are true?” The decision tree is short — equality only → hash is fine; range or ordering → B+-tree; reorder the file → primary or clustering; a column other than the key → secondary. Walk that tree and you will be right.

In one breath

Rows live in a file (a heap unless you sort it), and an index is a small side-map that spares the engine a full scan; a primary/clustering index reorders the file and there is at most one, while secondary indexes leave the file alone and you may have many. Store the index as a hash when every lookup is exact equality (O(1) but order-blind) and as a B+-tree when any query needs a range or an ORDER BY (O(log n) with sorted, chained leaves) — which is why the B+-tree is the everyday default.

Practice

Quick check

Q1 (Recall): Which statements about indexes are TRUE? (select all that apply)

Q2 (Recall): Which statements about primary vs secondary, and dense vs sparse, are TRUE? (select all that apply)

Q3 (Apply): Your only query is `SELECT * FROM Users WHERE user_id = ?` (always exact equality, never a range). Which index is the best fit?

Q4 (Apply): You can build ONE index on column `salary` of `Employees`. The query is `SELECT * FROM Employees WHERE salary BETWEEN 50000 AND 80000 ORDER BY salary`. Which index type wins?

Q5 (Create): A `Books(book_id PK, author, title, year)` table holds 1 million rows in a heap. The query `SELECT title FROM Books WHERE author = 'Tagore' AND year = 1913` runs a hundred times per second. Which choice is the best fit?

A question to carry forward

So a single, well-designed database can now hold its rows cleanly and find any one of them in a heartbeat. That is the operational world: one tidy schema, serving live reads and writes.

But the moment you want to analyse data rather than merely serve it, the neat picture breaks. The numbers you need are scattered across a dozen such databases, plus spreadsheets, plus log files — each with its own column names, its own date formats, its own idea of what “null” means. Before any of it can be queried together, it has to be pulled in, cleaned, and reshaped into one consistent table. Here is the thread onward: what are the standard moves for turning raw, messy, multi-source records into tidy, analysis-ready data — and which of them does GATE expect you to perform by hand?
