File Organization & Indexing
Heap vs sorted files, primary vs secondary indexes, hash vs B+-tree — the four choices behind every fast lookup, and the one GATE tests.
What you'll learn
- Heap vs sorted files; why most tables sit in heaps until you index them
- Primary/clustering vs secondary indexes — which one reorders the file
- Dense vs sparse: one entry per row vs one per block
- Hash (O(1) equality only) vs B+-tree (O(log n) equality AND range)
Before you start
Imagine a phonebook with a million names in random order. Finding “Sharma” means flipping every page. An index is the alphabetical tab at the edge — a small side structure that says “Sharma starts on page 743.” Disk pages are slow to read, so anything that lets you jump straight to the right page is a massive win.
Databases keep the rows themselves in a file, and then build one or more indexes on top. Picking the right kind for the queries you actually run is what this lesson is about — and it’s the same choice GATE keeps asking.
Files first — heap or sorted
Before any index exists, your rows live in a file. Two basic layouts:
- Heap file — rows in insertion order. Inserts are cheap (append to the end). Lookups scan the whole thing.
- Sorted file — rows kept ordered by some column. Lookups use binary search. Inserts are painful: keeping order means shifting rows around.
Most real tables sit in heaps and lean on indexes for speed.
Primary, clustering, secondary — does the file move?
This is the bit that confuses people. Two questions to ask of any index:
- Does the file itself reorder around it?
- Is the indexed column the primary key, or some other column?
- Primary index — built on the primary key, and the file is sorted by it. One per table.
- Clustering index — same idea (file sorted by it), but the column doesn’t have to be a key. Still at most one per table.
- Secondary index — built on any other column. The data file is left alone. You can have as many as you like.
And one more axis, mostly mentioned with primary/clustering indexes:
- Dense index — one entry per row. Bigger, but you can find any row directly.
- Sparse index — one entry per block (page). Smaller, but you have to scan inside the block. Only works on a sorted file (primary or clustering).
Hash vs B+-tree — the real exam question
Once you’ve chosen what to index, you choose how to store the index. Two big families:
- Hash index. Hash the key, jump to a bucket, fetch. Expected O(1) for
WHERE x = 19. But buckets are not in any order, so forWHERE x BETWEEN 10 AND 30you would have to probe every value in that range one by one — that is no better than no index at all. Same forORDER BY x. - B+-tree index. A balanced multi-way search tree whose leaves hold the keys in sorted order and are chained left-to-right.
WHERE x = 19is O(log n). For a range, you descend once to the bottom of the range and then walk the leaf chain.ORDER BY xbecomes free — the leaves are already sorted. This is why almost every default database index is a B+-tree.
Worked example — GATE DA 2024 Q45
A query you have seen and will see again:
SELECT * FROM T WHERE x BETWEEN 10 AND 20 ORDER BY x;
You can build one index on column x. Hash or B+-tree?
- A hash index on
xlets you probe equality, but forBETWEEN 10 AND 20the engine would have to either hash every value from 10 to 20 (ifxis integer) or, more generally, skip the index entirely. AndORDER BY xcannot use a hash index — buckets have no order. So hash is useless here. - A B+-tree index on
xdescends to the leaf containing 10, then walks right through 11, 12, …, 20 along the sorted leaf chain. Each value’s row is read inxorder, so theORDER BYadds no work at all.
B+-tree wins. Any query mixing range or ordering with the indexed column gives the same answer. This is the design call GATE DA 2024 Q45 asked about — and it’s the design call every senior engineer makes on the job, too.
How GATE asks this
A short MCQ or MSQ. Format: “given query X, which index is best?” or “which of the following statements about [primary / secondary / dense / sparse] indexes are true?” The decision tree is short — equality only → hash is fine; range or ordering → B+-tree; reordering the file → primary or clustering; column other than the key → secondary. Walk that tree and you’ll be right.
Quick check
Quick check
Practice this in an interview
All questionsA B-tree index stores key values in a balanced tree of sorted nodes, allowing the engine to reach any value in O(log n) page reads instead of scanning every row. The optimizer skips the index when the estimated cost of random I/O exceeds a full-table scan, when a function wraps the indexed column, or when the query returns such a large fraction of rows that a sequential scan is cheaper.
A covering index includes every column a query needs — both filter and select columns — so the engine can answer the query entirely from the index pages without touching the main table heap. This removes the costliest part of an index scan: the random I/O for each individual row fetch.
Columnar storage colocates values from the same column on disk, so aggregation queries read only the columns they need rather than full rows — dramatically reducing I/O on wide tables. Partitioning physically separates data into subdirectories (e.g., by date), allowing the query engine to skip entire partitions whose predicate cannot match, cutting scan volume from the full table to just the relevant slice.
A sargable predicate (Search ARGument ABLE) is one the engine can evaluate using an index seek — a direct traversal to the matching key range. Predicates become non-sargable when a function or implicit cast is applied to the indexed column, forcing the engine to compute a derived value for every row before comparing.